A FRAMEWORK · OPERATORAI

The 99 Rule

The CRO industry tests at 95%. I test at 99%. Here's why, and what it costs you when an agency calls a winner before the data does.

What's actually wrong with 95%

The 95% threshold isn't a CRO standard. It's a 1925 statistical convention from R.A. Fisher's agricultural experiments, transplanted into web testing because someone needed a number on the slider.

It made sense for crop-yield trials with large sample sizes and slow consequences. It does not make sense for a CRO programme where:

A single agency ships 30+ tests a quarter across multiple accounts
The buyer reads 'winner' and immediately rolls it out, sometimes to seven-figure traffic
A false positive doesn't just fail to lift revenue, it actively reverses revenue trends, often invisibly

If you ship 30 tests a quarter at 95% confidence, you should expect 1 to 2 false positives rolled out as winners (30 × 5% = 1.5, in the worst case where none of the variants has a real effect). At 99% confidence, you expect 0 to 1 (30 × 1% = 0.3). Across a year, that's the difference between 6 quietly broken winners on your site and 1.
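The arithmetic behind those figures, as a minimal sketch. It assumes the worst case, where no tested variant has a real effect, so every declared winner is a false positive; with genuine winners in the mix the expected counts come out lower. The function name is illustrative only.

```python
def expected_false_positives(n_tests: int, confidence: float) -> float:
    """Expected number of false 'winners' if no variant truly differs from control."""
    alpha = 1.0 - confidence  # chance that a no-effect test still crosses the threshold
    return n_tests * alpha


# 30 tests a quarter, 120 a year, at both thresholds.
for label, n_tests in (("quarter", 30), ("year", 120)):
    at_95 = expected_false_positives(n_tests, 0.95)
    at_99 = expected_false_positives(n_tests, 0.99)
    print(f"{label}: {at_95:.1f} at 95%, {at_99:.1f} at 99%")
# quarter: 1.5 at 95%, 0.3 at 99%
# year: 6.0 at 95%, 1.2 at 99%
```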

I ran the maths on my first 100 tests, 2017-2019. Of every ten 'winners' called at 95%, three either reverted to control within 90 days or showed no measurable revenue impact when isolated. I changed the rule.

How the 99 Rule changes my test design

Three concrete differences, day-to-day:

1. Sample size up, test count down.

At 99% with the same baseline conversion rate and minimum detectable effect, you need ~60% more visitors per arm. Most agencies hide this from buyers by running underpowered tests. I don't. I tell you upfront what sample size each test needs and how long that takes at your traffic level. If a test needs 14,000 visitors per variant and each variant gets 4,000 visitors per week, that's a four-week test, not a one-week pilot. I document it. You sign off before we start.
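A rough way to check the sample-size jump yourself, using the standard normal-approximation formula for comparing two conversion rates. The page does not state the power or sidedness used, so this sketch assumes 80% power and a one-sided test, which gives roughly 60% more visitors per arm; a two-sided test at the same power gives closer to 50%. The 3% baseline and 10% relative lift are hypothetical inputs, and the duration helper reads the 4,000 per week as traffic reaching each variant.

```python
from math import ceil
from statistics import NormalDist


def visitors_per_arm(baseline: float, relative_mde: float, confidence: float,
                     power: float = 0.80, two_sided: bool = False) -> int:
    """Approximate visitors needed per variant (normal approximation)."""
    z = NormalDist().inv_cdf
    alpha = 1.0 - confidence
    z_alpha = z(1.0 - alpha / 2.0) if two_sided else z(1.0 - alpha)
    z_beta = z(power)
    p1 = baseline
    p2 = baseline * (1.0 + relative_mde)  # conversion rate if the lift is real
    variance = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def weeks_needed(per_arm_target: int, weekly_visitors_per_arm: int) -> int:
    """Whole weeks required to reach the per-variant target."""
    return ceil(per_arm_target / weekly_visitors_per_arm)


# Hypothetical account: 3% baseline conversion, 10% relative lift to detect.
n95 = visitors_per_arm(0.03, 0.10, confidence=0.95)
n99 = visitors_per_arm(0.03, 0.10, confidence=0.99)
print(f"99% needs {n99 / n95 - 1:.0%} more visitors per arm than 95%")

# The example from the text: 14,000 per variant at 4,000 per variant per week.
print(weeks_needed(14_000, 4_000))  # 4
```

The ratio between the two thresholds does not depend on the baseline or the lift, only on the confidence levels, the power, and whether the test is one- or two-sided.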

2. Stopping rules are written before the test ships.

'Peeking' (checking the test mid-run and stopping the moment it crosses 95%) inflates false-positive rates dramatically. Some studies put real false-positive rates from peeked-at 95% tests at 25%+. I don't peek. Each test ships with a pre-registered sample-size target. The dashboard does not show 'winner declared' until the target is hit.
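The inflation is easy to reproduce in simulation. The sketch below runs A/A tests (both arms share the same true conversion rate, so any significant result is a false positive) and compares two policies: checking once at the pre-registered sample size, versus checking at ten evenly spaced looks and stopping at the first one that crosses 95%. Every parameter here (10% conversion rate, 2,000 visitors per arm, ten looks, 500 simulated tests) is an illustrative assumption, not a figure from a real account.

```python
import random
from math import sqrt
from statistics import NormalDist


def z_score(conv_a: int, conv_b: int, n: int) -> float:
    """Two-proportion z statistic for two arms of equal size n."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return 0.0
    se = sqrt(2 * pooled * (1 - pooled) / n)
    return (conv_b / n - conv_a / n) / se


def false_positive_rate(peek: bool, runs: int = 500, n_per_arm: int = 2000,
                        looks: int = 10, rate: float = 0.10, seed: int = 1) -> float:
    """Share of A/A tests that end with a 'significant' difference at 95%."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(0.975)  # two-sided 95% threshold
    step = n_per_arm // looks
    hits = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        for look in range(1, looks + 1):
            # Both arms draw from the same true rate: there is no real effect.
            conv_a += sum(rng.random() < rate for _ in range(step))
            conv_b += sum(rng.random() < rate for _ in range(step))
            may_check = peek or look == looks  # fixed horizon checks only at the end
            if may_check and abs(z_score(conv_a, conv_b, look * step)) > z_crit:
                hits += 1
                break  # the test is stopped and a result is declared
    return hits / runs


fixed = false_positive_rate(peek=False)
peeked = false_positive_rate(peek=True)
print(f"fixed horizon: {fixed:.1%}   with peeking: {peeked:.1%}")
```

The fixed-horizon rate should land near the nominal 5%, while the peeking rate typically comes out several times higher, and it climbs further as the number of looks increases.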

3. Failure is logged with the same rigour as success.

A losing test isn't a wasted experiment. It's the strongest evidence I have about what doesn't work for your audience. Every test, whether winner, loser, or inconclusive, gets a one-line entry in your account's experiment log. Over 12 months that log becomes the most valuable artefact in your CRO programme. It's why my retainer clients keep ratcheting their hit rate up over time. The log compounds.

What the 99 Rule costs you

Speed. That's the only honest cost.

A test that would have called a winner at 95% in 12 days takes me 19 days. A four-week sprint becomes a six-week sprint. Quarterly throughput is roughly 30 experiments instead of 50.

The trade looks bad if you measure CRO programmes by tests shipped (most do; it's the easiest metric for an agency to brag about). It looks excellent if you measure by revenue actually lifted, sustained, and rolled forward into next quarter's baseline.

I measure by the second one. Every monthly revenue-impact report shows the £/month each shipped winner is currently generating, three months after roll-out. Winners that don't sustain get flagged and re-tested. None of this works at 95%. The discipline starts with the threshold.

When I break my own rule

Three exceptions, declared up front:

1. Diagnostic tests.

When I'm trying to identify which segment is driving an anomaly (e.g. mobile vs desktop, paid vs organic), I run at 90% confidence because the goal is direction-finding, not winner-calling. The result feeds into the next high-confidence test. I label these clearly in your dashboard.

2. Catastrophic-loser detection.

If a variant is losing badly inside the first 1,000 sessions per arm, I'll call it dead before the sample-size target is reached. The asymmetry is in the cost of being wrong: killing a variant early forfeits, at worst, a win I never shipped, while calling a winner early puts a possible false positive into production. I only break the rule downward.
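One way to sketch the early-kill check is a one-sided two-proportion z-test that can only ever fire in the losing direction. The function name and the example counts are hypothetical, and the 99% default simply mirrors the threshold used on this page.

```python
from math import sqrt
from statistics import NormalDist


def is_catastrophic_loser(control_conv: int, control_n: int,
                          variant_conv: int, variant_n: int,
                          confidence: float = 0.99) -> bool:
    """True if the variant converts worse than control at the given confidence."""
    p_control = control_conv / control_n
    p_variant = variant_conv / variant_n
    pooled = (control_conv + variant_conv) / (control_n + variant_n)
    se = sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / variant_n))
    if se == 0:
        return False  # no conversions (or all conversions) in both arms: nothing to test
    z = (p_variant - p_control) / se
    p_value = NormalDist().cdf(z)  # one-sided: small only when the variant is far below control
    return p_value < 1 - confidence


# Hypothetical counts at 1,000 sessions per arm.
print(is_catastrophic_loser(50, 1000, 22, 1000))  # True: 5.0% vs 2.2%
print(is_catastrophic_loser(50, 1000, 42, 1000))  # False: 5.0% vs 4.2%
```

Because the p-value is taken from the lower tail only, a variant that is ahead of control can never trigger this check, however large its lead.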

3. Holiday windows.

Black Friday, Cyber Monday and Boxing Day compress the calendar. I'll run pre-tested winners (already cleared 99% in earlier traffic) through these windows but not call new winners during them. The traffic is non-representative; confidence in this period's data shouldn't propagate forward. I caveat the report.

How to verify your current agency tests at 99%

Three questions that flush out the answer in 30 seconds:

'What confidence threshold do your tests run at by default?' If they say 95%, or worse, 'we look at trend after a week', you have a peeking problem.
'How many of last quarter's winners would still be winners at 99% confidence?' A good agency will know. A bad one will say 'all of them' without checking.
'What's your sample-size pre-registration policy?' Pre-registered = test plans are locked before launch. Post-registered = the agency is fitting the test to whatever result happens. The first is science. The second is theatre.

If your agency can't answer all three within 30 seconds, the 99 Rule isn't being applied. Whatever lift number they're showing you, divide it by two before believing it.

The 99 Rule isn't the whole methodology

It's one of three named components of OperatorAI's testing discipline:

The 99 Rule: statistical significance threshold (this page).
The Evidence Stack: what counts as evidence and in what order.
The 4-to-34 Gap: why operator-led AI CRO outperforms self-serve AI by 4-7×.

Each compounds on the others. The 99 Rule on its own slows you down. The 99 Rule + The Evidence Stack + The 4-to-34 Gap together are why my average client lifts conversion 28-34% across a year, with shipped winners that hold up six months later instead of fading.