FRAMEWORK

The 99 Rule: Why GoGoChimp Tests at 99% Statistical Significance

The 99 Rule is GoGoChimp's discipline of declaring CRO winners at 99% statistical significance, cutting the false-positive rate from 1-in-20 to 1-in-100.

One sentence: The 99 Rule is GoGoChimp's CRO testing standard. Declare A/B winners at 99% statistical significance, not the industry-default 95%, because the false-positive rate drops from 1-in-20 to 1-in-100 and a programme running 30+ tests per quarter cannot afford the difference.

I named it because it needs a name. The discipline is older than the brand, the maths is older than CRO itself, but most agencies still ship at 95% because the software defaults to it and the wins read faster on the dashboard. The 99 Rule is what I do instead.

"Tests at 99% statistical significance (stricter than the 95% most agencies use)", GoGoChimp methodology principle, codified across thirteen years of operator engagements

This page is the standalone reference for The 99 Rule. It also lives inside The Evidence Stack as Layer 3, the stopping rule. It earns its own page because the discipline is specific enough, defensible enough, and contrarian enough to deserve one.

The maths in plain English

A p-value is the probability you'd see a result at least as strong as the one you saw if the variant actually did nothing. A threshold of p = 0.05 (which is what 95% significance means) is shorthand for: "if this variant is genuinely identical to the control, there's a 5% chance the dashboard would still show me a result this strong, by accident."

5% is 1-in-20. That's the false-positive rate baked into 95% significance.

P = 0.01 (which is 99% significance) is 1-in-100. Same logic, stricter threshold. Run 100 tests on variants that genuinely do nothing at 99% confidence and the maths expects one false positive in the batch. Run the same 100 tests at 95% confidence and the maths expects five.

Concrete example. A Shopify product-page test runs for fourteen days. The control converts at 3.10%. The variant converts at 3.31%. The statistical engine (a Frequentist two-proportion Wald test, the textbook method behind Frequentist reporting in A/B testing tools) returns p = 0.04. At 95% significance that's a winner. At 99%, it isn't, and the test keeps running until the data settles one way or the other.
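Here is a minimal sketch of that calculation in Python, using only the standard library. The conversion rates come from the example above. The traffic figure of 60,000 visitors per arm is an assumption chosen so the numbers land near p = 0.04, and the function name is illustrative rather than any platform's API.

```python
from math import erfc, sqrt


def two_proportion_wald(conv_a, n_a, conv_b, n_b):
    """Two-sided Wald z-test for a difference in conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Unpooled (Wald) standard error of the difference in rates.
    se = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z = (rate_b - rate_a) / se
    # Two-sided p-value from the standard normal: P(|Z| >= |z|).
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


# ASSUMPTION: 60,000 visitors per arm (the example does not state traffic).
# 1,860 / 60,000 = 3.10% control, 1,986 / 60,000 = 3.31% variant.
z, p_value = two_proportion_wald(1860, 60_000, 1986, 60_000)

print(f"z = {z:.2f}, p = {p_value:.3f}")
print("winner at 95%:", p_value < 0.05)
print("winner at 99%:", p_value < 0.01)
```

Under that assumed traffic the p-value sits between the two thresholds, so the same data is a winner at 95% and still undecided at 99%.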

The Bayesian equivalent (which AB Tasty and some VWO setups use by default) reports a "probability to beat control." 95% probability-to-beat is the rough Bayesian analogue of 95% Frequentist significance. The two methods don't return identical numbers on identical data, but the directional logic is the same. The 99 Rule applies at the Bayesian threshold of 99% probability-to-beat, not the conventional 95%.
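A rough sketch of how a "probability to beat control" figure can be produced, assuming a simple Beta-Binomial model with a uniform prior and Monte Carlo sampling. This is a generic illustration, not the algorithm any particular platform runs, and the 60,000 visitors per arm is an assumed figure that the example above does not state.

```python
import random


def probability_to_beat(conv_a, n_a, conv_b, n_b, draws=50_000, seed=0):
    """Estimate P(variant rate > control rate) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each arm: Beta(1 + conversions, 1 + non-conversions).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# ASSUMPTION: 60,000 visitors per arm, 3.10% control vs 3.31% variant.
prob = probability_to_beat(1860, 60_000, 1986, 60_000)

print(f"probability to beat control: {prob:.3f}")
print("clears 95% threshold:", prob >= 0.95)
print("clears 99% threshold:", prob >= 0.99)
```

On these assumed counts the estimate clears a 95% probability-to-beat threshold but not a 99% one, which mirrors the Frequentist reading of the same data.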

The point isn't which method you run. The point is where you draw the line.

Why most agencies stop at 95%

The 95% threshold isn't arbitrary. It's a hundred-year-old academic convention, popularised in Ronald Fisher's Statistical Methods for Research Workers (1925), which became the default p-value cutoff for life-science publication and never really left. Every undergraduate stats course teaches it. Every CRO platform ships with it pre-selected in the dashboard.

There are three reasons agencies don't push past it.

Software defaults. VWO, Convert, Optimizely, and AB Tasty all default to 95%. Some don't let the operator change the threshold without dropping into advanced settings. If the agency's process is "use what the tool says," the threshold is 95% by inheritance, not by argument.

Faster wins, faster invoicing. A 95% test reaches significance with roughly 60-70% of the sample size a 99% test needs. Faster reads mean more tests shipped per month, which means more "wins" in the monthly client report. The optics favour 95%. The reality is that some of those wins are noise, but noise doesn't show up in the monthly report. It shows up six months later when the deployed variant fails to replicate.

Engagement size hides the cost. A single-test engagement at 95% carries up to a 5% chance of shipping a false positive. That's not great but it's survivable. A 30-test programme at 95% expects as many as 1.5 false positives. A 120-test annual programme expects as many as six. (Those are worst-case counts: the 5% applies to variants that genuinely do nothing.) The compounding only bites once the programme clears roughly 20 tests per quarter, which is the volume threshold below which 95% looks fine and above which it bleeds revenue invisibly.

The 95% convention isn't villainy. It's a sensible default for low-volume testing that becomes structurally inadequate at portfolio scale. The 99 Rule is the structural fix.

What 99% costs you in volume

Stricter significance has a price. The two-sided critical values at 99% and 95% confidence (z = 2.576 vs z = 1.96) mean required sample size scales by roughly 1.7× on the significance term alone, since (2.576 / 1.96)² ≈ 1.73, at the same minimum detectable effect. Once a conventional 80% power term is included, the multiplier is closer to 1.5×. In GoGoChimp's client work, that translates to test durations 30-50% longer at 99% than at 95%. A test that would have called a winner in eighteen days at 95% calls it in roughly twenty-four to twenty-seven days at 99%.
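The scaling can be checked with a short sketch using Python's standard library. It assumes a two-sided test, the standard normal-approximation formula for comparing two proportions, and 80% power (a common convention, not a figure from this page). The 3.10% and 3.31% rates are borrowed from the earlier product-page example.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def critical_z(alpha):
    """Two-sided critical value, e.g. alpha=0.05 -> about 1.96."""
    return STD_NORMAL.inv_cdf(1 - alpha / 2)


def sample_size_per_arm(base_rate, variant_rate, alpha, power=0.80):
    """Approximate visitors needed per arm to detect the given difference."""
    z_alpha = critical_z(alpha)
    z_power = STD_NORMAL.inv_cdf(power)  # ASSUMPTION: 80% power by default
    variance = base_rate * (1 - base_rate) + variant_rate * (1 - variant_rate)
    return (z_alpha + z_power) ** 2 * variance / (variant_rate - base_rate) ** 2


# Ratio from the critical values alone (no power term).
critical_only_ratio = (critical_z(0.01) / critical_z(0.05)) ** 2

# Ratio once 80% power is included in both calculations.
n_95 = sample_size_per_arm(0.0310, 0.0331, alpha=0.05)
n_99 = sample_size_per_arm(0.0310, 0.0331, alpha=0.01)
with_power_ratio = n_99 / n_95

print(f"critical values only: {critical_only_ratio:.2f}x")
print(f"with 80% power:       {with_power_ratio:.2f}x")
print(f"visitors per arm at 95%: {n_95:,.0f}")
print(f"visitors per arm at 99%: {n_99:,.0f}")
```

The second ratio is the one a sample-size calculator will show, which is why observed test durations stretch by less than the raw critical-value ratio suggests.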

That's an acceptable trade if and only if test volume is high enough that the false-positive savings compound. On a single-test engagement, 95% is fine. The maths favours speed because there's no portfolio to protect.

On a 30+ tests-per-quarter programme (GoGoChimp's Growth and Scale tiers), the maths inverts. Run 120 tests per year at 95% and you can expect to deploy as many as six false positives. Each one is a deployed change that the dashboard rewarded but the underlying behaviour didn't actually support. Each one erodes downstream revenue when its supposed lift evaporates, and each one contaminates the next hypothesis because the operator is now reasoning from a "winner" that wasn't real.

Run the same 120 tests at 99% and the expected false-positive count drops to roughly one. Five fewer ghosts in the deployment graph. Five fewer corrupted baselines for the next test to fight against.
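The portfolio arithmetic in this section can be reproduced in a few lines. The sketch assumes the worst case, where every tested variant genuinely does nothing and the tests are independent, so the counts are upper bounds rather than forecasts. The function names are illustrative.

```python
def expected_false_positives(tests, alpha):
    """Expected number of null variants declared winners."""
    return tests * alpha


def chance_of_at_least_one(tests, alpha):
    """Probability that at least one null variant is declared a winner."""
    return 1 - (1 - alpha) ** tests


for tests in (1, 30, 120):
    for label, alpha in (("95%", 0.05), ("99%", 0.01)):
        expected = expected_false_positives(tests, alpha)
        at_least_one = chance_of_at_least_one(tests, alpha)
        print(
            f"{tests:>3} tests at {label}: "
            f"expected false positives = {expected:.1f}, "
            f"chance of at least one = {at_least_one:.0%}"
        )
```

The second column is the less obvious one: under these assumptions, even a single quarter of 30 tests at 95% is more likely than not to ship at least one false winner.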

The Slow Reveal cost is the additional 30-50% test duration. The Slow Reveal benefit is that the wins you do declare are real. Across a year of programme work, this is the difference between a portfolio that compounds and a portfolio that drifts sideways while everyone is busy.

Enzymedica: three compounded wins, all declared at 99%

Enzymedica UK ran three compounded CRO wins across the 2021 Black Friday engagement. The baseline conversion rate was 3.4%. Black Friday 2021 closed at 16.9% (a 4.97× lift on baseline, and a 2.4× lift on the prior year's Black Friday at the same promo intensity). December 2021, the worst month of the year for health supplements, sustained at 11%. Loom analytics review on file.

All three winners were declared at 99% significance. None of them was the fastest test in the engagement. All three replicated when deployed and stayed replicated through the December trough, which is the harder test. A Black Friday spike on heavy promo traffic is one thing. The same conversion behaviour holding up six weeks later on cold December traffic is another.

That's the receipt for the 99 Rule on a single engagement. A 2.4× lift on the same promotional day, same product catalogue, year over year. An 11% sustained rate through a month that historically punishes supplement brands. The discipline produces wins you can trust six months later, which is the only kind of win worth deploying.

When NOT to use the 99 Rule

The 99 Rule isn't a universal good. It has structural costs and there are situations where 95% (or even 90%) is the better operator call.

Low-volume engagements. If the engagement is a single landing-page test or a one-off Sprint, the maths favours speed. The false-positive cost of one test is one test. Use 95%.

Exploratory tests where direction matters more than magnitude. When the test question is "does this hypothesis even point in the right direction?" rather than "by how much?", 90% significance is the right operator call. The point is signal direction, not deployment. Ship the next test based on what this one taught you, don't ship the variant itself.

Cost-of-missing-a-small-winner exceeds cost-of-false-positives. Edge case, but it happens. A high-traffic page where a 1% lift would be worth £200K/year and false positives are cheap to revert. The maths here may favour the lower threshold. The 99 Rule is calibrated for the typical Growth or Scale engagement, not every test in every context.

The discipline is a tool, not a religion. Knowing when not to use it is part of being the operator who decides.

How to implement The 99 Rule in your testing programme

Three first steps for any operator wanting to move their programme to 99% confidence:

  • Change your tool's default threshold. In VWO, Convert, Optimizely, or AB Tasty, locate the statistical significance setting (Frequentist) or the probability-to-beat-control threshold (Bayesian). Set it to 99% / 0.01. Save as the workspace default, not just per-test.
  • Rerun your sample-size calculator with the new threshold. Required samples rise by roughly 1.5× to 1.7×. Brief stakeholders that test durations are increasing by 30-50%. The wins they ship will be more trustworthy, but the calendar moves slower.
  • Audit your last twelve months of deployed wins. Pull every deployed variant from the last year. For each one, ask: did the underlying behaviour actually shift the way the test predicted? The variants where the post-deployment data doesn't match the test data are your candidate false positives. Estimate how much they've cost you. Bring that number to the next stakeholder meeting.
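The audit in the third step can start as something as small as the sketch below. Everything in it is hypothetical: the variant names, the lift figures, and the rule of thumb that flags a win when post-deployment lift retains less than half of the lift measured in the test. Treat the cutoff as a starting assumption to tune, not a standard.

```python
from dataclasses import dataclass


@dataclass
class DeployedWin:
    name: str
    test_lift: float  # relative lift measured during the test (0.07 = +7%)
    post_lift: float  # relative lift observed after deployment


def candidate_false_positives(wins, retained_fraction=0.5):
    """Flag wins whose post-deployment lift kept less than the given
    fraction of the lift the test predicted (hypothetical rule of thumb)."""
    return [w for w in wins if w.post_lift < retained_fraction * w.test_lift]


# HYPOTHETICAL data for illustration only.
wins = [
    DeployedWin("pdp-trust-badges", test_lift=0.068, post_lift=0.061),
    DeployedWin("cart-free-shipping-bar", test_lift=0.052, post_lift=0.004),
    DeployedWin("checkout-copy", test_lift=0.031, post_lift=-0.010),
]

flagged = candidate_false_positives(wins)
for win in flagged:
    print(f"{win.name}: test {win.test_lift:+.1%}, post {win.post_lift:+.1%}")
```

A flagged variant is a candidate, not a verdict. Seasonality, traffic mix, and later changes to the page can all move post-deployment numbers, so each flag is a prompt to look closer before estimating cost.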

© 2026 GoGoChimp. All rights reserved. Call: 0141 463 6875 - Address: 8 Cheviot Drive, Newton Mearns, Glasgow, G77 5AS