FRAMEWORK

The Evidence Stack: GoGoChimp's CRO Testing Discipline

The Evidence Stack is GoGoChimp's four-layer CRO testing discipline: operator-set hypothesis priority, sample-size discipline, the 99 Rule, and failure-as-information. From the Glasgow operator with 13 years' experience, behind Enzymedica's 3.4% to 16.9% Black Friday conversion rate. Book a free AI audit.

What is The Evidence Stack?

The Evidence Stack is the four-layer testing discipline I run on every GoGoChimp engagement. Operator-set hypothesis priority. Sample-size discipline. The 99 Rule. Failure-as-information. Each layer protects against a specific failure mode. Together they compound into the 28-34% expert-guided AI lift documented in Build Grow Scale's 2026 review of 347 e-commerce stores.

It's named because vocabulary matters. When a buyer asks "what does GoGoChimp actually do differently?" the honest answer isn't "we run A/B tests." Every agency runs A/B tests. The answer is The Evidence Stack: a documented protocol that gates four specific operator decisions on every test, on every client, every time. The naming is what makes it teachable, auditable, and defensible.

"The AI is not the differentiator. The operator is. The Evidence Stack is how the operator earns the differential."

Layer 1: Operator-Set Hypothesis Priority

I pick what to test. Not the AI tool's suggestion engine. Not the dashboard's "test ideas" tab. Not the loudest opinion in the planning meeting. A trained operator with 13 years of pattern recognition, anchored on a documented audience hypothesis (who is buying and why, not "what's a best practice"), decides which test runs next.

Every test starts in the same format: "If we change X, the conversion rate for audience Y will move because of mechanism Z." No hypothesis, no test. Hypotheses are ranked by expected revenue impact, not by implementation ease. A 30-minute button test on a £200K monthly funnel beats a three-day pricing-page rebuild on a £20K funnel, every time.
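As a rough illustration of that ranking rule, here is a minimal Python sketch. The `Hypothesis` class, its field names, and the 5% lift estimates are hypothetical, chosen only to mirror the button-versus-pricing-page example. The one design choice that matters is that effort is recorded but never used as a sort key.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    change: str                    # X: what we change
    audience: str                  # Y: who the change is for
    mechanism: str                 # Z: why conversion should move
    monthly_revenue: float         # revenue through the funnel, in GBP
    expected_relative_lift: float  # operator's estimate (assumption), 0.05 = +5%
    effort_days: float             # logged for planning, not used for ranking

    def expected_monthly_impact(self) -> float:
        return self.monthly_revenue * self.expected_relative_lift

    def statement(self) -> str:
        return (f"If we change {self.change}, the conversion rate for "
                f"{self.audience} will move because {self.mechanism}.")


def rank(backlog):
    """Order hypotheses by expected revenue impact, ignoring implementation ease."""
    return sorted(backlog, key=lambda h: h.expected_monthly_impact(), reverse=True)


# Hypothetical backlog: the lift estimates are illustrative, not measured.
button = Hypothesis("the checkout button copy", "returning buyers",
                    "the current copy hides the delivery promise",
                    monthly_revenue=200_000, expected_relative_lift=0.05,
                    effort_days=0.1)
pricing = Hypothesis("the pricing-page layout", "first-time visitors",
                     "the plan comparison is hard to scan",
                     monthly_revenue=20_000, expected_relative_lift=0.05,
                     effort_days=3)

ranked = rank([pricing, button])
for h in ranked:
    print(f"£{h.expected_monthly_impact():,.0f}/month: {h.statement()}")
```

With equal lift estimates, the button test on the larger funnel ranks first because the funnel's revenue dominates the expected impact.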

This is the layer that creates the 4-to-34 differential. Build Grow Scale's 2026 research across 347 stores found that expert-guided programmes hit 28-34% average lift while self-serve AI tools cap at 4-7%. Same software, in some cases the same VWO or Optimizely account. The structural difference is hypothesis quality, and hypothesis quality is what Layer 1 protects.

DIY AI tools test the surface because that's where their training data lives: hero-image variations, headline tweaks, CTA-colour swaps. Real test categories, real wins, but capped at 4-7% because surface tests have a surface ceiling. Structural tests (audience segmentation, multi-step funnel restructure, customer-interview-derived hypotheses) live in the 28-34% band and require operator judgement to identify. Layer 1 is the gate.

Layer 2: Sample-Size Discipline

Before any test launches, the minimum sample size is computed. Baseline conversion rate, minimum detectable effect, statistical power (typically 80%), confidence threshold (99%, per Layer 3). The number that comes out is the test's gate. The test cannot be read until that gate is hit. No exceptions. No "let's just peek." No "the trend looks strong, let's call it."
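One common way to compute that gate is the normal-approximation formula for a two-sided, two-proportion test. The sketch below assumes that formula; the function name `min_sample_size` and the 20% relative minimum detectable effect are illustrative choices, and a platform's own calculator may use a slightly different method and return a slightly different number.

```python
from math import ceil, sqrt
from statistics import NormalDist


def min_sample_size(baseline, relative_mde, power=0.80, confidence=0.99):
    """Visitors needed per variant for a two-sided, two-proportion z-test.

    baseline      control conversion rate, e.g. 0.034 for 3.4%
    relative_mde  smallest relative lift worth detecting, e.g. 0.20 for +20%
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# Illustrative inputs: a 3.4% baseline and a +20% relative MDE (assumption).
gate_99 = min_sample_size(baseline=0.034, relative_mde=0.20)
gate_95 = min_sample_size(baseline=0.034, relative_mde=0.20, confidence=0.95)
print(f"Per-variant gate at 99%: {gate_99:,}")
print(f"Per-variant gate at 95%: {gate_95:,}")
print(f"Extra traffic needed at 99%: {gate_99 / gate_95 - 1:.0%}")
```

The number returned for the chosen threshold is the one to write on the test brief before launch.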

This kills the failure mode that costs the most revenue: shipping false positives. A dashboard at day three showing variant B beating control by 12% with a 78% probability-to-beat looks like a winner. It is not a winner. It is an under-powered read of a small sample, and the false-positive rate on tests called early can climb above 26% even when the dashboard claims 95% confidence (Miller, 2010). Stopping early is how programmes ship noise as truth and watch the baseline drift.
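The inflation from peeking can be reproduced with a small A/A simulation: both arms share the same true conversion rate, so every "winner" is a false positive by construction. This is a sketch under assumed parameters (20 peeks, 100 visitors per arm between peeks, a 10% conversion rate) and is not a reproduction of Miller's own calculation. The exact rates depend on how often the dashboard is checked.

```python
import random
from math import sqrt
from statistics import NormalDist


def significant(conv_a, conv_b, n, alpha=0.05):
    """Two-sided pooled z-test for two arms with n visitors each."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return False
    z = abs(conv_a - conv_b) / n / se
    return 2 * (1 - NormalDist().cdf(z)) < alpha


def simulate_aa(runs=500, peeks=20, batch=100, rate=0.10, seed=1):
    """Return (share of runs called at any peek, share called at the final read)."""
    rng = random.Random(seed)
    early_calls = final_calls = 0
    for _run in range(runs):
        conv_a = conv_b = n = 0
        called_at_some_peek = False
        for _peek in range(peeks):
            # Both arms convert at the same true rate: there is no real winner.
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            n += batch
            if significant(conv_a, conv_b, n):
                called_at_some_peek = True
        early_calls += called_at_some_peek
        final_calls += significant(conv_a, conv_b, n)
    return early_calls / runs, final_calls / runs


peek_rate, final_rate = simulate_aa()
print(f"False positives when stopping at the first significant peek: {peek_rate:.1%}")
print(f"False positives when reading once at the planned sample size: {final_rate:.1%}")
```

The single read at the planned sample size should stay near the nominal 5%, while the stop-at-first-significance rule lands well above it.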

For low-traffic clients (under £100K/month revenue, low session volume) the protocol changes but does not relax: a 95% threshold, plus a minimum of 1,000 unique sessions per variant, plus an independent confirmation signal (a heatmap pattern, a customer-interview verbatim, a survey response). The finding gets logged as directional, not as a winner. The vocabulary distinction matters. Directional findings inform the next hypothesis. Winners get deployed to 100% of traffic.
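The vocabulary rule can be written down as a small decision function. This is a sketch only: the `Read` class, the label strings, and the assumption that a low-traffic test reaching its full sample at 99% still counts as a winner are mine, not part of the documented protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    sessions_per_variant: int
    required_sessions: int        # from the pre-launch power calculation
    confidence: float             # e.g. 0.96 for 96% significance
    has_independent_signal: bool  # heatmap pattern, interview verbatim, survey response


def classify(read: Read, low_traffic: bool) -> str:
    """Label a finding as 'winner', 'directional', or 'no call'."""
    gate_hit = read.sessions_per_variant >= read.required_sessions
    if gate_hit and read.confidence >= 0.99:
        return "winner"  # deploy to 100% of traffic
    if (low_traffic
            and read.confidence >= 0.95
            and read.sessions_per_variant >= 1000
            and read.has_independent_signal):
        return "directional"  # informs the next hypothesis, not deployed as a winner
    return "no call"


# Hypothetical low-traffic read: 96% confidence, 1,400 sessions, heatmap agrees.
small_store = Read(sessions_per_variant=1400, required_sessions=12000,
                   confidence=0.96, has_independent_signal=True)
print(classify(small_store, low_traffic=True))   # directional
print(classify(small_store, low_traffic=False))  # no call
```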

Sample-size discipline is the layer that makes Layer 3 possible. You cannot apply 99% statistical significance to a test that was never powered for it.

Layer 3: The 99 Rule (Stopping Rule)

A test is declared a winner only at 99% statistical significance. Not 95%. The four-point gap matters at programme scale, and it's the discipline the industry skips because the trade-off is uncomfortable: a 99% test needs around 50% more traffic and 50% more time than a 95% test.
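The "around 50% more traffic" figure can be checked with the standard normal-approximation argument, assuming a two-sided test at 80% power (the power level used in Layer 2). Under that assumption the per-variant sample size scales with the square of the summed critical values:

```latex
n \;\propto\; \left(z_{1-\alpha/2} + z_{1-\beta}\right)^2,
\qquad z_{0.975} \approx 1.960,\quad z_{0.995} \approx 2.576,\quad z_{0.80} \approx 0.842

\frac{n_{99\%}}{n_{95\%}}
  \;=\; \frac{(2.576 + 0.842)^2}{(1.960 + 0.842)^2}
  \;=\; \frac{11.68}{7.85}
  \;\approx\; 1.49
```

So roughly 49% more sessions per variant, and about 49% more time at constant traffic.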

The maths is the argument. At 95% confidence, a change with no real effect still has a 1-in-20 chance of being called a winner. At 99%, it is 1-in-100. In the worst case, where none of the tested changes has a real effect, a 120-test annual programme run at 95% produces roughly six false positives deployed as winners. The same programme at 99% produces roughly 1.2. Five times fewer false positives compounding into the baseline. Over 24 months, that is the difference between a programme that grows revenue and a programme that ships noise and reports it as growth.
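The arithmetic behind those two figures is short enough to write out. In the sketch below, `share_with_no_real_effect` is a hypothetical parameter I've added to make the assumption explicit: the six and 1.2 figures correspond to the worst case where every tested change is a true null.

```python
def expected_false_positives(tests_per_year, confidence, share_with_no_real_effect=1.0):
    """Expected number of null tests that clear the significance threshold by chance."""
    alpha = 1 - confidence
    return tests_per_year * share_with_no_real_effect * alpha


at_95 = expected_false_positives(120, 0.95)
at_99 = expected_false_positives(120, 0.99)
print(f"At 95%: {at_95:.1f} false positives per year")  # 6.0
print(f"At 99%: {at_99:.1f} false positives per year")  # 1.2
print(f"Ratio: {at_95 / at_99:.1f}x")                    # 5.0x
```

If only some of the tested changes are true nulls, both counts shrink in proportion and the five-to-one ratio is unchanged.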

The 99 Rule has its own dedicated page at /framework/99-rule. Within The Evidence Stack, it's Layer 3 because the prior two layers are prerequisites. Operator-set hypotheses produce tests worth gating. Sample-size discipline produces tests that can be gated at 99%. Without Layers 1 and 2, Layer 3 is a slogan instead of a discipline.

Every CRO tool I use (VWO, Convert, AB Tasty, Optimizely) defaults to 95%. The override is manual on every test. That manual override is the discipline.
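Independently of any platform's settings screen, the same gate can be checked by hand from raw counts. The sketch below assumes a two-sided pooled z-test, which is one common choice and not necessarily what a given platform computes. The counts and the `required_n` value are hypothetical. It deliberately shows a result that clears 95% but not 99%.

```python
from math import sqrt
from statistics import NormalDist


def p_value(conv_control, n_control, conv_variant, n_variant):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p_c = conv_control / n_control
    p_v = conv_variant / n_variant
    pooled = (conv_control + conv_variant) / (n_control + n_variant)
    se = sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_variant))
    z = (p_v - p_c) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def call_winner(conv_control, n_control, conv_variant, n_variant,
                required_n, alpha=0.01):
    """True only if the sample-size gate is hit and the variant wins at 1 - alpha."""
    if min(n_control, n_variant) < required_n:
        return False  # gate not hit: the test is not read at all
    variant_ahead = conv_variant / n_variant > conv_control / n_control
    return variant_ahead and p_value(conv_control, n_control,
                                     conv_variant, n_variant) < alpha


# Hypothetical counts: 3.4% control vs 4.0% variant on 10,000 sessions each.
counts = dict(conv_control=340, n_control=10_000,
              conv_variant=400, n_variant=10_000, required_n=10_000)
p = p_value(340, 10_000, 400, 10_000)
print(f"p-value: {p:.4f}")
print("Winner at the 95% default:", call_winner(**counts, alpha=0.05))
print("Winner under the 99 Rule:", call_winner(**counts, alpha=0.01))
```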

Layer 4: Failure-as-Information

A losing test is not a failed experiment. It is evidence that the lever you tested isn't the lever to pull on this audience. The signal is in the loss, and the industry throws it away.

Every losing test in a GoGoChimp engagement is logged with a hypothesis-failure-mode tag. Hypothesis X assumed Y. Result was Z. The mechanism that didn't fire was [specific behaviour]. The next hypothesis on the same surface (the same page, the same audience segment, the same funnel step) is informed by the failure-mode of the previous one. Pattern recognition accumulates across the engagement instead of restarting from intuition every quarter.

Quarterly reviews surface failure-mode clusters. If three losing tests on the pricing page all assumed first-time visitors anchor on price, the cluster surfaces a deeper assumption to interrogate directly: maybe price isn't the anchor at all on that page, and the next test should target a different mechanism entirely. The failure log is the most valuable artefact in the engagement library after 12 months.
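A failure log and its quarterly cluster review can be as simple as a list of records grouped by surface and assumption. The sketch below is one possible shape: the `LosingTest` fields follow the "Hypothesis X assumed Y, result was Z, mechanism that didn't fire" format, while the entries themselves and the three-test cluster threshold are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LosingTest:
    surface: str              # page, audience segment, or funnel step
    hypothesis: str           # what was changed
    assumption: str           # what the hypothesis assumed about the audience
    result: str               # what actually happened
    mechanism_not_fired: str  # the specific behaviour that did not occur


def failure_clusters(log, min_size=3):
    """Group losing tests by (surface, assumption); keep groups worth interrogating."""
    groups = defaultdict(list)
    for entry in log:
        groups[(entry.surface, entry.assumption)].append(entry)
    return {key: tests for key, tests in groups.items() if len(tests) >= min_size}


# Hypothetical log entries for illustration only.
ANCHOR = "first-time visitors anchor on price"
log = [
    LosingTest("pricing page", "show annual saving first", ANCHOR,
               "no lift", "visitors did not compare plans"),
    LosingTest("pricing page", "strike-through original price", ANCHOR,
               "no lift", "discount framing was ignored"),
    LosingTest("pricing page", "move cheapest plan left", ANCHOR,
               "slight drop", "plan order did not change selection"),
    LosingTest("checkout", "add trust badges", "buyers doubt payment security",
               "no lift", "badges were not noticed"),
]

clusters = failure_clusters(log)
for (surface, assumption), tests in clusters.items():
    print(f"{surface}: {len(tests)} losses assumed '{assumption}'")
```

A cluster that clears the threshold is a prompt to test the shared assumption directly, not to try a fourth variation of it.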

DIY AI tools default to discarding losing variants and moving on. The losing variant is "deprecated," the test marked "no winner," and the dashboard suggests three new hero-image variations. The most informative single data point in the test cycle gets binned because nobody asked it the right question. Layer 4 is what asks the right question.

How the four layers compound

The compounding is multiplicative, not additive. Layer 1 alone gives you better hypotheses, but without Layer 2 you'll read under-powered tests and deploy false positives. Layers 1 and 2 give you better hypotheses run to proper sample size, but without Layer 3 you'll still drift at 95%. Layers 1 through 3 give you durable winners, but without Layer 4 your next hypothesis is intuition again instead of pattern recognition. All four running in sequence produce the 28-34% expert-guided AI band. Skip any one and the programme settles into the 4-7% DIY band regardless of which platform you've subscribed to.

Where The Evidence Stack ran end-to-end

Enzymedica UK (Shopify), December 2021. Three compounded CRO wins across a 30-day window during the worst calendar month for health-supplement sales. Every hypothesis was operator-set against a documented revenue-impact ranking. Sample size was hit before any read. The 99 Rule gated every winner-call. Failure-modes from earlier December tests informed the Black Friday hypotheses directly. Outcome: 3.4% baseline to 16.9% on Black Friday (a 4.97x lift on the highest-stakes day of the year) and 11% sustained through December. The prior year's Black Friday converted at around 7%, which means the 16.9% is a 2.4x lift on the same promo day with the same product line. Three compounded CRO wins, not a single-day spike. Loom analytics review on file (Enzymedica analytics walkthrough).

BeeFRIENDLY Skincare (Ezra Firestone brand), 2017. The Evidence Stack applied to a page-speed-driven CRO engagement. Layer 1 picked image-weight reduction on the highest-traffic landing pages, hypothesised against revenue-per-visitor. Layers 2 and 3 gated sample size and the 99% confirmation before deployment. Layer 4 logged the unsuccessful pre-fix variants into the engagement's hypothesis library. Outcome: bounce rate 82.04% to 38.4%, per-visitor value $1.28 to $29.03, annual revenue $48,000 to $1,447,225 (a ~30x revenue multiplier). Numbers held for at least six months. Public case study video (client anonymised in the public version).
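The multipliers quoted in the two case studies above can be re-derived from the before and after figures. The short check below uses only the numbers stated in the text. The per-visitor-value multiple and the bounce-rate drop in percentage points are not quoted above; they are simply computed from the same figures.

```python
def multiple(before, after):
    """How many times larger the 'after' figure is than the 'before' figure."""
    return after / before


enzymedica_vs_baseline = multiple(3.4, 16.9)       # Black Friday vs 3.4% baseline
enzymedica_vs_prior_year = multiple(7.0, 16.9)     # vs prior year's ~7% Black Friday
beefriendly_revenue = multiple(48_000, 1_447_225)  # annual revenue, USD
beefriendly_rpv = multiple(1.28, 29.03)            # per-visitor value, USD
bounce_drop_points = 82.04 - 38.4                  # bounce rate, percentage points

print(f"Enzymedica vs baseline:    {enzymedica_vs_baseline:.2f}x")    # 4.97x
print(f"Enzymedica vs prior year:  {enzymedica_vs_prior_year:.2f}x")  # 2.41x
print(f"BeeFRIENDLY revenue:       {beefriendly_revenue:.2f}x")       # 30.15x
print(f"BeeFRIENDLY per-visitor:   {beefriendly_rpv:.2f}x")           # 22.68x
print(f"BeeFRIENDLY bounce drop:   {bounce_drop_points:.2f} points")  # 43.64 points
```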

These outcomes are the documented evidence of The Evidence Stack working as a system on engagements that the wider industry has reviewed and cited. Industry coverage, such as CMO Times' Q&A on lead-capture conversion choices, has cited Chris as a named source on the relevant disciplines. The point of this framework page is to make the underlying discipline auditable so the outcomes are repeatable, not just attributable.

What The Evidence Stack is NOT

It is not ICE, PIE, or PXL. Those are prioritisation scoring rubrics that rank test ideas on dimensions like Impact, Confidence, and Ease. Useful inside Layer 1 as one input. They do not address sample-size discipline, statistical significance threshold, or failure-mode logging. A programme running ICE alone is running 25% of The Evidence Stack and calling it a methodology.

It is not a single-page worksheet or a Notion template. The four layers are operator decisions made repeatedly across an engagement, not boxes ticked once in onboarding. The discipline lives in the override that happens on every test (manually overriding the platform's 95% default, manually refusing to read the dashboard before sample size, manually documenting why a loss lost). A template is a memory aid for a discipline that already exists. It is not the discipline.

It is not the same thing as OperatorAI, my CRO methodology (distinct from OpenAI's Operator agent product). OperatorAI is the master-brand methodology, the operator-led-AI-CRO architecture. The Evidence Stack is the testing-discipline engine inside OperatorAI. The 4-to-34 Gap is the outcome the engine produces. The 99 Rule is Layer 3 of the engine. Four named entities, one architecture, deliberately layered so the buyer-side vocabulary matches the operator-side reality.

How to apply it to your programme

If your current testing programme defaults to 95% significance, doesn't document failure-modes, runs hypotheses generated by the dashboard rather than the operator, and reads results before sample size is hit, you are running the DIY mode of CRO regardless of which platform you've subscribed to. The 4-7% lift band is the honest expectation. The shift into the 28-34% band is not a software upgrade. It is four deliberate overrides of the dashboard default, applied to every test.

Three first steps that don't require an agency engagement:

  • Set 99% as your default winner-call threshold in your testing platform's project settings today. VWO, Convert, AB Tasty, Optimizely all support it. The override takes ninety seconds and is the single highest-leverage change in this list.
  • Open a "Failure log" document and back-fill it with your last 10 losing tests. Hypothesis, assumption, result, what didn't fire. The pattern that emerges in the back-fill is a sharper next hypothesis than anything the dashboard would surface.
  • Calculate the minimum sample size for your next test before you launch it. Use any free power calculator (Optimizely's, VWO's, Evan Miller's). Write the number on the test brief. Do not read the dashboard until the number is hit.

If after the first three steps you want the full discipline applied to your programme with a senior operator on every engagement, book a free AI audit and we'll show you what the gap is costing in revenue on your current baseline. The audit is the qualifier. It tells you whether The Evidence Stack is worth implementing in-house, with us, or not at all.

Read next

  • The 4-to-34 Gap, the outcome the four layers produce (the 5x lift differential between DIY and expert-guided AI CRO)
  • The 99 Rule, Layer 3 broken out as a standalone framework with the full statistical-significance argument
  • OperatorAI methodology, the master-brand framework that wraps The Evidence Stack, The 99 Rule, and The 4-to-34 Gap
  • Book a free AI audit, the qualifier that tells you whether The Evidence Stack is worth implementing on your programme

References

Ready to audit your CRO programme?

Book my free AI audit
© 2026 GoGoChimp. All rights reserved. Call: 0141 463 6875 - Address: 8 Cheviot Drive, Newton Mearns, Glasgow, G77 5AS
Nominated — Digital Doughnut Digital Marketing Agency of the Year 2021
Shopify Partner: GoGoChimp. Klaviyo Partners: Bronze.