A FRAMEWORK · OPERATORAI
The Evidence Stack
Most CRO programmes optimise for 'what can we test' instead of 'what should we test'. The Evidence Stack is the prioritisation rubric that closes the gap between AI volume and revenue impact.
The five layers
Layer 1. Best practice / playbook
The lowest-quality evidence: 'this works on most ecom sites' or 'Baymard recommends X.' Useful as a starting point for hypothesis generation. Not enough to ship an experiment by itself.
A self-serve AI tool's hypothesis output is almost entirely Layer 1. Most CRO blog content is Layer 1. Most agency proposals are Layer 1.
Operator decision: Layer 1 evidence on its own gets you a hypothesis card. It does not get you sample-size budget.
Layer 2. Heuristic UX review
A trained pair of eyes audits the page against UX principles, conversion-friction patterns and accessibility issues. Better than Layer 1 because it's specific to the page, not a generic best practice.
Most 'expert review' CRO services stop here. The audit lists 30-50 issues, the agency invoices, and ~80% of the issues never get tested. Without quantitative evidence (Layers 3-5) the operator has no way to prioritise.
Operator decision: Layer 2 evidence is enough to queue a hypothesis. It's not enough to ship one ahead of stronger evidence.
Layer 3. Quantitative behavioural data
This is where most CRO programmes start to do real work. Three sources, in ascending order of weight:
- Heatmaps + session recordings: directional. Shows where attention dies and where rage-clicks happen.
- Funnel analytics: quantitative. Shows which step is leaking the most volume.
- Search-bar / on-site search query analysis: high-leverage. Tells you what shoppers are looking for and can't find.
Operator decision: Layer 3 evidence on a high-traffic page bumps a hypothesis into the top quartile of next-quarter's queue. Pair with Layer 1 + Layer 2 and the test usually ships.
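To make the funnel-analytics point concrete, here's a minimal sketch of the leak calculation. The step names and counts are hypothetical, not client figures:

```python
# Minimal sketch: find which funnel step is leaking the most absolute volume.
# Step names and counts are hypothetical, not drawn from any client account.
funnel = [
    ("product_page", 48_000),
    ("add_to_cart", 9_600),
    ("checkout_start", 5_300),
    ("payment", 4_100),
    ("order_complete", 3_400),
]

worst = ("", 0)
for (step, entered), (next_step, progressed) in zip(funnel, funnel[1:]):
    lost = entered - progressed            # shoppers who drop at this step
    rate = progressed / entered            # step-to-step conversion
    print(f"{step} -> {next_step}: {rate:.1%} progress, {lost:,} lost")
    if lost > worst[1]:
        worst = (f"{step} -> {next_step}", lost)

print(f"Biggest leak: {worst[0]} ({worst[1]:,} shoppers)")
```

The absolute volume lost per step, not the percentage alone, is what decides where the Layer 3 weight lands.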
Layer 4. Voice-of-customer / qualitative research
Direct evidence from real customers: surveys, exit-intent polls, post-purchase emails, support-ticket text analysis, sales-call transcript review. Higher weight than Layer 3 because it captures intent and language, not just behaviour.
The 90/10 rule: 90% of CRO programmes don't do Layer 4 because it's slower and less scalable than running another heatmap. The 10% that do consistently outperform on test hit rate, because the language in shipped variants is borrowed directly from how customers actually describe the product to themselves.
Operator decision: Layer 4 evidence promotes a hypothesis straight into the next sprint's experiment queue. Layer 4 is also the source of most surprise-winners (effect sizes 2-5× larger than predicted) because the language genuinely resonates.
Layer 5. Prior tests on the same store
The heaviest evidence. A prior test on this store, on this audience, on this product mix, that produced a documented result.
If you've run 50 disciplined tests over a year, you have 50 data points about what your specific customer responds to. New hypotheses get scored against that history before they ship. A hypothesis that contradicts a prior strong result either gets re-tested with refined design (because something has changed) or gets killed (because the evidence already settled the question).
Operator decision: Layer 5 evidence is decisive. It either rules a hypothesis in (because it's an extension of a known winner) or rules it out (because it's been tried and lost).
The reason operator-led programmes ratchet over time, and AI-only programmes plateau, is Layer 5. AI tools don't carry forward an account-specific memory at the depth required. Operators do, via the failure log.
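A rough picture of how that account memory gets consulted. The records, field names and decision rules below are illustrative assumptions, not the actual failure-log format:

```python
# Sketch: checking a new hypothesis against prior tests on the same store.
# Records and fields are hypothetical stand-ins for a failure log.
prior_tests = [
    {"element": "checkout_copy", "change": "warmer brand voice", "result": "win",  "lift": 0.084},
    {"element": "trust_badges",  "change": "moved above fold",   "result": "loss", "lift": -0.012},
]

def layer5_verdict(element: str) -> str:
    """Rule a hypothesis in or out based on documented prior results."""
    for test in prior_tests:
        if test["element"] == element:
            if test["result"] == "win":
                return "extension of a known winner: rule in"
            return "contradicts a prior loss: re-test with a refined design, or kill"
    return "no prior evidence on this store: Layer 5 contributes nothing"

print(layer5_verdict("checkout_copy"))
print(layer5_verdict("hero_image"))
```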
How the stack scores a real hypothesis
A worked example, drawn from a 2024 client engagement (client name redacted, sector: kitchenware DTC, ~£420k/mo revenue).
Hypothesis: Replace the 'Add to bag' button on PDP with 'Add to kitchen' to match the brand voice and reduce purchase friction.
| Layer | Evidence | Score |
|---|---|---|
| 1 | Generic best-practice: button copy variations are commonly tested | weak (+1) |
| 2 | Heuristic review: button copy is generic, doesn't match elsewhere on site | medium (+2) |
| 3 | Heatmap shows abnormally high cursor hover-time on the button before click. Funnel shows below-benchmark PDP-to-cart. | strong (+3) |
| 4 | Post-purchase survey responses repeatedly use the phrase 'stocking my kitchen' / 'filling my kitchen.' | very strong (+4) |
| 5 | Related test (warmer brand voice in checkout) won 8.4% conversion-lift two quarters earlier. | extension of a winner (+3) |
Total Evidence Stack score: 13/20.
This was the highest-scoring hypothesis in that quarter's queue. It was tested first. Result: 11.7% PDP-to-cart conversion lift, sustained at 90 days, AOV unchanged.
A self-serve AI tool would have ranked the same hypothesis around #40 in its queue (button copy is statistically a low-priority change in the AI's training data). Without the stack, it doesn't ship. Without it shipping, £63k/year of revenue goes unrecovered.
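The arithmetic behind that 13/20 can be sketched directly. This assumes each layer is scored 0-4 for a 20-point ceiling, which is consistent with the table above but not spelled out as the official rubric:

```python
# Sketch of the Evidence Stack rubric, scoring the worked example above.
# Assumption: each of the five layers is scored 0-4, giving a 20-point scale.
LAYERS = {
    1: "best practice / playbook",
    2: "heuristic UX review",
    3: "quantitative behavioural data",
    4: "voice of customer",
    5: "prior tests on the same store",
}

def stack_score(evidence: dict[int, int]) -> int:
    """Sum per-layer scores (each clamped to 0-4) into a /20 total."""
    return sum(max(0, min(4, evidence.get(layer, 0))) for layer in LAYERS)

# 'Add to kitchen' hypothesis, scored layer by layer as in the table above.
add_to_kitchen = {1: 1, 2: 2, 3: 3, 4: 4, 5: 3}
print(f"Evidence Stack score: {stack_score(add_to_kitchen)}/20")  # 13/20
```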
What the stack rules out
Ruling hypotheses out matters as much as ruling them in. The stack is asymmetric: heavy evidence gets a test shipped; weak evidence gets it killed.
Common hypotheses that score low and shouldn't ship:
- 'Move the trust badges higher': Layer 1 only. Ships if Layer 3 confirms a scroll-depth issue. Otherwise queue or kill.
- 'Add a chat widget': Layer 1 + sometimes Layer 2. Almost never enough Layer 3 evidence to justify the load-time cost. Usually killed.
- 'Change the hero image': Layer 1 + Layer 2. Without Layer 4 voice-of-customer evidence about what the hero is failing to communicate, this is a coin-flip test. Queue, don't ship.
- 'Test a darker CTA colour': Layer 1 only. Effect size, when winners exist at all, is in the 0.5-2% range. Below the bar for our 99% confidence threshold at most stores' traffic levels. Killed.
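To see why a 0.5-2% effect sits below that bar, here's a rough sample-size sketch using the standard two-proportion z-test approximation. The baseline and lift figures are illustrative, and this is not the 99 Rule's exact machinery:

```python
# Rough sample-size check: how much traffic a small CTA-colour lift needs
# at a 99% significance threshold. Figures below are illustrative only.
from scipy.stats import norm

def visitors_per_variant(baseline: float, relative_lift: float,
                         alpha: float = 0.01, power: float = 0.8) -> int:
    """Two-proportion z-test sample size (standard approximation)."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided critical value at alpha
    z_beta = norm.ppf(power)            # critical value for the target power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return int((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2) + 1

# 3% PDP-to-cart baseline, 1% relative lift from a darker CTA colour.
n = visitors_per_variant(baseline=0.03, relative_lift=0.01)
print(f"~{n:,} visitors per variant")   # roughly 7.6 million per arm
```

At roughly 7.6 million visitors per arm, the test is out of reach for most stores, which is exactly why it gets killed rather than queued.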
The Evidence Stack's most important function is letting the operator confidently kill weak hypotheses without offending the team that proposed them. The rubric carries the weight of the kill decision; the operator just applies it.
How to apply the stack to your own programme
Three concrete starting moves:
1. Score your last 10 shipped tests retrospectively. For each, write down which evidence layers were behind it. Most stores find their average is Layer 1-2. The top 20% of shippers average Layer 3-4. The compounding programmes average Layer 4-5.
2. Re-rank your current backlog. Pull every test idea in your queue. Score each against the stack. Anything below 6/20 either gets killed or sent for more research before it can ship. Most queues lose 60-70% of their items at this step. That's the rubric working.
3. Start a Layer 4 source today. A one-question post-purchase email ('In your own words: what does this product help you do?') generates more usable testable copy in 30 days than any heatmap will in a quarter. I've yet to find a store where this didn't pay back inside 90 days.
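If you want to mine those replies even lightly, counting repeated short phrases is enough to surface candidate copy. A minimal sketch with invented responses:

```python
# Sketch: surface the phrases customers repeat in post-purchase responses.
# The responses are invented examples; real ones come from the one-question email.
from collections import Counter

responses = [
    "I'm finally stocking my kitchen with things that last",
    "helps me keep stocking my kitchen without the clutter",
    "filling my kitchen with pans I actually use",
]

def repeated_phrases(texts: list[str], n: int = 3, top: int = 5) -> list[tuple[str, int]]:
    """Count n-word phrases across responses and return the most repeated."""
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        counts.update(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return [(phrase, c) for phrase, c in counts.most_common(top) if c > 1]

print(repeated_phrases(responses))  # 'stocking my kitchen' surfaces at the top
```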
The Evidence Stack isn't the whole methodology
It's one of three named components of OperatorAI:
- The Evidence Stack: what counts as evidence, and in what order (this page).
- The 99 Rule: the statistical-significance discipline that protects the wins.
- The 4-to-34 Gap: why operator-led AI outperforms self-serve by 4-7×.
The 4-to-34 Gap is the diagnosis. The 99 Rule is the discipline. The Evidence Stack is the prioritisation rubric. Together they're how an operator-led programme ratchets compounding lift across a year instead of plateauing inside a quarter.


