The ICE framework is broken. Here's what to use instead for A/B test prioritisation
Every CRO team eventually encounters ICE scoring: Impact, Confidence, Ease. Score each out of 10, multiply the three, and prioritise in descending order. The appeal is obvious: a clean number you can defend in meetings.
The problem is that all three inputs are subjective, and in practice, whoever presents the hypothesis also scores it. I've sat through dozens of ICE-scored test backlogs where every advocate scored their own hypothesis 9/10/9 and nothing was ever deprioritised.
The three-part framework that replaces ICE
This is the prioritisation core of our methodology. Three inputs, weighted by evidence, not advocacy.
1. Evidence weight (0–5)
What do we actually know? Session recordings showing that 30% of users abandon on this step = 4. A past test on a similar page that won with a 9% lift = 5. Someone's hunch = 0. No hypothesis should enter the backlog without at least 2 points of evidence weight.
2. Ceiling estimate (percentage)
If this variant wins, what's the realistic upper bound on the lift? Base the estimate on comparable tests, industry benchmarks, or Bayesian priors from prior tests. A 2% ceiling test on a high-traffic page can be worth more in absolute conversions than a 20% ceiling test on a low-traffic page.
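To make the traffic point concrete, here is a minimal sketch of the arithmetic. The visitor counts and the 3% baseline conversion rate are made-up numbers for illustration, and the function name is hypothetical.

```python
def extra_conversions_per_month(monthly_visitors: int,
                                baseline_rate: float,
                                ceiling_lift: float) -> float:
    """Upper bound on extra monthly conversions if the variant hits its ceiling.

    ceiling_lift is a relative lift, e.g. 0.02 for a 2% ceiling.
    """
    return monthly_visitors * baseline_rate * ceiling_lift


# Illustrative figures only (assumptions, not benchmarks).
high_traffic = extra_conversions_per_month(100_000, 0.03, 0.02)  # 2% ceiling
low_traffic = extra_conversions_per_month(5_000, 0.03, 0.20)     # 20% ceiling

print(round(high_traffic), round(low_traffic))  # 60 30
```

With these assumed numbers, the 2% ceiling test is worth roughly twice as many extra conversions per month as the 20% ceiling test. Flip the traffic figures and the conclusion flips too, which is why the ceiling should never be read on its own.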
3. Traffic-adjusted runtime (days)
Run the sample-size maths before prioritising. If your traffic produces significance in 9 days, run it. If it requires 47 days, either increase traffic, accept the opportunity cost, or deprioritise.
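One way to run that maths is the standard normal-approximation formula for comparing two proportions. The sketch below is not necessarily the calculator we use in practice: it assumes a two-sided test at 5% significance and 80% power, an even traffic split across two arms, and steady daily traffic. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline_rate: float, relative_lift: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Visitors needed per arm to detect the given relative lift."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def runtime_days(baseline_rate: float, relative_lift: float,
                 daily_visitors: int, arms: int = 2) -> int:
    """Days needed to fill every arm, assuming steady daily traffic."""
    n = sample_size_per_arm(baseline_rate, relative_lift)
    return math.ceil(n * arms / daily_visitors)


# Illustrative inputs: 3% baseline rate, 10% relative lift, 10,000 visitors/day.
print(runtime_days(0.03, 0.10, 10_000))
```

Small lifts are expensive: halving the lift you want to detect roughly quadruples the required sample, so a low-ceiling test on a low-traffic page can easily land in the 30-day+ bucket.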
How this changes the backlog
Tests with no evidence weight go to the bottom regardless of how confident the advocate feels. Tests with 30-day+ runtimes get staged later than faster tests of similar ceiling. Advocacy stops winning. Evidence wins.
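As a rough sketch, those rules can be expressed as a sort key. The article does not define an exact formula, so the details here are assumptions: ceilings within the same 5-point band count as "similar", 30 days is the slow threshold, and evidence weight breaks ties. The hypotheses are invented examples.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    name: str
    evidence_weight: int   # 0-5
    ceiling_pct: float     # realistic upper bound on lift, in percent
    runtime_days: int      # traffic-adjusted runtime


def sort_key(h: Hypothesis, band_pct: float = 5.0, slow_days: int = 30):
    no_evidence = h.evidence_weight == 0          # zero evidence sorts last
    ceiling_band = int(h.ceiling_pct // band_pct)  # "similar ceiling" band (assumption)
    slow = h.runtime_days >= slow_days            # slow tests staged later within a band
    return (no_evidence, -ceiling_band, slow, -h.evidence_weight, h.runtime_days)


# Invented backlog for illustration.
backlog = [
    Hypothesis("Rewrite hero headline", 0, 20.0, 9),
    Hypothesis("Shorten checkout form", 4, 8.0, 47),
    Hypothesis("Add trust badges at payment", 5, 9.0, 9),
    Hypothesis("Reorder pricing tiers", 2, 3.0, 14),
]

ordered = sorted(backlog, key=sort_key)
for h in ordered:
    print(h.name)
```

Here the headline rewrite has the highest claimed ceiling but no evidence, so it lands last. The two checkout-area tests have similar ceilings, and the 9-day one is staged ahead of the 47-day one.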
Where this fits in the OperatorAI methodology
This article sits under The Evidence Stack, one of the three named frameworks inside our OperatorAI methodology (GoGoChimp's CRO methodology, distinct from OpenAI's Operator agent product). The Evidence Stack is GoGoChimp's four-layer testing discipline: operator-set hypothesis, sample-size discipline, The 99 Rule, and failure-as-information.
For where this work sits in our operating-model maturity classification, see The OperatorAI Maturity Model — the five-tier framework from Ad-hoc through Operator-Led.
Want us to do this for your site?
Book a free AI audit. 15 minutes. We’ll show you three things your site is missing and what we’d test first.



