The CRO Primer for Founders: 15-Minute 2026 Guide

Q: How is CRO different from SEO or paid ads?

SEO and paid ads are acquisition disciplines. They bring traffic to the site. CRO is a conversion discipline. It takes the traffic you already have and converts more of it into revenue. The three work in sequence: SEO and ads fill the funnel, CRO closes more of it. Below 30,000 monthly sessions on the page you want to test, spend on acquisition first.

Q: When should I NOT invest in CRO?

Don't buy CRO if you're under £50K/month revenue, if your traffic is below roughly 30,000 monthly sessions on the page you want to test, if you've replatformed in the last 60 days, or if you don't have GA4 and Microsoft Clarity properly installed. Below those thresholds the maths doesn't work, and the right move is acquisition or analytics infrastructure first.

Q: Can I do CRO in-house?

Yes, if you have a product or growth function with bandwidth to manage testing weekly, the discipline to ship implementation when a variant wins, and the seniority to set operator hypotheses rather than letting the dashboard generate them. The starting points are defaulting your testing platform to 99% confidence, opening a failure log of your last 10 losing tests, and calculating minimum sample size before every launch. SaaS founders with a product team and ecommerce founders with a tech-led culture are the natural fit.

The 15-minute read for founders who want a real benchmark, not a sales pitch

Author: Chris McCarron, Founder, GoGoChimp
Date: 14 May 2026
Version: v2.0 (retargeted around benchmark + A/B testing discipline 2026-06-27)

If your reaction to "you should be running CRO" is the same as your reaction to flossing (yes, I should, but I never quite get round to it), stop reading here. The rest of this primer is for the founders who already floss. The ones who suspect their funnel is leaking money and want a straight answer about two things: what counts as a good conversion rate in 2026, and how to run an A/B test that produces a trustworthy answer instead of a flattering one.

I'm writing this as a founder to other founders. Thirteen years of running conversion rate optimisation on real client revenue, not as an advisor pointing at slides. If you've spent any time on a CRO sales call recently, you know the routine. Twenty minutes of agency-speak. A "discovery audit" that costs four figures. A pitch deck full of generic best practice. You leave the call no closer to understanding what good actually looks like.

This primer is the benchmark + statistical-discipline crash course for founders and startup operators. By the end, you'll know what a good conversion rate is for ecommerce, SaaS, PPC and email sign-ups; whether 10% is realistic for your category; how many visitors you need to A/B test honestly; and why most CRO programmes fail in ways the dashboard hides.

Fifteen minutes. Nine sections. No discovery call.

Section 1: What CRO actually is (the short version)

Conversion rate optimisation is the discipline of taking the traffic you already have and converting more of it into revenue. That's the whole one-line definition. It's not growth hacking. It's not marketing. It's the engineering and behavioural-science work that happens after the traffic arrives.

Strip the jargon, and a serious CRO programme breaks down into six components: hypothesis design, audience evidence, sample-size discipline, statistical-significance threshold, implementation engineering, failure-mode logging. The tools are interchangeable. The disciplines are not.

I'm keeping this section deliberately light. If you want the full definitional treatment (what counts as a conversion, micro vs macro conversions, the history of the discipline, how the formula actually works), the pillar page is "What is conversion rate optimisation? CRO explained for 2026". Read that first if you're starting cold. Come back here for the benchmark and the statistical-discipline bit.

The rest of this primer assumes you know what a conversion rate is. The question we're answering is whether yours is any good, and whether the tests you're running to improve it are actually trustworthy.

CRO doesn't fail because the discipline is hard. It fails because the discipline gets skipped and the dashboard hides the skip. Benchmark first, then sample-size discipline, then statistical significance. In that order.

Section 2: Why most CRO programmes fail (the four failure modes)

Most CRO programmes don't fail because the team is incompetent. They fail because specific disciplines get skipped, and the skip is invisible to the buyer until the baseline starts drifting in month nine.

I've audited more failing CRO programmes than I can count. The same four failure modes repeat. If a vendor's process doesn't gate against all four, the programme is structurally exposed to all four. This is the short list of conversion rate optimisation mistakes I see on every founder audit call, and the answer to "why isn't my CRO working" is almost always one of them.

Failure mode 1: Hypothesis poverty. The agency tests "best practices" instead of audience-anchored hypotheses. Test the CTA colour because HubSpot wrote a blog about red versus green in 2017. Test the hero image because the dashboard suggested it. Test the headline because the AI tool generated five variants and one of them sounded clever. None of those are hypotheses. They are guesses dressed up as experiments, and they're the deepest reason A/B tests fail at scale. If you're searching "why A/B tests fail" or "A/B testing mistakes", you're looking at the most common cause in the discipline.

A real hypothesis names the audience, names the mechanism, and names the expected revenue impact. If we change the homepage hero from a feature-led headline to a benefit-led headline for first-time mobile visitors, conversion will lift because the page currently fails to answer "what does this company do" in the first three seconds.

The deeper reason surface tests cap out is that they target the wrong layer of decision-making. Gerald Zaltman's Harvard Business School research found that up to 95% of purchase decisions happen in the subconscious, and the brain processes emotional stimuli roughly 3,000 times faster than rational thought (Zaltman, How Customers Think, 2003). A button-colour test is a rational-layer intervention. A trust-architecture or value-proposition test is a subconscious-layer intervention. Hypothesis quality is the ceiling. Everything above this layer is execution.

Failure mode 2: Sample-size impatience (the peeking problem). Winners get declared at day three. The dashboard pings with a "winner" notification, variant B is up 12% with 78% probability-to-beat, somebody screenshots it and posts it in Slack. The test gets called. The variant ships. The "win" enters the case-study deck. This is the most expensive A/B testing mistake in the discipline.

The problem: the test was under-powered. False-positive rates on tests called before minimum sample size can climb above 26% even when the dashboard claims 95% confidence (Miller, 2010). One in four "winners" is noise. Ship enough of them and the baseline drifts.

The formal academic foundation is Johari, Pekelis, and Walsh's 2017 KDD paper "Peeking at A/B Tests", which introduced the mixture sequential probability ratio test (mSPRT) methodology Optimizely's Stats Engine now implements. Bayesian alternatives (the default in AB Tasty, Statsig, and Eppo as of 2026) allow continuous monitoring under proper stopping rules without inflating false-discovery rate. Either approach is defensible. The 95%-fixed-sample-size-with-peeking pattern most platforms ship as default is the one that is not. We come back to the sample-size formula in Section 4.

Failure mode 3: 95% confidence shipping (when 99% is the right threshold). The industry default for statistical significance is 95%. Every CRO platform ships with 95% as the winner-call threshold. It sounds rigorous. It isn't, at programme scale.

The maths is the argument. 95% confidence means a 1-in-20 chance the observed lift is noise. A 120-test annual programme run at 95% produces roughly six false positives deployed as winners. The same programme at 99% produces roughly 1.2. Five times fewer false positives compounding into the baseline over 24 months. The difference between 95% vs 99% confidence isn't academic; it's the difference between a programme that compounds and a programme that quietly rolls itself back.
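The arithmetic behind those two figures can be sketched in a few lines. This is a worst-case illustration that assumes every tested variant has no true effect, so each test has a chance equal to one minus the confidence threshold of producing a false winner; the function name is illustrative only.

```python
def expected_false_positives(tests_per_year: int, confidence: float) -> float:
    """Expected false winners per year, assuming no variant has a real effect."""
    alpha = 1.0 - confidence  # e.g. 0.05 at a 95% threshold
    return tests_per_year * alpha


at_95 = expected_false_positives(120, 0.95)
at_99 = expected_false_positives(120, 0.99)

print(f"95% threshold: ~{at_95:.1f} false positives per year")
print(f"99% threshold: ~{at_99:.1f} false positives per year")
print(f"ratio: ~{at_95 / at_99:.0f}x")
```

In a real programme some variants do have genuine effects, so the true count of false winners would sit somewhere below this worst case at either threshold.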

GoGoChimp runs every winner-call at 99%, the discipline we call The 99 Rule. The trade-off is real: a 99% test needs around 50% more traffic and 50% more time than a 95% test. The pay-off is also real. Over a 24-month engagement, the difference is the gap between a programme that grows revenue and a programme that ships noise and reports it as growth.

Failure mode 4: Failure treated as failure. A losing test is the most informative data point in the test cycle. The industry throws it away. The default behaviour in every DIY AI tool is to mark a losing variant "deprecated" and suggest three new hero-image variations. The loss carries the answer to a specific question (this lever does not move this audience), and the question never gets logged. Next quarter the same CRO expert tests a similar hypothesis on the same surface because the pattern from the previous loss was never written down.

A real CRO programme runs a failure log. Hypothesis. Assumption. Result. What didn't fire. The next hypothesis on the same page, the same audience, the same funnel step is informed by the failure-mode of the previous one. Pattern recognition accumulates across the engagement instead of restarting from intuition every quarter.
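One way to picture such a failure log is as a small structured record per losing test. The sketch below is a hypothetical layout, not GoGoChimp's actual tooling: the class names, field names, and the example entry are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FailureLogEntry:
    page: str               # surface the test ran on
    audience: str           # who the hypothesis was written for
    funnel_step: str        # where in the funnel the test sat
    hypothesis: str         # what we expected to happen and why
    assumption: str         # the belief the hypothesis rested on
    result: str             # what the test actually showed
    what_did_not_fire: str  # the mechanism that failed to move the audience


@dataclass
class FailureLog:
    entries: List[FailureLogEntry] = field(default_factory=list)

    def add(self, entry: FailureLogEntry) -> None:
        self.entries.append(entry)

    def prior_failures(self, page: str, audience: str, funnel_step: str) -> List[FailureLogEntry]:
        """Losses already recorded for the same page, audience and funnel step."""
        key = (page, audience, funnel_step)
        return [e for e in self.entries if (e.page, e.audience, e.funnel_step) == key]


log = FailureLog()
log.add(FailureLogEntry(
    page="homepage",
    audience="first-time mobile visitors",
    funnel_step="landing",
    hypothesis="Benefit-led hero headline lifts conversion",
    assumption="Visitors cannot tell what the company does in three seconds",
    result="No detectable lift (hypothetical example)",
    what_did_not_fire="Headline clarity was not the blocker for this audience",
))

# Before writing the next hypothesis for this surface, read the prior losses.
for entry in log.prior_failures("homepage", "first-time mobile visitors", "landing"):
    print(entry.hypothesis, "->", entry.what_did_not_fire)
```

The lookup by page, audience, and funnel step is the point of the design: the next hypothesis on the same surface starts from what already failed rather than from intuition.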

Hypothesis poverty. Sample-size impatience. 95% confidence shipping. Failure treated as failure. Those four explain almost every conversion rate optimisation failure I've audited in 13 years. Skip one and the programme drifts. Skip two and it stops growing. Skip three and it's theatre.

The four failure modes compound. One in four "winners" called early at 95% is noise (Miller, 2010). Run a 120-test programme at 95% with peeking and you've shipped roughly fifteen false positives a year. Run the same programme with sample-size discipline and a 99% threshold and you've shipped one.

Section 3: What is a good conversion rate? The 2026 benchmark table

This is the section every founder reads first and most blogs get wrong. "What is a good conversion rate?" is the most-Googled CRO question in 2026. The honest answer is: it depends on your traffic mix, your category, and what you're asking the visitor to do. The slightly-less-honest-but-actually-useful answer is the benchmark band below, drawn from primary-source data published by Unbounce, Google, Shopify, Baymard, and Statista.

Here's the rule before we get to the numbers: a good conversion rate is one that's above the median for your category and moving in the right direction quarter-on-quarter. A site converting at 1.4% in a category whose median is 1.1% is doing better than a site converting at 5% in a category whose median is 8%. The benchmark only matters as the reference line.

Ecommerce benchmark: what is a good ecommerce conversion rate?

Ecommerce DTC average sits at 2.5% to 3%. Across cross-vertical 2025-2026 datasets, ecommerce sites converted between 2.5% and 3% on average, with significant variance by sub-category. Apparel and accessories typically converted at 1.8% to 2.4%; beauty and health at 2.8% to 3.5%; consumer electronics at 1.4% to 2%. The 2.5-3% range is the cross-vertical ecommerce average; if you're below it on consistent traffic, closing that gap is the place to start.

A good ecommerce conversion rate is 5% or higher. The 5% threshold is roughly the top quartile of DTC ecommerce stores in published benchmarks. If you're already at 5%+ on real traffic with a normal product range, you're materially out-performing your category and the next-best lever is usually average order value, not conversion rate. CRO doesn't stop at 5%, but the rate-of-return on rate optimisation starts to flatten and revenue-per-visitor becomes the more honest metric.

Is a 10% conversion rate good for ecommerce? Yes, top decile. A 10%+ ecommerce conversion rate is the top decile of stores and is realistic in three specific categories: high-repeat consumables with strong subscription mechanics (skincare refills, supplements, coffee), single-product narrow-niche stores (a hero product with no choice paralysis), and event/digital-product stores where the offer is time-bounded. 10%+ is unusual but not mythical. If you're targeting it on a 50-SKU mid-AOV apparel store with mixed traffic, you're going to be disappointed. If you're targeting it on a single hero SKU with retention-driven email traffic, it's a realistic ceiling. Enzymedica UK hit 16.9% on Black Friday 2021 across UK-only traffic; that was a category-friendly product (single-purpose health supplement, repeat-buyer base) on a CRO-disciplined engagement, not a generic ecommerce site.

The Baymard ceiling: 70.19% cart abandonment is normal. Even on a well-run ecommerce store, roughly 70% of carts get abandoned at checkout. Baymard's 2026 meta-analysis of 50 studies puts the global average at 70.19%, with mobile at 80.02% and desktop at 66.41% (Baymard, 2026). The two biggest drivers are extra costs at checkout (48% of abandoners) and "too long or complicated" checkout flow (18%). The single highest-impact CRO win on most ecommerce sites is shortening the checkout from Baymard's average 14.88 form fields to the 6-8 actually required (Baymard, 2026). Streamlined checkout flow yields an average 35.26% conversion uplift across Baymard's audited cohort.

SaaS B2B benchmark: what is a good SaaS conversion rate?

SaaS B2B average sits at 2% to 3% on free-trial or demo flows. B2B SaaS sites converting cold traffic to a free trial or demo request typically sit in the 2-3% band. Self-serve product-led SaaS (signup-to-active-user) varies wildly by category; enterprise SaaS demo conversion is closer to 1-2%.

A good SaaS conversion rate is 5% to 7%. The top-quartile B2B SaaS demo-request conversion rate sits in the 5-7% band. Above 7% on cold paid traffic, you're either targeting an extremely warm audience or your offer is exceptional (founder-led category with strong inbound brand).

The SaaS conversion rate question is almost always misframed. Cold-traffic landing page conversion (2-3% typical) is a very different metric from product-qualified-lead conversion (8-15% typical for SaaS-with-product-led-growth) or from free-to-paid trial conversion (15-25% typical). Make sure you're benchmarking the same conversion event before you start comparing yourself to anyone else's reported number. EM360 went 0.12% to 7% on a B2B SaaS lead-gen page in 30 days; the 0.12% baseline was so far below the SaaS floor that the discipline lifted it past the category median and into the top-quartile band.

Landing page benchmark: what is a good landing page conversion rate?

Landing page median sits at 6.6% across 41,000 pages. Unbounce's Conversion Benchmark Report (the largest single landing-page dataset published) found a median conversion rate of 6.6% across 41,000 landing pages, 464 million pageviews, and 57 million conversions (Unbounce, 2024). Landing pages convert at roughly double the rate of homepages because the traffic is intent-matched and the offer is singular.

A good landing page conversion rate is 10% or higher. Above 10% lands you in the top quartile of Unbounce's dataset. The highest-converting landing-page vertical in published benchmarks is events and entertainment at a 12.3% median, driven by clear-intent traffic plus natural scarcity dynamics. The lowest is SaaS / technology at 3.8%, dragged down by longer sales cycles and form-fill versus signup intent. VectorCloud's GDPR Compliance Checklist landing page hit 29.57% (34 conversions from 115 visits) on Glasgow B2B traffic in February 2018. Freshers Festivals hit 46.82% on Scotland-wide university-circuit traffic. Both are well above the vertical median; both happened because the offer was tightly matched to the audience and the page didn't ask the visitor to do anything else.

PPC landing page benchmark: what is a good PPC conversion rate?

PPC landing page average sits at 2% to 5% across most categories. Paid-traffic landing pages convert in the 2-5% band on average, materially lower than email-traffic landing pages on the same offer. Unbounce's data shows email traffic converting at 5-6x the rate of paid traffic for ecommerce landing pages, primarily because email traffic is warm and has prior brand exposure.

A good PPC conversion rate is 5% or higher on paid search. That threshold sits just above the cross-vertical median for Google search ads, which WordStream's aggregates put at roughly 4.4% across all industries. Facebook ad conversion rates are typically lower (1-3%) because the traffic is interruption-based rather than intent-matched. Display network conversion sits below 1% in most categories and is rarely worth direct-response optimisation; treat display as a brand or retargeting channel rather than a primary conversion surface.

If your PPC landing page is converting at 1% and your category median is 4%, the bottleneck is almost never the bid. It's the page. Pause the campaign, rebuild the landing page, restart at half the budget, and measure the page conversion rate before you touch the keyword list. That sequence is right roughly nine times out of ten.

Email sign-up form benchmark

Email sign-up popups convert at a 1.95% global average, rising to 5%+ in the top quartile. Sumo's analysis of 2 billion popup views found a global average conversion rate of 1.95% for email opt-ins, with the top 10% of popups converting above 9.28%. Inline email-signup forms in blog content typically convert at 1-3%; exit-intent popups perform best in the 3-5% band when properly targeted. HubSpot's CTA research found personalised CTAs convert 202% better than generic buttons across a sample of roughly 330,000 CTAs.

If your sign-up form is converting at sub-1% on real traffic, the form itself is rarely the problem. The offer is. Lead-magnet quality (a useful asset versus a generic newsletter promise) lifts sign-up conversion more reliably than form-field tweaks. Test the offer before you test the colour.

Why these benchmarks vary so much

Traffic source mix is the single biggest variable. A site getting 80% of its traffic from branded organic search is going to convert at 3x the rate of an identical site getting 80% from cold paid social. Category matters next; price point matters after that; offer clarity matters above all. Two ecommerce stores in the same vertical with the same product and the same price can sit at 1.5% and 5% conversion respectively, and the difference is almost always above-the-fold copy plus checkout friction.

The benchmark is the reference line, not the goal. The goal is to beat your last quarter, in your category, on your real traffic mix. If you're below the category median, the benchmark tells you the gap is closable. If you're at the median, the benchmark tells you the next 30% lift is in the discipline below.
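As a rough illustration of using the benchmark as a reference line rather than a goal, the sketch below encodes the headline bands quoted in this section and reads a site's rate against them. The dictionary keys, the function name, and the three band labels are assumptions made for the example; the thresholds are the typical and "good" figures given above, using the lower end of each typical range.

```python
# category: (typical rate in %, "good" threshold in %), taken from the bands above
BENCHMARKS = {
    "ecommerce": (2.5, 5.0),
    "saas_b2b_demo": (2.0, 5.0),
    "landing_page": (6.6, 10.0),
    "ppc_landing_page": (2.0, 5.0),
    "email_popup": (1.95, 5.0),
}


def read_against_benchmark(category: str, rate_pct: float, last_quarter_pct: float):
    """Return (band, improving) for a conversion rate expressed in percent."""
    typical, good = BENCHMARKS[category]
    if rate_pct >= good:
        band = "good"
    elif rate_pct >= typical:
        band = "typical"
    else:
        band = "below typical"
    improving = rate_pct > last_quarter_pct  # the goal: beat your last quarter
    return band, improving


print(read_against_benchmark("ecommerce", 1.4, 1.1))
print(read_against_benchmark("landing_page", 7.0, 7.4))
```

The second return value matters as much as the first: a rate inside the typical band that is falling quarter-on-quarter is a worse position than a below-typical rate that is climbing.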

Across 41,000 landing pages, Unbounce found a 6.6% median conversion rate. Across 50 cart-abandonment studies, Baymard found a 70.19% global abandonment rate. The two numbers together say almost everything about ecommerce CRO: the page converts a fraction of visitors, and the checkout loses most of the rest. Optimise the page first, the cart second; ignore anything else until both are fixed.

Section 4: The Evidence Stack on A/B testing sample size and statistical significance

The benchmark tells you where you are. The Evidence Stack tells you how to move. Four layers of testing discipline that separate a programme producing trustworthy A/B test results from one producing flattering noise. The sample-size and significance layers are where almost every A/B testing programme leaks credibility, so this section is the longest in the primer for a reason.

Why A/B testing sample size matters more than the test itself

A/B testing sample size is the boring mechanic that prevents the most expensive failure mode in CRO. It's also the discipline most A/B testing programmes skip first. The reason is human: a winner notification at day three feels like a victory, and waiting another fortnight for the same dashboard to say "still 99% confident" feels like overthinking. The maths disagrees.

The peeking problem is the formal name for what happens when you read a dashboard before the minimum sample size is hit. Every time you look, the false-positive rate climbs. After roughly six evenly spaced peeks on a 95% test, the actual false-positive rate is roughly two to three times the nominal 5%. The dashboard still says 95%. Reality says something closer to 85-90%. This is what Johari, Pekelis and Walsh's KDD 2017 paper was published to address, and it's why platforms that implement sequential or Bayesian frameworks (Optimizely's Stats Engine, AB Tasty's default mode, Statsig, Eppo) handle continuous monitoring without inflating false-discovery rate. The classic 95%-fixed-sample-size pattern most older A/B testing tools default to does not.
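The effect of peeking can be checked with a small simulation. The sketch below is a simplified model, not any platform's implementation: it simulates A/A tests (no real difference between variants), assumes traffic arrives in equal batches between looks, and uses a normal approximation for the test statistic. The function name and parameters are illustrative.

```python
import random


def false_positive_rate(n_tests: int, n_looks: int, peek: bool,
                        z_crit: float = 1.96, seed: int = 42) -> float:
    """Share of A/A tests that get called as 'winners'.

    peek=True  -> stop and call the test at the first look where |z| > z_crit.
    peek=False -> only the final look (the planned sample size) counts.
    """
    rng = random.Random(seed)
    false_winners = 0
    for _ in range(n_tests):
        running_sum = 0.0
        called = False
        for look in range(1, n_looks + 1):
            # Each batch of traffic contributes an independent N(0, 1) increment
            # under the null; the cumulative z-score is sum / sqrt(number of batches).
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / look ** 0.5
            is_final_look = look == n_looks
            if (peek or is_final_look) and abs(z) > z_crit:
                called = True
                break
        false_winners += called
    return false_winners / n_tests


with_peeking = false_positive_rate(20_000, n_looks=6, peek=True)
fixed_sample = false_positive_rate(20_000, n_looks=6, peek=False)

print(f"six peeks, stop at first 'winner': {with_peeking:.1%}")
print(f"single read at planned sample size: {fixed_sample:.1%}")
```

Under these assumptions the single-read rate stays near the nominal 5%, while the peeking rate lands well above it, even though every individual look used the same 95% threshold.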

The A/B test sample size formula (in plain English)

The standard A/B test sample size formula has three inputs: your baseline conversion rate, the minimum detectable effect you care about, and the statistical power you want. The output is the minimum number of visitors per variant you need before reading the result.

The formal version is the Lehr formula (or the two-proportion variant Optimizely and VWO publish). In plain English: required sample size per variant ≈ 16 × (baseline conversion rate × (1 − baseline)) ÷ (minimum detectable effect × baseline)², at 80% power and a 5% significance level. The "16" comes from the standard normal distribution values for 80% power and 95% two-sided significance. Bump the threshold to 99% and the multiplier climbs to roughly 24. Bump power to 90% and it climbs again.
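For readers who want to see where the multipliers come from, here is a minimal sketch using only the Python standard library. It assumes the usual two-sided normal approximation, in which the multiplier is 2 × (z for significance + z for power)²; the function name is illustrative.

```python
from statistics import NormalDist


def sample_size_multiplier(confidence: float = 0.95, power: float = 0.80) -> float:
    """Multiplier in n ≈ multiplier × p(1 − p) ÷ (absolute effect)²."""
    alpha = 1.0 - confidence
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided significance
    z_beta = NormalDist().inv_cdf(power)               # statistical power
    return 2.0 * (z_alpha + z_beta) ** 2


for confidence, power in [(0.95, 0.80), (0.99, 0.80), (0.95, 0.90), (0.99, 0.90)]:
    m = sample_size_multiplier(confidence, power)
    print(f"{confidence:.0%} significance, {power:.0%} power: multiplier ≈ {m:.1f}")
```

The 95%/80% case comes out just under 16 and the 99%/80% case just under 24, a ratio of about 1.5, which is where the "around 50% more traffic" cost of a 99% threshold comes from.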

You don't need to do the maths by hand. Every major A/B testing platform publishes a sample size calculator. Evan Miller's calculator is the academic-rigour benchmark. Optimizely publishes one. VWO publishes one. We've also built our own sample size calculator tuned for the 99% confidence threshold and the under-traffic protocol below. Use whichever sits closest to your workflow; the maths is the same.

Worked example: how many visitors do I need to A/B test?

Say your baseline conversion rate is 2.5% (mid-ecommerce). You want to detect a 10% relative lift (so a move from 2.5% to 2.75%). You want 80% power and 95% significance. Plug those into the formula above, or any sample-size calculator, and the output is roughly 62,000 to 64,000 visitors per variant, so around 125,000 total. At 20,000 monthly visitors to the test page, that's a little over six months to detect the effect at 95% significance, and closer to nine months at 99%.

Now flip the numbers. Baseline 5%, MDE 20% relative lift (so 5% to 6%), 80% power, 95% significance. Required sample size drops to roughly 7,800 per variant. Higher baselines and bigger detectable effects need dramatically less traffic; lower baselines and smaller effects need dramatically more. A site converting at 0.5% trying to detect a 10% relative lift needs roughly 320,000 visitors per variant. This is why most low-traffic founders shouldn't run pure A/B tests at all; the under-traffic protocol below is the answer instead.
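The three scenarios can be run through the 16× rule of thumb directly. This is a sketch of the approximation only (80% power, 95% two-sided significance); dedicated calculators use slightly different exact formulas and will differ by a few percent. The function name is illustrative, and the 20,000 monthly visitors figure is the one from the worked example.

```python
def lehr_sample_size(baseline: float, relative_mde: float, multiplier: float = 16.0) -> int:
    """Approximate visitors needed per variant under the rule-of-thumb formula."""
    absolute_effect = baseline * relative_mde  # e.g. 2.5% × 10% = 0.25 points
    variance = baseline * (1.0 - baseline)
    return round(multiplier * variance / absolute_effect ** 2)


MONTHLY_VISITORS = 20_000  # traffic to the test page, split across two variants

scenarios = [
    ("2.5% baseline, 10% relative lift", 0.025, 0.10),
    ("5% baseline, 20% relative lift", 0.05, 0.20),
    ("0.5% baseline, 10% relative lift", 0.005, 0.10),
]

for label, baseline, mde in scenarios:
    per_variant = lehr_sample_size(baseline, mde)
    months = 2 * per_variant / MONTHLY_VISITORS
    print(f"{label}: ~{per_variant:,} per variant, ~{months:.1f} months at 95%")
```

Because the effect size is squared in the denominator, halving the detectable lift roughly quadruples the traffic needed, which is why low-baseline sites run out of runway so quickly.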

The single biggest A/B testing mistake in this calculation is setting the minimum detectable effect at whatever number gets you a runnable sample size. If you set MDE to 50% relative lift just so the test fits inside your monthly traffic budget, you've quietly told the test: "I'm only interested in absurdly large wins." Real CRO lifts are usually in the 5-20% range per test; setting MDE above that band means you'll declare most of your wins inconclusive even when they're real.
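A more honest way to approach the same constraint is to invert the formula: start from the traffic you actually have and ask what lift the test could detect. The sketch below rearranges the 16× rule of thumb (80% power, 95% significance); the function name and the traffic figures are illustrative assumptions.

```python
import math


def detectable_relative_mde(baseline: float, visitors_per_variant: int,
                            multiplier: float = 16.0) -> float:
    """Smallest relative lift a test of this size can reliably detect."""
    absolute_effect = math.sqrt(multiplier * baseline * (1.0 - baseline) / visitors_per_variant)
    return absolute_effect / baseline


# Hypothetical store: 2.5% baseline, 20,000 monthly visitors split across two variants.
for months in (1, 3, 6):
    per_variant = 10_000 * months
    mde = detectable_relative_mde(0.025, per_variant)
    print(f"{months} month(s), {per_variant:,} per variant: detectable lift ≈ {mde:.0%}")
```

If the detectable lift at your realistic test window sits well above the 5-20% band, that is a signal the surface is under-trafficked for pure A/B testing, not a reason to quietly raise the MDE.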

95% vs 99% confidence: why GoGoChimp uses 99%

95% confidence means a 1-in-20 chance the observed lift is noise. 99% confidence means a 1-in-100 chance. The difference doesn't sound material until you multiply across an annual testing programme.

A 120-test annual programme run at 95% produces roughly six false positives shipped as winners. The same programme at 99% produces roughly 1.2. Five times fewer false positives compounding into the baseline over 24 months. False positives don't just fail to lift conversion; they actively roll back the baseline because the production effect is usually negative, the dashboard never re-checks, and the next test compares its new variant against an already-degraded control.

The 99% trade-off is real. A 99% test needs roughly 50% more traffic and 50% more time to call winners than a 95% test. For sites with sufficient traffic (above roughly 30,000 monthly sessions per test surface), the trade-off is worth it on every metric except impatience. For under-traffic sites, the answer is the under-traffic protocol below, not running a 95% test you don't trust.

The under-traffic protocol (sub-30,000 sessions per test surface)

For sites that can't hit the sample-size threshold inside a sensible test window, pure A/B testing isn't the right tool. The protocol changes but does not relax. 95% threshold plus 1,000 unique sessions per variant plus an independent confirmation signal (a heatmap pattern, a customer-interview finding, a support-ticket cluster, a survey response). The finding gets logged as directional, not as a winner. Vocabulary distinction matters. Directional findings inform the next hypothesis. Winners ship to production.

If you can't gate a finding at 99% with a real sample size, don't dress it up as a winning A/B test in your reporting. Call it directional, ship it as a calculated bet if the qualitative signal is strong, and re-test once traffic catches up. The discipline isn't "always run a 99% test." The discipline is "always know which evidence band you're operating in, and never pretend you're in a higher band than the data supports."
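The evidence bands described above can be written down as a simple gate. This is a minimal sketch of the rule as stated in this section, not a statistical method in its own right; the function name and the "inconclusive" label for everything that meets neither bar are assumptions added for the example.

```python
def evidence_band(confidence: float, sessions_per_variant: int,
                  required_sample_size: int, has_independent_signal: bool) -> str:
    """Classify a test result into the evidence band the data supports."""
    # Winner: 99% threshold reached on the pre-calculated sample size.
    if confidence >= 0.99 and sessions_per_variant >= required_sample_size:
        return "winner"
    # Directional: 95% threshold, at least 1,000 sessions per variant,
    # plus an independent signal (heatmap, interview, support tickets, survey).
    if confidence >= 0.95 and sessions_per_variant >= 1_000 and has_independent_signal:
        return "directional"
    # Hypothetical third label for results that clear neither bar.
    return "inconclusive"


print(evidence_band(0.992, 65_000, 62_400, has_independent_signal=False))
print(evidence_band(0.96, 1_800, 62_400, has_independent_signal=True))
print(evidence_band(0.96, 1_800, 62_400, has_independent_signal=False))
```

Writing the gate down explicitly makes it harder to promote a directional finding to a winner in the monthly report.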

The tools that enforce significance properly

Not every A/B testing platform handles sample-size and significance the same way. Some default to fixed-sample-size frequentist tests with hard-stop thresholds (the safest pattern for naive users). Some default to Bayesian frameworks that allow continuous monitoring without false-discovery inflation (more flexible, requires understanding what "P(B beats A)" means). Some implement sequential testing under the mSPRT framework Johari et al. published (the academic-rigour pattern). All three are defensible; the dangerous default is fixed-sample-size frequentist testing with the dashboard pinging "winner" at any peek before threshold.

I cover this in more depth in the 2026 A/B testing tools review, which compares VWO, Convert, AB Tasty, Optimizely, Statsig and Eppo on exactly this dimension. The short version: if you're picking a tool in 2026, pick one whose default statistical mode matches the testing discipline you actually want to run, and override the platform default to 99% on day one regardless.

Optimizely's Stats Engine implements the mixture sequential probability ratio test from Johari, Pekelis and Walsh's KDD 2017 paper specifically to handle the peeking problem (Johari et al., 2017). If your A/B testing platform doesn't tell you which statistical framework it uses, assume fixed-sample-size frequentist and override the default to 99% on every test.

EXCLUSIVE: What Build Grow Scale's 347 founders learned about CRO

Here is the bit no other CRO blog will tell you, because no other CRO blog has read the underlying data the way I have. Build Grow Scale's 2026 review didn't survey 347 random stores. It surveyed 347 stores doing $300K to $8M a month in revenue, a band that captures almost every founder who has ever called me asking "is CRO worth it." If you're inside that band, you are the founder this research was written about. Worth knowing what it actually found.

The headline number is the 4-to-34 differential: self-serve AI tools delivered 4-7% lift, expert-guided AI on the same tools delivered 28-34%. Most blogs stop there. The interesting finding sits one layer below.

Pull the two bands apart and the underlying mechanics get sharper. Three data points worth singling out.

Data point 1: The 4-7% DIY band is a ceiling, not a starting point. Stores that ran DIY AI tools for 12+ months did not migrate up the band over time. The ceiling held. The tools were not "learning their store" into a higher-lift outcome. Whatever the tool could do on day one was what it could do on day 365. If you've been running a DIY AI testing tool for a year and you're still in the single-digit-lift band, that's the structural ceiling, not a slow start.

Data point 2: The 28-34% expert-guided band held across platforms. Stores using Optimizely, VWO, AB Tasty, Convert, and Eppo all landed inside the 28-34% band when a trained CRO expert wrote the hypotheses. The platform was not the variable. The discipline was. Cancelling one platform to subscribe to another did not move the band. Hiring or training a CRO expert did.

Data point 3: The differential compounded over engagement length. The 4-7% and 28-34% bands are the annual lift figures. Compounded across a 24-month engagement, the 28-34% band produces roughly 64-80% baseline lift versus 8-15% for the DIY band. The 4-to-34 differential at 12 months becomes roughly an 8-to-80 differential at 24 months, a gap of 50 or more percentage points. Length of engagement multiplies the gap.
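The 24-month figures follow from compounding the annual bands. The sketch below assumes the same annual lift repeats in year two and compounds multiplicatively on the new baseline; that assumption, and the function name, are illustrative rather than taken from the underlying research.

```python
def compounded_lift(annual_lift: float, years: int) -> float:
    """Total baseline lift after compounding the same annual lift each year."""
    return (1.0 + annual_lift) ** years - 1.0


bands = {
    "DIY AI tools": (0.04, 0.07),
    "Expert-guided": (0.28, 0.34),
}

for name, (low, high) in bands.items():
    low_24 = compounded_lift(low, 2)
    high_24 = compounded_lift(high, 2)
    print(f"{name}: {low:.0%}-{high:.0%} a year -> {low_24:.0%}-{high_24:.0%} over 24 months")
```

The gap between the two bands widens in absolute terms each year because the larger lift is applied to an already larger baseline.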

What separated the 28-34% stores from the 4-7% stores

It wasn't budget. It wasn't team size. It wasn't tooling sophistication. The differentiator was who wrote the test hypothesis. When the AI tool wrote the hypothesis (a "test ideas" panel, a dashboard suggestion, a generative variant pack), the store landed in the 4-7% band. When a trained CRO expert wrote the hypothesis and used the AI tool to execute, the same store landed in the 28-34% band.

That maps onto something I've watched for 13 years on the engagement side. The platforms have got much better. The tools have got much smarter. The thing that has barely moved is the quality of the hypothesis at the top of the funnel. That's where the differential lives, and that's where almost every "AI CRO" pitch you'll hear in 2026 is silent.

What the 347-store data tells you to do as a founder

Three concrete reads. First: the platform you've subscribed to is probably fine. VWO, Convert, AB Tasty, Optimizely all sit inside the same band when used the same way. Cancelling one and re-subscribing to another rarely moves the result.

Second: the discipline that produces the 28-34% band is human, not technological. The 80-hour DTC founder cannot run it without help, because the discipline requires uninterrupted focus on hypothesis design, sample-size discipline, and failure-mode tagging across every test cycle.

Third: the agencies that genuinely produce the 28-34% band are the ones who lead with their hypothesis-design discipline, not their tech stack. If you've sat through a sales pitch recently that focused on the agency's "AI tooling," "machine learning models," or "proprietary platform" without ever describing how the agency writes hypotheses, you've been pitched on the 4-7% band even if the agency itself doesn't realise it.

The single highest-signal question to ask any CRO agency in 2026: "Show me how you wrote the winning hypothesis on your last three engagements." If the answer is the dashboard, the tool, or the AI, you're being sold the 4-7% band. If the answer is a named CRO expert who walked through the audience evidence, you're being sold the 28-34% band.

This is the framing I use on every audit call. It's not in any vendor's marketing copy because the framing rules out the vendor that runs on tool-set hypotheses, and ruling out 60% of the agency market is bad for the agency market. The framing is good for you. The full argument sits in the 4-to-34 Gap framework page and the underlying research is in the State of CRO 2026 deep-research piece.

EXCLUSIVE: Five founder receipts from GoGoChimp's client roster

Receipts matter to founders more than they matter to anyone else, because we've all sat through pitches that turned out to be theatre. Five engagements where the discipline ran end-to-end. Different sectors. Different scopes. Different time horizons. The discipline did not change. Full receipts at gogochimp.com/case-studies.

Receipt 1: Enzymedica UK, Black Friday 2021 (Shopify, health supplements)

Sector: Health supplements, Shopify, owned by TMC Ventures Europe Ltd. Relationship length: 13 years end-to-end. Baseline: 3.4% site conversion.

Three CRO wins compounded across a 30-day window from 5 December 2021 to 5 January 2022, one of the worst calendar months for health-supplement sales.

Outcome: 3.4% baseline to 16.9% on Black Friday (a 4.97x lift) and 11% sustained through December. The prior year's Black Friday converted at around 7%, which means the 16.9% is a 2.4x lift on the same promo day with the same product line. Three compounded CRO wins, not a single-day spike.

Every hypothesis was expert-set against a documented revenue-impact ranking. Sample size was hit before any read. The 99 Rule gated every winner-call. Failure-modes from earlier December tests informed the Black Friday hypotheses directly. The full analytics walkthrough is on Loom (Enzymedica analytics review).

This is what real CRO discipline looks like when all four layers fire on the highest-stakes commercial day of the year. Arnie Liepa, the owner, summarised it the morning after: "It seems to have gone pretty darned well, slightly better than I expected, so thanks to you and Leyla for that." Quiet endorsements from clients who have seen 13 years of work are worth more than loud endorsements from clients who have seen 13 days.

Receipt 2: BeeFRIENDLY Skincare (Ezra Firestone brand, 2017, DTC Shopify)

Sector: DTC skincare, Shopify. Baseline: $48,000/year on the baseline funnel. Engagement fee: $3,000.

Layer 1 of the testing discipline picked image-weight reduction on the highest-traffic landing pages, hypothesised against revenue-per-visitor. Layers 2 and 3 gated sample size and the 99% confirmation before deployment. Layer 4 logged the unsuccessful pre-fix variants into the engagement's hypothesis library.

Outcome: bounce rate 82.04% to 38.4%. Per-visitor value $1.28 to $29.03. Annual revenue $48,000 to $1,447,225, a roughly 30x revenue multiplier from a 2.24-second page-speed reduction. Numbers held for at least six months. Public case study video (client anonymised in the public version): youtu.be/z2bjGvAkqn0.

The intervention was page-speed engineering. The reason it scaled to 30x revenue was that the discipline was applied to the speed work as a CRO hypothesis, not as an IT ticket. Speed mattered because it was the highest-revenue-impact lever on the audience. Layer 1 found that. The other three layers gated it.

The underlying revenue-elasticity is documented in Google and Deloitte's "Milliseconds Make Millions" study (Google & Deloitte Digital, 2020), which analysed 30 million user sessions across 37 European and American brand sites: every 0.1 seconds of mobile load-speed improvement increased conversion by 8.4% in ecommerce, 10.1% in travel, and 3.6% in luxury. BeeFRIENDLY's 2.24-second reduction sits at the high end of that elasticity curve because the site was failing every Core Web Vital before the engagement.

Receipt 3: Super Area Rugs (Shopify, home goods)

Sector: Home goods, Shopify. Intervention: Above-the-fold value-proposition copy, mobile-first.

The above-the-fold value proposition was the bottleneck. Layer 1 identified that the page hero was answering "what is this product" instead of "why should I buy from you rather than the eight other rug retailers I just searched." The hypothesis was anchored on first-time mobile visitors, the lowest-context segment, and predicted a benefit-led headline would lift revenue per visitor.

Outcome: 216.29% revenue increase in 37 days. Almost entirely on above-the-fold copy work. No discount codes, no urgency banners, no platform migration. A single hypothesis well-set and properly gated produced a 3x revenue swing inside six weeks.

This is the canonical Layer 1 receipt. The platform stayed the same. The product range stayed the same. The traffic source stayed the same. What changed was what the page was saying, and what the page was saying changed because a trained CRO expert had decided what to test first.

Receipt 4: Donate For Charity (US non-profit, donation conversion)

Sector: US non-profit donation platform. Conversion event: Donations, not revenue. Intervention: Donation-form amount-anchor restructure.

The audience was warm donors arriving from email and search, but the donation form was leaking them at the amount-selection step. Hypothesis: the default amount anchor was set against industry-average benchmarks rather than this organisation's audience, and warm donors were defaulting to the smaller end of the anchor range.

Outcome: 494.64% more donations in 30 days. The intervention was the donation form, not the page around it. Layer 1 hypothesis quality made the difference. Layer 4 logged the pre-fix variants so the donation-form failure mode is now part of the GoGoChimp engagement library and informs every non-profit engagement that follows.

Non-profits sit outside the typical CRO sales-pitch coverage because the agency model assumes ecommerce revenue. The discipline transfers. The receipt is in the case-studies archive alongside the ecommerce receipts because the discipline is the same.

Receipt 5: EM360 (B2B SaaS, lead generation)

Sector: B2B SaaS lead generation. Baseline: 0.12% conversion, below the floor most CRO agencies will engage on. Intervention: Top-of-funnel qualification structure.

Layer 1 hypothesised that the page was failing to qualify the visitor against the product before asking for the lead, and that adding qualification structure at the top of the funnel would lift the conversion rate even though it would reduce raw form-fills.

Outcome: 0.12% to 7% in 30 days. A 58x lift on the conversion rate, against the prior trend, with the same traffic source and ad spend. The full discipline was applied: expert-set hypothesis, sample-size discipline (the 1,000-unique-session protocol for under-traffic sites was used here), 99% confirmation, failure-mode logging.

What the five engagements share

Different sectors. Different time horizons. Different scopes. The compounding pattern is the same: every test gated by expert-set hypothesis, sample-size discipline, 99% statistical significance, failure logged. The stack changed (Shopify for most of the stores here; VWO, Convert and Optimizely as testing tools across the wider engagement library). The discipline did not.

That is the answer to "what does GoGoChimp actually do differently." The honest answer is not "we run A/B tests." Every agency runs A/B tests. The answer is the discipline above, applied repeatedly, on every client, every time.

Across five engagements spanning DTC skincare, supplements, home goods, B2B SaaS, and non-profit donation, the lift on the relevant conversion metric ran from roughly 200% (Super Area Rugs) to roughly 5,700% (EM360's 58x). The constant was the testing discipline. What varied was every other factor a sales pitch would highlight.
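The receipts quote results two ways: as multipliers ("58x") and as percentage increases ("216.29%"). A minimal Python sketch, using only figures quoted in the receipts, puts them on one scale. The function names are illustrative and not part of any GoGoChimp tooling.

```python
def multiplier(before: float, after: float) -> float:
    """How many times larger the 'after' figure is than the 'before' figure."""
    return after / before


def percent_increase(before: float, after: float) -> float:
    """Relative increase expressed as a percentage of the 'before' figure."""
    return (after - before) / before * 100


def multiplier_from_percent(increase_pct: float) -> float:
    """A 100% increase is a 2x multiplier, so add 1 to the fractional lift."""
    return 1 + increase_pct / 100


# Before/after pairs quoted in the receipts.
before_after = {
    "Enzymedica UK (conversion rate, %)": (3.4, 16.9),
    "BeeFRIENDLY (annual revenue, $)": (48_000, 1_447_225),
    "EM360 (conversion rate, %)": (0.12, 7.0),
}
# Receipts quoted only as a percentage increase.
percent_only = {
    "Super Area Rugs (revenue)": 216.29,
    "Donate For Charity (donations)": 494.64,
}

for name, (before, after) in before_after.items():
    print(f"{name}: {multiplier(before, after):.2f}x, "
          f"+{percent_increase(before, after):,.0f}%")
for name, pct in percent_only.items():
    print(f"{name}: {multiplier_from_percent(pct):.2f}x, +{pct:,.0f}%")
```

The distinction matters when comparing pitches: a "216% increase" is a 3.16x multiplier, not 2.16x, because the percentage is added on top of the original figure.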

Three first steps you can take this week without hiring anyone (how to start CRO)

If the discipline above reads like what you want and you're not ready to engage an agency, three concrete steps will move your programme into the upper band without spending a pound on services. Do them this week. This is the answer to "how to start CRO" and "how to improve conversion rate" without a budget.

Step 1: Set your testing platform default to 99% confidence (90 seconds, no vendor required)

Open your testing platform settings and change the winner-call threshold from 95% to 99%. This is the single highest-impact discipline shift in this primer. It takes ninety seconds.

VWO, Convert, AB Tasty, Optimizely all support 99% statistical significance as a project-level default. Most ship with 95% as the out-of-the-box setting. In the platform settings, find the experimentation defaults, change the winner-call threshold, and save.

Your future tests now need around 50% more traffic and 50% more time to call winners. They also produce five times fewer false positives compounding into the baseline. Every test you run from this point forward is structurally protected against the most expensive failure mode in CRO. Ninety seconds. No vendor required.
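The "around 50% more traffic" figure can be sanity-checked with a few lines of Python. This is a sketch under stated assumptions: a two-sided test, 80% power, and the usual normal approximation, in which required sample size is proportional to the square of the summed z-scores. Your platform's statistics engine may use a different method, so treat the output as a rough guide. The function name is illustrative.

```python
from statistics import NormalDist


def relative_sample_cost(alpha_new: float, alpha_old: float, power: float = 0.80) -> float:
    """Ratio of required sample sizes when the significance level is tightened.

    Under the normal approximation, sample size is proportional to
    (z_{1 - alpha/2} + z_{power})^2, so baseline rate and MDE cancel out.
    """
    z = NormalDist().inv_cdf
    z_power = z(power)
    new = (z(1 - alpha_new / 2) + z_power) ** 2
    old = (z(1 - alpha_old / 2) + z_power) ** 2
    return new / old


# 99% confidence is alpha = 0.01; 95% confidence is alpha = 0.05.
ratio = relative_sample_cost(alpha_new=0.01, alpha_old=0.05)
false_positive_factor = 0.05 / 0.01

print(f"Traffic needed at 99% relative to 95%: {ratio:.2f}x")
print(f"False-positive rate falls by a factor of {false_positive_factor:.0f}")
```

Because the ratio does not depend on your baseline or MDE, the extra traffic cost of moving to 99% is the same proportion for every test you run.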

Step 2: Open a Failure Log and back-fill it with your last 10 losing tests (30 minutes)

Create a four-column document titled "Failure log" and back-fill it with the losing tests you've already run. The four columns: hypothesis, assumption, result, what didn't fire.

Go back through your last 10 losing tests. Fill in each row. The hypothesis you ran. The audience-mechanism assumption underneath it. The actual result. The reason the mechanism didn't fire.

A pattern emerges in the back-fill. Three tests where the assumption was about price-sensitivity but the data implied something else. Two tests where the assumption was about mobile UX but the loss was on desktop. The pattern is the sharper next hypothesis. The dashboard would never have surfaced it because the dashboard discarded the losing variants the moment they were called.

The back-fill almost always surfaces a usable hypothesis more revenue-relevant than anything in the current test pipeline. The discipline is free. The artefact compounds.
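If you would rather keep the log as a file than a document, the four columns map onto a small Python structure. This is a minimal sketch: the three rows are hypothetical examples, and the keyword-based theme count is an assumed shortcut for spotting patterns, not a GoGoChimp method.

```python
import csv
from collections import Counter
from dataclasses import asdict, dataclass, fields
from io import StringIO


@dataclass
class FailureLogRow:
    hypothesis: str
    assumption: str
    result: str
    what_didnt_fire: str


# Hypothetical entries for illustration only.
log = [
    FailureLogRow(
        hypothesis="Free-shipping banner lifts checkout starts",
        assumption="price sensitivity on delivery cost",
        result="no significant change",
        what_didnt_fire="visitors left before delivery cost was shown",
    ),
    FailureLogRow(
        hypothesis="Bundle discount lifts order value",
        assumption="price sensitivity on multi-buy",
        result="lost at 99%",
        what_didnt_fire="buyers wanted one item, not a cheaper bundle",
    ),
    FailureLogRow(
        hypothesis="Sticky add-to-cart lifts conversion",
        assumption="mobile UX friction on long pages",
        result="lost on desktop, flat elsewhere",
        what_didnt_fire="the friction was not on small screens",
    ),
]

THEMES = ("price", "mobile", "trust")


def count_themes(rows, themes=THEMES) -> Counter:
    """Count losing tests whose assumption mentions each theme keyword."""
    counts = Counter()
    for row in rows:
        assumption = row.assumption.lower()
        counts.update(theme for theme in themes if theme in assumption)
    return counts


def to_csv(rows) -> str:
    """Serialise the log with the four column names as the header row."""
    buffer = StringIO()
    names = [f.name for f in fields(FailureLogRow)]
    writer = csv.DictWriter(buffer, fieldnames=names)
    writer.writeheader()
    writer.writerows(asdict(row) for row in rows)
    return buffer.getvalue()


print(count_themes(log).most_common())
print(to_csv(log))
```

The theme count is only a prompt. The sharper hypothesis still comes from reading the "what didn't fire" column row by row.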

Step 3: Calculate minimum sample size before launching your next test (10 minutes per test)

Compute the minimum sample size for your next test using a free power calculator, write it on the brief, hold the line. Pick any free power calculator. Optimizely publishes one. VWO publishes one. Evan Miller's calculator is the academic-rigour benchmark. GoGoChimp's own sample size calculator is tuned for the 99% threshold and the under-traffic protocol. Three inputs across all of them, alongside the significance threshold from Step 1: baseline conversion rate, minimum detectable effect (typically 10% relative lift), statistical power (80% standard).

The output is a number, the minimum sample size per variant. Write that number on the test brief. Tape it to your monitor if it helps. Do not read the dashboard until the number is hit.

This is the single discipline most CRO programmes fail to enforce, and the failure costs more than any other. The ninety seconds in Step 1 protects against false positives at the threshold layer. The sample-size discipline in Step 3 protects against false positives at the sample layer. Both have to fire on every test for the programme to be statistically honest.
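For founders who want to see what a power calculator does with those inputs, here is a minimal Python sketch. It assumes the common two-sided, two-proportion normal approximation. The source does not confirm which formula any of the named calculators use, so expect their answers to differ by a few percent. The function name and defaults are illustrative.

```python
from math import ceil
from statistics import NormalDist


def min_sample_per_variant(
    baseline: float,
    relative_mde: float,
    alpha: float = 0.01,   # 99% threshold, matching Step 1
    power: float = 0.80,
) -> int:
    """Visitors needed in EACH variant to detect the given relative lift.

    baseline      conversion rate as a fraction (0.025 means 2.5%)
    relative_mde  minimum detectable effect as a relative lift (0.10 means 10%)
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)          # two-sided significance
    z_power = z(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# Example inputs: 2.5% baseline, 10% relative lift, 99% threshold, 80% power.
n = min_sample_per_variant(baseline=0.025, relative_mde=0.10)
print(f"Minimum sample per variant: {n:,}")
```

Whatever number comes out, it is per variant. A two-variant test needs twice that in total traffic before anyone reads the dashboard.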

Three steps. No agency. No invoice. The discipline moves your programme out of the 4-7% band before you've had a single sales call.

Where to go from here

If you've read this far, you have the benchmark and the discipline covered. The next move depends on what you're trying to do.

If you're starting cold and want the definitional foundation, the pillar page is What is conversion rate optimisation? CRO explained for 2026. It covers what counts as a conversion, micro versus macro conversions, the formula, the history, and how the discipline sits next to SEO and paid acquisition.

If you want the A/B testing sample size calculator I referenced in Section 4, it's live at gogochimp.com/tools/cro-sample-size-calculator. Defaults to 99% significance and 80% power. Plug your baseline conversion rate and MDE in, get the visitors-per-variant number out, write it on your test brief.

If you want to pick an A/B testing platform that actually enforces the discipline above, the 2026 review is at the best A/B testing tools 2026 page. It compares VWO, Convert, AB Tasty, Optimizely, Statsig and Eppo on the statistical-framework dimension, not the marketing-feature dimension.

If you want the full framework for the 4-to-34 differential, the standalone page is at the 4-to-34 Gap framework page and the underlying methodology is at gogochimp.com/methodology.

If you want to see if GoGoChimp's engagement model fits your situation, the proof is at gogochimp.com/case-studies and the free CRO audit at gogochimp.com/audit takes 48 hours from submission to a delivered Loom walkthrough. I read every submission personally. The audit will name the bottleneck, name the buying path that fits, and tell you whether the right move is a productised engagement, a retainer, or none of the above. Some audits end with "don't hire anyone, fix this one Liquid file and run more ads." That is the audit doing its job.

Buy when buying is the answer. Not before.

Frequently Asked Questions

Is 10% a good conversion rate?

Yes. A 10% conversion rate is top decile across most ecommerce and SaaS categories. The global ecommerce average sits between 2.5% and 3%, with 5% marking the top quartile of DTC stores. A 10%+ conversion rate is realistic in three specific contexts: high-repeat consumables with subscription mechanics (skincare refills, supplements, coffee), single-product narrow-niche stores where the offer is unambiguous, and event or digital-product offers with natural scarcity. Enzymedica UK hit 16.9% on Black Friday 2021 on the UK-only segment; that was a category-friendly product on a CRO-disciplined engagement. 10%+ is unusual but not mythical. If you're hitting it sustainably on a normal mixed-traffic ecommerce store, you're outperforming your category.

What is a good conversion rate for ecommerce?

2.5-3% is the global ecommerce average; 5% or higher is top quartile; 10%+ is top decile. The exact figure varies by sub-category (apparel 1.8-2.4%, beauty/health 2.8-3.5%, consumer electronics 1.4-2%) and by traffic source (branded organic converts at roughly 3x the rate of cold paid social). Below the average for your sub-category, the floor is the place to start. At the average, the next 30% lift is in checkout flow and value-proposition copy. Above 5%, average order value and revenue-per-visitor become the more honest metrics than rate alone.

What is a good website conversion rate?

A good website conversion rate depends on what you're asking the visitor to do. For ecommerce, 2.5-3% is average and 5%+ is good. For B2B SaaS demo requests, 2-3% is average and 5-7% is good. For landing pages with a single offer, 6.6% is the cross-vertical median (Unbounce, 2024) and 10%+ is top quartile. For email sign-up forms via popup, 1.95% is the global average and 5%+ is top quartile (Sumo). The benchmark only matters as a reference line. The real goal is to beat your last quarter in your category on your real traffic mix.

How many visitors do I need to A/B test?

For a 2.5% baseline and a 10% relative minimum detectable effect at 95% confidence and 80% power, roughly 64,000 visitors per variant on the standard two-sided, two-proportion calculation. The required sample size falls as the baseline conversion rate rises, and falls with the square of the minimum detectable effect. A 0.5% baseline trying to detect a 10% relative lift needs over 300,000 visitors per variant; a 10% baseline trying to detect a 20% relative lift needs under 4,000. Run any free sample size calculator (Evan Miller's, Optimizely's, VWO's, or GoGoChimp's) with your real baseline, MDE and power inputs before launching the test. If the answer is "more visitors than you'll see in three months," don't run a pure A/B test; use the under-traffic protocol in Section 4 instead.
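The three-month check is simple arithmetic once a calculator has given you a per-variant number. A minimal Python sketch follows. The function names, the 13-week stand-in for "three months", and the traffic figures are all illustrative assumptions.

```python
from math import ceil

MAX_WEEKS = 13  # assumed stand-in for "three months"


def weeks_needed(sample_per_variant: int, variants: int, weekly_sessions: int) -> int:
    """Whole weeks of traffic needed to fill every variant."""
    return ceil(sample_per_variant * variants / weekly_sessions)


def fits_pure_ab_test(sample_per_variant: int, variants: int, weekly_sessions: int) -> bool:
    """True when the test can reach its sample inside the three-month window."""
    return weeks_needed(sample_per_variant, variants, weekly_sessions) <= MAX_WEEKS


# Hypothetical calculator output: 20,000 visitors per variant, two variants.
for weekly_sessions in (7_000, 2_000):
    weeks = weeks_needed(20_000, 2, weekly_sessions)
    verdict = "pure A/B test" if fits_pure_ab_test(20_000, 2, weekly_sessions) else "under-traffic protocol"
    print(f"{weekly_sessions:,} sessions/week: {weeks} weeks -> {verdict}")
```

Use sessions on the page under test, not sitewide sessions, as the weekly figure.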

What's the difference between 95% and 99% statistical significance?

At 95% confidence, a variant with no real effect will still be called a winner about 1 time in 20; at 99%, about 1 time in 100. Across a 120-test annual programme, in the worst case where no variant has a real effect, 95% produces roughly six false positives shipped as winners; 99% produces roughly 1.2. The trade-off is that a 99% test needs around 50% more traffic and 50% more time to call winners. Over a 24-month engagement, that gap is the difference between a baseline that grows and one that drifts as rolled-back winners compound. GoGoChimp runs every winner-call at 99% as standard; the 95% platform default is the most expensive setting in CRO at programme scale.
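The six-versus-1.2 figures are expected values, and the arithmetic fits in a few lines of Python. This sketch assumes the worst case by default: every variant tested has no real effect, so every "winner" is a false positive. The `share_with_no_real_effect` parameter is a hypothetical knob for relaxing that assumption.

```python
def expected_false_positives(
    tests_per_year: int,
    alpha: float,
    share_with_no_real_effect: float = 1.0,
) -> float:
    """Expected number of no-effect variants wrongly called as winners.

    alpha is the false-positive rate per test (0.05 at 95%, 0.01 at 99%).
    """
    return tests_per_year * share_with_no_real_effect * alpha


at_95 = expected_false_positives(tests_per_year=120, alpha=0.05)
at_99 = expected_false_positives(tests_per_year=120, alpha=0.01)

print(f"Expected false positives at 95%: {at_95:.1f}")
print(f"Expected false positives at 99%: {at_99:.1f}")
```

If some share of your variants do have a real effect, both numbers shrink in proportion, but the five-to-one ratio between the two thresholds stays the same.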

Why isn't my CRO working?

Almost certainly one of four failure modes from Section 2: hypothesis poverty, sample-size impatience, 95% confidence shipping, or failure-mode logging skipped. Run a diagnostic: are your tests written as audience-anchored hypotheses or as "let's try this"? Is the minimum sample size computed and written on the brief before launch? Is the winner-call threshold set to 99% on every test? Is there a failure log that gets read every quarter? If the answer to any of those is no, you've found the leak. Fix the one that scores lowest first; the rest get easier from there.

How is CRO different from SEO or paid ads?

SEO and paid ads are acquisition disciplines. CRO is a conversion discipline. SEO and ads bring traffic to the site. CRO takes the traffic you already have and converts more of it into revenue. The three work in sequence: SEO and ads fill the funnel, CRO closes more of it. Below roughly 30,000 monthly sessions on the page you want to test, spend on acquisition first.

When should I NOT invest in CRO?

Don't buy CRO if any of these are true: you're under £50K/month revenue, your traffic is below roughly 30,000 monthly sessions on the page you want to test, you've replatformed in the last 60 days, or you don't have GA4 and Microsoft Clarity properly installed. Below those thresholds the maths doesn't work, and the right move is acquisition or analytics infrastructure first.

Can I do CRO in-house?

Yes, if you have the bandwidth, the discipline, and the seniority. A product or growth function with bandwidth to manage testing weekly, the discipline to ship implementation when a variant wins, and the seniority to set expert hypotheses rather than letting the dashboard generate them. The starting points are defaulting your testing platform to 99% confidence, opening a failure log of your last 10 losing tests, and calculating minimum sample size before every launch. SaaS founders with a product team and ecommerce founders with a tech-led culture are the natural fit.

How long until I see results from CRO?

Productised work pays back inside 30-60 days at scale. Retainer programmes show compounding signal at months 2-3 and full ROI by month 6. Productised page-speed engagements pay back within 30-60 days at the £200K+/month revenue bracket. Productised headline work pays back within 30 days when traffic is sufficient. Below £100K/month revenue, payback windows extend. Below £50K/month the maths rarely closes at all.

What's the difference between OperatorAI and OpenAI's Operator agent?

Different products that share a name by accident. OperatorAI is GoGoChimp's CRO methodology, a structural discipline for running expert-guided AI testing. OpenAI's Operator is an autonomous web agent product released January 2025 that browses the web and completes tasks on behalf of users. They do entirely different things. If a CRO vendor is leaning on OpenAI's Operator as part of their CRO pitch, they are selling something other than CRO.

References

Stafford, M. (9 April 2026). 2026 CRO Year in Review: What Worked, What Failed, What's Next. Build Grow Scale. https://buildgrowscale.com/cro-trends-2026-recap

Miller, E. (2010). How Not to Run an A/B Test. https://www.evanmiller.org/how-not-to-run-an-ab-test.html

Miller, E. (2014). A/B Testing Sample Size Calculator. https://www.evanmiller.org/ab-testing/sample-size.html

Johari, R., Pekelis, L., & Walsh, D. J. (2017). Peeking at A/B Tests: Why it matters, and what to do about it. KDD 2017. https://dl.acm.org/doi/abs/10.1145/3097983.3097992

Zaltman, G. (2003). How Customers Think: Essential Insights into the Mind of the Market. Harvard Business School Press.

Unbounce. (2024). Conversion Benchmark Report. https://unbounce.com/conversion-rate-optimization/

Baymard Institute. (2026). Cart Abandonment Rate Statistics (50-study meta-analysis). https://baymard.com/lists/cart-abandonment-rate

Baymard Institute. (2026). Checkout Usability Research. https://baymard.com/research/checkout-usability

Google & Deloitte Digital. (2020). Milliseconds Make Millions. Think with Google. https://www.thinkwithgoogle.com/_qs/documents/9757/Milliseconds_Make_Millions_report_hQYAbZJ.pdf

Statista. (2026). Global ecommerce conversion rate benchmarks by vertical. Statista Digital Commerce Reports.

McCarron, C. (2021). Enzymedica UK analytics walkthrough (Black Friday 2021 / December 2021 engagement). Loom. https://www.loom.com/share/d20fd92f4d5e49a88a92c9c0d5e28570

GoGoChimp. (2017). BeeFRIENDLY Skincare page-speed case study video (anonymised). YouTube. https://youtu.be/z2bjGvAkqn0

Jacobson, A. (April 2026). Verified Trustpilot review of GoGoChimp (Affordable Golf engagement). Trustpilot. https://uk.trustpilot.com/reviews/6a02e7324675140e5f1f6d7c

GoGoChimp. (2026). What is conversion rate optimisation? CRO explained for 2026. https://www.gogochimp.com/blog/what-is-conversion-rate-optimisation-cro-explained-for-2026

GoGoChimp. (2026). CRO sample size calculator. https://www.gogochimp.com/tools/cro-sample-size-calculator

GoGoChimp. (2026). Best A/B testing tools 2026. https://www.gogochimp.com/blog/best-ab-testing-tools-2026

GoGoChimp. (2026). The 99 Rule. https://www.gogochimp.com/framework/99-rule

GoGoChimp. (2026). The 4-to-34 Gap. https://www.gogochimp.com/blog/ai-cro-4-to-34-percent-gap

GoGoChimp. (2026). The State of CRO 2026. https://www.gogochimp.com/blog/state-of-cro-2026

GoGoChimp. (2026). OperatorAI methodology page. https://www.gogochimp.com/methodology

GoGoChimp. (2026). Case studies archive. https://www.gogochimp.com/case-studies

About the author

Chris McCarron founded GoGoChimp in June 2013 and has thirteen years of hands-on CRO experience across ecommerce, SaaS, and non-profit. He is the creator of OperatorAI, GoGoChimp's CRO methodology, which is distinct from OpenAI's Operator agent product released January 2025. Nominated for Digital Doughnut Digital Marketing Agency of the Year 2021. Based in Glasgow, Scotland; works with clients across the UK, EU, and US.

Client outcomes include BeeFRIENDLY Skincare ($48,000/year to $1,447,225/year), Enzymedica UK (3.4% to 16.9% Black Friday conversion), Super Area Rugs (216% revenue increase in 37 days), EM360 (0.12% to 7% B2B conversion in 30 days), and Donate For Charity (494.64% donation lift in 30 days). Full case studies at gogochimp.com/case-studies. Services at gogochimp.com/services. Free CRO audit at gogochimp.com/audit.

