PILLAR
A/B testing is the operator’s version of opinion settlement. Two variants, statistical significance, and a winner the data picks — not the highest-paid person in the meeting.
Most teams fail at A/B testing in one of three ways: they run too few tests to find winners, they stop tests at the wrong moment, or they test on traffic volumes too low for the results to mean anything. The industry average is 2–3 tests per quarter. The testing programmes that transform businesses run 30+.
Every post walks through one part of that work: hypothesis prioritisation, sample-size maths, multi-variate vs A/B, reading interaction effects, and shipping winners without regressions.
DEFINITION
A/B testing is a controlled experiment that splits one audience between two versions of the same page, email, ad, or product feature — Version A (the control) and Version B (the variant) — then uses a statistical-significance test to decide which version won. The winner is the version that produces more of the target outcome (purchases, signups, clicks, retained users) by an amount large enough that it could not plausibly be random.
In plain English: when you can’t agree which headline, button colour, pricing layout, or onboarding flow is better, you stop arguing and let the people who actually buy from you decide.
The term “A/B testing” is used interchangeably with split testing — they describe the same method. Multivariate testing (MVT) is a different beast: it tests many element combinations at once and needs roughly 5–10× the traffic to reach significance.
PROCESS
The textbook version of A/B testing has six steps. The version that ships winners has eight.
TYPES
| Method | What it tests | Traffic required | Best for |
|---|---|---|---|
| A/B test | One variable, two versions | Moderate (~1,000 conversions/variant) | Single-element changes: headlines, CTAs, pricing labels, hero images |
| Split test (URL split) | Two entirely different page templates served from different URLs | Same as A/B | Full-page redesigns where you can’t variant-swap a single element |
| Multivariate (MVT) | Multiple variables × multiple values, all combinations tested simultaneously | ~5–10× an A/B test | Mature programmes with high traffic that need to find element-interaction effects |
A/B and split tests are the same statistical model — the only difference is where the variant lives (same URL with element swap vs different URL with whole-page swap). The decision is implementation, not method.
Multivariate testing is rarely the right choice for sub-£10M/year sites. The traffic requirement kills it before you reach significance. Most teams who think they need MVT actually need to run sequential A/B tests — find the winning headline, then find the winning button, then find the winning image — each as a separate, fully-powered experiment.
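A rough sketch of why the traffic requirement grows so quickly: a full-factorial multivariate test has to serve every combination of every element. The element names, values, and the weekly visitor figure below are hypothetical, chosen only for illustration.

```python
from itertools import product


def mvt_cells(elements):
    """List every combination a full-factorial multivariate test has to serve."""
    names = list(elements)
    return [dict(zip(names, combo)) for combo in product(*elements.values())]


# Hypothetical elements and values, for illustration only.
elements = {
    "headline": ["control", "benefit-led"],
    "button": ["Buy now", "Start free trial"],
    "image": ["product", "lifestyle"],
}

cells = mvt_cells(elements)
weekly_visitors = 20_000  # assumed traffic figure

print(f"MVT cells: {len(cells)}")
print(f"Visitors per MVT cell per week: {weekly_visitors // len(cells)}")
print(f"Visitors per A/B arm per week: {weekly_visitors // 2}")
```

Each added element or value multiplies the number of cells, so the same traffic is spread thinner. A sequence of single-variable A/B tests keeps every experiment at two arms.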
STATISTICAL SIGNIFICANCE
The single most common A/B testing failure is calling winners on tests that never reached significance. The fix is two numbers, locked in before the test launches: the sample size per variant and the confidence threshold.
The “peeking problem” (looking at a test daily and stopping the moment confidence crosses 95%) can turn a 5% nominal false-positive rate into an effective false-positive rate of 30% or more. The fix is sequential-testing methods (Always Valid Inference, used by Optimizely’s Stats Engine) or Bayesian inference (used by VWO SmartStats), both of which are designed for continuous monitoring.
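The inflation from peeking can be seen in a small A/A simulation, where both arms share the same true conversion rate, so every "significant" result is a false positive. This is a minimal sketch, not a benchmark. The number of looks, visitors per look, conversion rate, and seed are all assumptions, and the size of the inflation depends on how often you look.

```python
import random
from math import sqrt
from statistics import NormalDist


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_aa_tests(n_tests=300, looks=10, visitors_per_look=200,
                      rate=0.10, alpha=0.05, seed=7):
    """Return (false-positive rate with peeking, rate at the final look only)."""
    rng = random.Random(seed)
    peeking_hits = 0
    final_hits = 0
    for _ in range(n_tests):
        conv_a = conv_b = n = 0
        ever_significant = False
        p = 1.0
        for _ in range(looks):
            # Both arms draw from the same true rate: there is no real winner.
            conv_a += sum(rng.random() < rate for _ in range(visitors_per_look))
            conv_b += sum(rng.random() < rate for _ in range(visitors_per_look))
            n += visitors_per_look
            p = p_value(conv_a, n, conv_b, n)
            if p < alpha:
                ever_significant = True  # a peeker would have stopped here
        peeking_hits += ever_significant
        final_hits += p < alpha  # only the pre-planned final look counts
    return peeking_hits / n_tests, final_hits / n_tests


peeking_rate, fixed_rate = simulate_aa_tests()
print(f"False positives when stopping at any significant look: {peeking_rate:.1%}")
print(f"False positives when reading only the final look: {fixed_rate:.1%}")
```

Because the final look is one of the looks, the peeking rate can never be lower than the fixed-horizon rate. Adding more looks widens the gap.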
For most programmes, the simpler fix is: pre-calculate the sample size, set a calendar reminder for the runtime, and don’t open the dashboard until that date.
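The pre-calculation can be sketched with the standard normal-approximation formula for comparing two proportions. The function name, the 80% power default, and the example baseline and uplift are assumptions for illustration. A dedicated calculator may use a slightly different formula and return slightly different numbers.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, relative_mde, alpha=0.01, power=0.80):
    """Approximate visitors needed per variant for a two-sided two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)  # the rate we want to be able to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# Hypothetical inputs: 3% baseline conversion, 20% relative uplift to detect.
n_99 = sample_size_per_variant(0.03, 0.20, alpha=0.01)
n_95 = sample_size_per_variant(0.03, 0.20, alpha=0.05)
print(f"Visitors per variant at 99% confidence: {n_99}")
print(f"Visitors per variant at 95% confidence: {n_95}")
```

Dividing the result by expected daily visitors per variant gives the runtime to put in the calendar. A smaller minimum detectable effect or a stricter confidence threshold both increase the required sample.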
EXAMPLES
Does Netflix use A/B testing? Yes — Netflix is one of the most aggressive A/B testers in software. They publish technical posts on the methodology under “Netflix TechBlog.” Public examples: artwork-personalisation tests where the title-card image varies by viewer history (different posters for different audience cohorts on the same show), autoplay-trailer tests, and the original “Skip Intro” button rollout — all decision-graded on retention and watch-time metrics, not just clicks.
Meta / Facebook Ads: Meta’s “A/B Test” feature inside Ads Manager runs split-traffic experiments across ad sets to find the configuration with the lowest cost per result. Meta applies the same approach to organic experiments: feed-ranking changes are A/B-tested on small user cohorts before global rollout.
The pattern: hypothesis-led, single-variable, 99%-confidence, no peeking. The 4-to-34 Gap framework explains why operator-led A/B testing programmes hit 28–34% lifts while self-serve AI tools cluster at 4–7%.
TOOLS
| Platform | Best for | Pricing model | Notes |
|---|---|---|---|
| Optimizely | Enterprise, multi-product testing | Custom (~£50K+/year) | Stats Engine handles peeking-safe inference |
| VWO | Mid-market ecommerce + SaaS | From ~£300/month | SmartStats uses Bayesian — peeking-safe by default |
| Google Optimize | (Sunset Sept 2023) | — | Migrate to GA4 + a paid tool |
| Convert | Privacy-first, GDPR-strict | From ~£300/month | EU data residency, no third-party cookies required |
| AB Tasty | Enterprise + personalisation | Custom | Strong feature-flagging integration |
| PostHog / Statsig / GrowthBook | Product-led teams | From free | Code-deployed, feature-flag-native, engineer-friendly |
| Mailchimp / Klaviyo | Email-only A/B testing | Included | Subject-line + send-time tests built in |
| Meta Business Suite | Ad-creative + audience splits | Included | Native split-test inside Ads Manager |
The right tool is the cheapest one your team will actually use. A £1,000/month platform that no one logs into produces zero tests. A free PostHog instance owned by an engineer who runs four tests a month produces a 28–34% lift after a year.
DISCIPLINES
A/B testing in digital marketing: subject lines, ad creative, landing-page headlines, audience segments. The variable is messaging; the metric is conversion or cost-per-result.
A/B testing in email marketing: subject line, send time, sender name, body copy, CTA placement. Mailchimp and Klaviyo run these natively against open-rate or click-rate.
A/B testing in Facebook ads / Meta Ads: ad-set splits on creative, audience, placement, or bid strategy. Meta’s native A/B Test feature runs these statistically.
A/B testing in product management: feature rollouts. Show new feature to 50% of users, measure activation/retention impact, decide based on the statistical comparison. Tools: PostHog, Statsig, LaunchDarkly.
A/B testing in data science: the underlying frequentist or Bayesian inference problem — two samples, one test statistic, one p-value or posterior. SQL-driven A/B tests pull conversion counts from event tables and compute significance in-database.
A/B testing in machine learning: champion/challenger model rollouts. Production traffic is split between a current-best model and a candidate, measuring against the business metric the model is meant to move (CTR, conversion, latency, downstream LTV).
A/B testing in QA: controlled rollout of code changes (often called canary deployments or shadow traffic) — different from marketing A/B testing in intent (find bugs vs find lifts) but mathematically identical (split traffic, compare outcomes).
A/B testing in social media: TikTok, YouTube, and Instagram all let creators test thumbnails, hooks, or titles. YouTube’s “Test & Compare” is its native A/B test on thumbnails. Test the hook, not the algorithm.
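The data-science framing in the list above (two samples, one test statistic, one p-value) can be sketched as a pooled two-proportion z-test. This is one common frequentist choice, not the method any particular platform uses, and the conversion counts are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Hypothetical counts, e.g. pulled from an event table with a GROUP BY on variant.
z, p = two_proportion_z_test(conv_a=340, n_a=10_000, conv_b=420, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
print("Significant at 99%" if p < 0.01 else "Not significant at 99%")
```

The result is only trustworthy if it is read once, at the pre-calculated sample size.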
WHEN TO TEST
Why A/B testing matters: every business decision is a hypothesis. A/B testing is the only practical way to falsify a marketing-or-product hypothesis with money on the table. The alternative — HiPPO (Highest-Paid Person’s Opinion) — has roughly the predictive accuracy of a coin flip, except the coin flip is free.
FAILURE MODES
| Failure | What it looks like | Fix |
|---|---|---|
| Peeking | Checking the dashboard daily, stopping the moment confidence hits 95% | Lock the runtime to the pre-calculated sample size; use Bayesian or sequential-testing inference if you must monitor continuously |
| Underpowered tests | “We tested it for two weeks but the result wasn’t significant” | Pre-calculate sample size; if you can’t reach it in 4 weeks, the change isn’t worth A/B testing — ship a qualitative-validated variant instead |
| Testing too many variables | Hero swap that changes headline + image + button + colour all at once | One variable per test; sequence them |
| Ignoring negative results | Only logging the wins | Negative results are evidence — they tell you which lever this audience doesn’t move on. Log them; they compound into a research dataset |
| No hypothesis | “Let’s test this and see what happens” | Hypothesis must be falsifiable: “If we change X to Y, [metric] will increase by ≥ N% because [insight]” |
| Confounded traffic | Running a paid-search campaign concurrent with a homepage test | Hold confounders constant or use a paired-cohort analysis. Calendar lifts (Black Friday, school holidays) wreck tests that do not control for them |
| No downstream audit | Winning variant ships, then refund rate spikes | Audit downstream metrics for 60 days post-ship: LTV, refund rate, NPS, support volume |
Industry average is 2–3 A/B tests per quarter. The programmes that produce 28–34% lifts run 30+ tests per quarter and accept a 70%+ negative-result rate — because the 30% that win compound.
FAQ
A/B testing is showing two versions of the same thing (a page, an email, an ad) to two random groups of people and using a statistical test to decide which version produced more of the result you care about — purchases, signups, clicks, retained users.
Form a hypothesis, calculate the sample size needed at your confidence threshold, build the variant, split traffic 50/50 randomly, let the test reach significance and sample size before deciding, then audit the winner downstream for 60 days.
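The random 50/50 split is often implemented as deterministic hashing, so a returning visitor always sees the same variant. The sketch below assumes a stable visitor ID is available. The function name and the experiment key are hypothetical, and testing platforms handle assignment in their own way.

```python
import hashlib


def assign_variant(user_id: str, experiment: str) -> str:
    """Deterministically map a visitor to 'A' or 'B' with a roughly 50/50 split."""
    # Salting with the experiment name keeps assignments independent across tests.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 100  # integer bucket in 0..99
    return "A" if bucket < 50 else "B"


# Hypothetical experiment key and visitor IDs.
assignments = [assign_variant(f"user-{i}", "homepage-headline") for i in range(10_000)]
print(f"A: {assignments.count('A')}, B: {assignments.count('B')}")
```

Hashing instead of calling a random-number generator on each page view means the assignment can be recomputed anywhere (server, client, or warehouse) without storing it.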
Nothing — they’re synonyms. Some use “split testing” specifically for URL-split tests (two entirely different page templates from different URLs) and “A/B testing” for same-URL element swaps, but the statistical method is identical.
A/B testing for almost everything. Multivariate testing only when you have over 100,000 monthly conversions and specifically need element-interaction effects. Most teams should run sequential A/B tests instead.
Yes, extensively. Netflix publishes their methodology on the Netflix TechBlog and runs tests on artwork personalisation, autoplay behaviour, UI changes, and recommendation algorithms — all decision-graded on retention and watch-time.
Optimizely or AB Tasty for enterprise. VWO or Convert for mid-market. PostHog, Statsig, or GrowthBook for product-led teams. Mailchimp or Klaviyo for email. Google Optimize was sunset in September 2023.
Three reasons: the hypothesis wasn’t strong enough, the test was underpowered, or the test was stopped early by peeking. Fix the hypothesis with user research, fix the sample size with pre-launch calculation, fix peeking with a locked runtime.
Until two conditions are both true: the pre-calculated sample size is reached for each variant, and statistical significance crosses your confidence threshold (we use 99%). For most sites with 1,000+ monthly conversions, that’s 14–28 days.
Evan Miller’s free sample-size calculator and blog, CXL’s “Statistics for A/B Testing” course, the Optimizely and VWO knowledge bases, and Ronny Kohavi’s book Trustworthy Online Controlled Experiments.
Because every business decision is a hypothesis, and A/B testing is the only practical way to falsify a marketing or product hypothesis with money on the line. A site running 30 tests a quarter at a 30% win rate compounds 9 wins per quarter. A site running 2 tests a quarter at the same win rate compounds 0.6.
FREE 99-RULE AUDIT
Free 15-minute call. We’ll look at the last 6 months of your test log, flag tests that called winners on peeking or under-power, and identify the 3 highest-priority test hypotheses for the next quarter — based on the 347 Method (Build Grow Scale’s research across 347 stores) and the 4-to-34 Gap framework.
A/B testing splits your traffic between two versions of a page or element and measures which one converts better. Version A is the control. Version B is the variant. You run them in parallel against the same audience until the statistical engine confirms one beat the other. The discipline is in the maths, not the dashboard.
You set a hypothesis (if we change X for Y audience, the conversion rate moves because Z mechanism), compute minimum sample size before launch, run the test until that sample is reached, then read the result at 99% statistical significance. GoGoChimp calls this The Evidence Stack. Four layers: hypothesis priority, sample-size discipline, the stopping rule, failure-as-information.
Three formats cover almost every engagement. Standard A/B (two variants, one element). Multivariate (multiple elements changing at once, useful when traffic is high and you want to test combinations). Split-URL (two different pages on different URLs, used for radical redesigns). The first is most defensible. The third is the riskiest because it confounds layout and content changes.
They are the same thing in practice. "Split testing" is the older term and still common in email marketing and direct-response copywriting circles. "A/B testing" is the term software platforms standardised on. The discipline is identical: two variants, one audience, a statistical engine declaring a winner. If anyone tells you they are different, ask them to define both.
At the 95% default, roughly one in every twenty tests of a change with no real effect will still be declared a winner. On a 30-test quarter that is up to 1.5 ghost wins deployed to production. Over a year, six. The 99 Rule cuts that to one in every 100. The trade-off is roughly 30-50% longer test duration. On any programme above 20 tests per quarter, the maths inverts in favour of 99%.
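The arithmetic behind the ghost-win figures, as a short sketch. It assumes the worst case, in which none of the tests has a real effect, so the numbers are an upper bound on expected false positives rather than a prediction. The function name is illustrative.

```python
def expected_false_positives(tests_with_no_real_effect, alpha):
    """Expected number of null tests that cross the significance threshold by chance."""
    return tests_with_no_real_effect * alpha


tests_per_quarter = 30  # the figure used in the text

for confidence in (0.95, 0.99):
    alpha = 1 - confidence
    per_quarter = expected_false_positives(tests_per_quarter, alpha)
    print(
        f"{confidence:.0%} threshold: up to {per_quarter:.1f} false winners per quarter, "
        f"{per_quarter * 4:.1f} per year"
    )
```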
The maths is brutal here. If you are running fewer than 1,000 sessions per variant per week, you do not have an A/B testing problem, you have a traffic problem. Minimum useful sample size depends on baseline conversion rate and minimum detectable effect, but a workable rule is 1,000 sessions per variant for surface-level tests and 5,000+ per variant for subtle copy or layout tests at 99% significance.
VWO, Convert, AB Tasty, and Optimizely across the client roster, plus Microsoft Clarity and Hotjar for behavioural overlay. Tool choice is downstream of methodology. OperatorAI (GoGoChimp's CRO methodology, distinct from OpenAI's Operator agent product) sets the hypothesis and the threshold; the platform executes the test. Every platform defaults to 95%; the override to 99% is manual on every test.
Enzymedica UK, December 2021. Baseline conversion rate 3.4%. Three compounded operator-set tests, all declared at 99% significance. Black Friday closed at 16.9% (a 4.97x lift) and December sustained at 11% during the worst month of the year for health supplements. Loom analytics review on file. That is what The Evidence Stack produces when you run it end-to-end on a real engagement.