A/B Testing: Statistical-Significance Conversion Testing

DEFINITION

What is A/B testing?

A/B testing is a controlled experiment that splits one audience between two versions of the same page, email, ad, or product feature - Version A (the control) and Version B (the variant) - then uses a statistical-significance test to decide which version won. The winner is the version that produces more of the target outcome (purchases, signups, clicks, retained users) by an amount large enough that it could not plausibly be random.

In plain English: when you can’t agree which headline, button colour, pricing layout, or onboarding flow is better, you stop arguing and let the people who actually buy from you decide. A/B testing is the operator’s version of opinion settlement.

Read the full CRO definition →

The term “A/B testing” is used interchangeably with split testing - they describe the same method. Multivariate testing (MVT) is a different beast: it tests many element combinations at once and needs roughly 5–10× the traffic to reach significance.

PROCESS

How A/B testing works (step-by-step)

The textbook version of A/B testing has six steps. The version that ships winners has eight.

Form a hypothesis. Not “test this headline” - but “If we change the headline from X to Y, conversion will increase by ≥ N% because [specific user-research insight].” A hypothesis with no falsifiable prediction is a guess in a suit.
Estimate sample size before launching. Use a sample-size calculator (Optimizely, AB Tasty, VWO, or Evan Miller’s free version). Inputs: baseline conversion rate, minimum detectable effect (MDE), statistical power (typically 80%), significance threshold (typically 95% - we use 99%; see The 99 Rule).
Build the variant. Keep one variable per test. If you change the headline AND the button AND the image in a single variant, you cannot tell which change caused the result; isolating each element’s effect takes a multivariate test, not an A/B test.
Randomly split traffic 50/50 using the testing platform’s traffic-allocation engine. Visitors should be evenly cohorted and cookie-persisted so the same visitor sees the same variant.
Let the test run. Do not stop early. Do not “peek” daily and trust the result. Peeking inflates false-positive rates dramatically.
Reach significance and the pre-calculated sample size. Both conditions must be true. Hitting 95% confidence at 200 sessions is statistical noise, not a winner.
Decide. Ship the winner, kill the loser, or - if neither variant beat control - log the negative result. Negative results are data.
Audit downstream. A winning variant still has to not break checkout, not regress mobile load times, and not cannibalise downstream metrics (lifetime value, refund rate, support tickets).
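Step 4's requirement that the same visitor always sees the same variant can also be met without storing an assignment, by hashing a stable visitor ID. The sketch below is one minimal way to do that in Python, not how any particular testing platform implements it. The function name `assign_variant`, the `experiment` key, and the ID format are hypothetical names chosen for illustration.

```python
import hashlib


def assign_variant(visitor_id: str, experiment: str, split: float = 0.5) -> str:
    """Return "A" or "B" deterministically for one visitor in one experiment."""
    # Hash the experiment name together with the visitor ID, so a visitor can
    # land in different arms of different experiments but always lands in the
    # same arm of this one.
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Map the first 8 hex digits (32 bits) to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "A" if bucket < split else "B"


# Hypothetical usage: the ID would normally come from a first-party cookie.
variant = assign_variant("visitor-12345", "homepage-headline")
```

Because the assignment depends only on the ID and the experiment name, it gives the same answer on every request and every server for as long as the ID itself persists.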

TYPES

A/B testing vs split testing vs multivariate testing

Method	What it tests	Traffic required	Best for
A/B test	One variable, two versions	Moderate (~1,000 conversions/variant)	Single-element changes: headlines, CTAs, pricing labels, hero images
Split test (URL split)	Two entirely different page templates served from different URLs	Same as A/B	Full-page redesigns where you can’t variant-swap a single element
Multivariate (MVT)	Multiple variables × multiple values, all combinations tested simultaneously	~5–10× an A/B test	Mature programmes with high traffic that need to find element-interaction effects

A/B and split tests are the same statistical model — the only difference is where the variant lives (same URL with element swap vs different URL with whole-page swap). The decision is implementation, not method.

Multivariate testing is rarely the right choice for sub-£10M/year sites. The traffic requirement kills it before you reach significance. Most teams who think they need MVT actually need to run sequential A/B tests — find the winning headline, then find the winning button, then find the winning image — each as a separate, fully-powered experiment.
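To see where the extra traffic requirement comes from, it helps to count the cells. The sketch below enumerates every combination for a hypothetical page with three headlines, two buttons, and two images. The element names and the traffic figure are made up for illustration.

```python
from itertools import product


def mvt_cells(options: dict[str, list[str]]) -> list[dict[str, str]]:
    """Every combination of element values a full-factorial MVT must serve."""
    names = list(options)
    return [dict(zip(names, combo)) for combo in product(*options.values())]


# Hypothetical page elements.
options = {
    "headline": ["benefit-led", "price-led", "social-proof"],
    "button": ["Buy now", "Add to basket"],
    "image": ["lifestyle", "product"],
}
cells = mvt_cells(options)  # 3 x 2 x 2 = 12 combinations

monthly_visitors = 60_000  # illustrative traffic figure
per_cell_mvt = monthly_visitors // len(cells)  # 12 cells share the traffic
per_arm_ab = monthly_visitors // 2             # an A/B test splits it two ways
```

Each of the 12 cells receives a sixth of the traffic an A/B arm would get, which is why sequencing single-variable tests usually reaches a decision sooner.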

STATISTICAL SIGNIFICANCE

Sample size and statistical significance: The 99 Rule

The single most common A/B testing failure is calling winners on tests that never reached significance. The fix is two numbers, locked in before the test launches:

Confidence threshold: one minus the false-positive rate you are willing to accept. At 95%, a variant that truly changes nothing will still be declared a winner about 1 time in 20; at 99%, about 1 time in 100. Industry default is 95%. GoGoChimp uses 99%, which we call The 99 Rule. The maths is the same; we accept fewer false positives in exchange for longer test cycles. On a £10M store, one false positive that ships costs more than two extra weeks of test runtime.
Sample size: the minimum number of sessions per variant needed to detect your minimum detectable effect (MDE) at the chosen confidence and power. For a baseline 3% conversion rate and a 10% relative MDE at 99% confidence and 80% power (two-sided test), you need roughly 79,000 sessions per variant. Halve the MDE and the required sample size roughly quadruples.
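The sample-size calculation can be reproduced with the standard two-proportion formula. The sketch below uses only the Python standard library and assumes a two-sided test with equal traffic per arm. The function name is hypothetical, and the exact figure depends on the power level and on whether the test is one- or two-sided, so results will differ somewhat between calculators.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline: float, relative_mde: float,
                            confidence: float = 0.99, power: float = 0.80) -> int:
    """Sessions needed in each arm for a two-sided two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)          # rate we want to be able to detect
    alpha = 1 - confidence
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for significance
    z_beta = NormalDist().inv_cdf(power)            # critical value for power
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


n_full = sample_size_per_variant(baseline=0.03, relative_mde=0.10)
n_half = sample_size_per_variant(baseline=0.03, relative_mde=0.05)
ratio = n_half / n_full  # close to 4: halving the MDE roughly quadruples the sample
```

The squared difference in the denominator is what makes small effects so expensive to detect.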

The “peeking problem” (looking at a test daily and stopping the moment confidence crosses 95%) turns a 5% nominal false-positive rate into a 30%+ effective false-positive rate. The fix is sequential-testing methods (Always Valid Inference, used by Optimizely’s Stats Engine) or Bayesian inference (used by VWO SmartStats), both of which are designed for continuous monitoring.
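The inflation from peeking is easy to reproduce. The following sketch simulates A/A tests, where both arms share the same true conversion rate so every declared winner is a false positive, and compares checking after every batch against checking once at the end. The batch size, number of looks, and 5% conversion rate are illustrative assumptions, and the function names are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def z_score(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Pooled two-proportion z statistic (0.0 if there is no variance yet)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return 0.0 if se == 0 else (conv_b / n_b - conv_a / n_a) / se


def false_positive_rates(experiments: int = 500, looks: int = 14, batch: int = 200,
                         rate: float = 0.05, confidence: float = 0.95,
                         seed: int = 1) -> tuple[float, float]:
    """Return (rate with peeking, rate with one final look) over A/A tests."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    peeking_hits = fixed_hits = 0
    for _experiment in range(experiments):
        conv_a = conv_b = n = 0
        ever_crossed = False
        for _look in range(looks):
            # Both arms draw from the same true rate: there is no real winner.
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            n += batch
            z = z_score(conv_a, n, conv_b, n)
            if abs(z) > z_crit:
                ever_crossed = True   # a peeker would have stopped here
        peeking_hits += ever_crossed
        fixed_hits += abs(z) > z_crit  # only the final look counts
    return peeking_hits / experiments, fixed_hits / experiments


peeking_rate, fixed_rate = false_positive_rates()
```

With the single final look the false-positive rate stays near the nominal level, while stopping at the first crossing pushes it well above. More looks make the gap wider.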

For most programmes, the simpler fix is: pre-calculate the sample size, set a calendar reminder for the runtime, and don’t open the dashboard until that date.

EXAMPLES

A/B testing examples — what category leaders actually test

Does Netflix use A/B testing? Yes — Netflix is one of the most aggressive A/B testers in software. They publish technical posts on the methodology under “Netflix TechBlog.” Public examples: artwork-personalisation tests where the title-card image varies by viewer history (different posters for different audience cohorts on the same show), autoplay-trailer tests, and the original “Skip Intro” button rollout — all decision-graded on retention and watch-time metrics, not just clicks.

Meta / Facebook Ads: Meta’s “A/B Test” feature inside Ads Manager runs split-traffic experiments across ad sets to find the configuration with the lowest cost per result. Meta also A/B-tests product changes such as feed ranking on small user cohorts before global rollout.

GoGoChimp named-client A/B testing wins

Enzymedica UK: product-page A/B test sequence (hero copy, badge placement, urgency framing) lifted conversion from 3.4% → 16.9% — a 397% relative increase across the tested product line.
Super Area Rugs: category-page and PDP A/B tests delivered +216% conversion increase in 37 days.
Helix Binders: quote-request form A/B test (single-step vs multi-step) tripled qualified leads in 11 days.
Donate For Charity: donation-flow A/B test (form length + trust signals) lifted conversions +494% in 30 days.
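Lift figures like these are relative, not absolute, and the two are easy to confuse. As a quick check of the arithmetic behind the Enzymedica UK figure, the sketch below computes both; the function names are hypothetical.

```python
def relative_lift(control_rate: float, variant_rate: float) -> float:
    """Change in conversion rate as a fraction of the control rate."""
    return (variant_rate - control_rate) / control_rate


def absolute_lift_points(control_rate: float, variant_rate: float) -> float:
    """Change in conversion rate in percentage points."""
    return (variant_rate - control_rate) * 100


# 3.4% -> 16.9% is about 13.5 percentage points, or roughly 397% relative.
rel = relative_lift(0.034, 0.169)
pts = absolute_lift_points(0.034, 0.169)
```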

The pattern: hypothesis-led, single-variable, 99%-confidence, no peeking. The 4-to-34 Gap framework explains why operator-led A/B testing programmes hit 28–34% lifts while self-serve AI tools cluster at 4–7%.

TOOLS

A/B testing tools and platforms in 2026

Platform	Best for	Pricing model	Notes
Optimizely	Enterprise, multi-product testing	Custom (~£50K+/year)	Stats Engine uses sequential testing for peeking-safe inference
VWO	Mid-market ecommerce + SaaS	From ~£300/month	SmartStats uses Bayesian — peeking-safe by default
Google Optimize	(Sunset Sept 2023)	—	Migrate to GA4 + a paid tool
Convert	Privacy-first, GDPR-strict	From ~£300/month	EU data residency, no third-party cookies required
AB Tasty	Enterprise + personalisation	Custom	Strong feature-flagging integration
PostHog / Statsig / GrowthBook	Product-led teams	From free	Code-deployed, feature-flag-native, engineer-friendly
Mailchimp / Klaviyo	Email-only A/B testing	Included	Subject-line + send-time tests built in
Meta Business Suite	Ad-creative + audience splits	Included	Native split-test inside Ads Manager

The right tool is the cheapest one your team will actually use. A £1,000/month platform that no one logs into produces zero tests. A free PostHog instance owned by an engineer who runs four tests a month produces a 28–34% lift after a year.

DISCIPLINES

A/B testing across disciplines

A/B testing in digital marketing: subject lines, ad creative, landing-page headlines, audience segments. The variable is messaging; the metric is conversion or cost-per-result.

A/B testing in email marketing: subject line, send time, sender name, body copy, CTA placement. Mailchimp and Klaviyo run these natively against open-rate or click-rate.

A/B testing in Facebook ads / Meta Ads: ad-set splits on creative, audience, placement, or bid strategy. Meta’s native A/B Test feature runs these statistically.

A/B testing in product management: feature rollouts. Show new feature to 50% of users, measure activation/retention impact, decide based on the statistical comparison. Tools: PostHog, Statsig, LaunchDarkly.

A/B testing in data science: the underlying frequentist or Bayesian inference problem — two samples, one test statistic, one p-value or posterior. SQL-driven A/B tests pull conversion counts from event tables and compute significance in-database.
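For the frequentist case, the comparison usually reduces to a two-proportion z-test. A minimal sketch using only the Python standard library follows. The function name and the example counts are hypothetical, and real platforms may apply corrections this version omits.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a: int, n_a: int,
                          conv_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    # Under the null hypothesis both arms share one rate, so pool them.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: control converts 300 of 10,000, variant 360 of 10,000.
z, p = two_proportion_z_test(300, 10_000, 360, 10_000)
significant_at_95 = p < 0.05
significant_at_99 = p < 0.01
```

With these counts the p-value lands a little under 0.02, so the variant clears a 95% threshold but not a 99% one. That gap is the practical difference between the two thresholds.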

A/B testing in machine learning: champion/challenger model rollouts. Production traffic is split between a current-best model and a candidate, measuring against the business metric the model is meant to move (CTR, conversion, latency, downstream LTV).

A/B testing in QA: controlled rollout of code changes (often called canary deployments or shadow traffic) — different from marketing A/B testing in intent (find bugs vs find lifts) but mathematically identical (split traffic, compare outcomes).

A/B testing in social media: TikTok, YouTube, and Instagram all let creators test thumbnails, hooks, or titles. YouTube’s “Test & Compare” is its native A/B test on thumbnails. Test the hook, not the algorithm.

WHEN TO TEST

When to run A/B tests — and when not to

Run an A/B test when:

You have ≥ 1,000 conversions/month on the page or flow you’re testing (anything less takes too long to reach 99% confidence).
The change is reversible. Pricing tests, headline tests, layout tests — all reversible. Brand-positioning changes that touch every page are not.
You have a falsifiable hypothesis backed by qualitative data (heatmaps, session replays, user interviews) — not a guess.
The variable you’re testing moves a metric you can measure within the test window.

Skip the A/B test (just ship it) when:

The change is a bug fix or a known accessibility/legal requirement.
Traffic is too low — sub-1,000 monthly conversions means six-month test windows. Better to ship a sequence of qualitative-validated changes and measure pre/post.
You’re testing two variants you’d both happily ship — the test runtime cost exceeds the decision value.
The change is brand-coherence-driven, not conversion-driven (logo, voice, mission statement).

Why A/B testing matters: every business decision is a hypothesis. A/B testing is the only practical way to falsify a marketing-or-product hypothesis with money on the table. The alternative — HiPPO (Highest-Paid Person’s Opinion) — has roughly the predictive accuracy of a coin flip, except the coin flip is free.

FAILURE MODES

Common A/B testing failures (and the fix)

Failure	What it looks like	Fix
Peeking	Checking the dashboard daily, stopping the moment confidence hits 95%	Lock the runtime to the pre-calculated sample size; use Bayesian or sequential-testing inference if you must monitor continuously
Underpowered tests	“We tested it for two weeks but the result wasn’t significant”	Pre-calculate sample size; if you can’t reach it in 4 weeks, the change isn’t worth A/B testing — ship a qualitative-validated variant instead
Testing too many variables	Hero swap that changes headline + image + button + colour all at once	One variable per test; sequence them
Ignoring negative results	Only logging the wins	Negative results are evidence — they tell you which lever this audience doesn’t move on. Log them; they compound into a research dataset
No hypothesis	“Let’s test this and see what happens”	Hypothesis must be falsifiable: “If we change X to Y, [metric] will increase by ≥ N% because [insight]”
Confounded traffic	Running a paid-search campaign concurrent with a homepage test	Hold confounders constant or use a paired-cohort analysis. Calendar effects (Black Friday, school holidays) wreck tests that don’t control for them
No downstream audit	Winning variant ships, then refund rate spikes	Audit downstream metrics for 60 days post-ship: LTV, refund rate, NPS, support volume

Industry average is 2–3 A/B tests per quarter. The programmes that produce 28–34% lifts run 30+ tests per quarter and accept a 70%+ negative-result rate — because the 30% that win compound.

FAQ

A/B testing FAQ

What is A/B testing in simple terms?

A/B testing is showing two versions of the same thing (a page, an email, an ad) to two random groups of people and using a statistical test to decide which version produced more of the result you care about — purchases, signups, clicks, retained users.

How is A/B testing done?

Form a hypothesis, calculate the sample size needed at your confidence threshold, build the variant, split traffic 50/50 randomly, let the test reach significance and sample size before deciding, then audit the winner downstream for 60 days.

What is the difference between A/B testing and split testing?

Nothing — they’re synonyms. Some use “split testing” specifically for URL-split tests (two entirely different page templates from different URLs) and “A/B testing” for same-URL element swaps, but the statistical method is identical.

A/B testing vs multivariate testing — which should I use?

A/B testing for almost everything. Multivariate testing only when you have over 100,000 monthly conversions and specifically need element-interaction effects. Most teams should run sequential A/B tests instead.

Does Netflix use A/B testing?

Yes, extensively. Netflix publishes their methodology on the Netflix TechBlog and runs tests on artwork personalisation, autoplay behaviour, UI changes, and recommendation algorithms — all decision-graded on retention and watch-time.

What are the best A/B testing tools in 2026?

Optimizely or AB Tasty for enterprise. VWO or Convert for mid-market. PostHog, Statsig, or GrowthBook for product-led teams. Mailchimp or Klaviyo for email. Google Optimize was sunset in September 2023.

Why do most A/B tests fail to find a winner?

Three reasons: the hypothesis wasn’t strong enough, the test was underpowered, or the test was stopped early by peeking. Fix the hypothesis with user research, fix the sample size with pre-launch calculation, fix peeking with a locked runtime.

How long should an A/B test run?

Until two conditions are both true: the pre-calculated sample size is reached for each variant, and statistical significance crosses your confidence threshold (we use 99%). For most sites with 1,000+ monthly conversions, that’s 14–28 days.
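The runtime follows directly from the sample size and your traffic. The sketch below is a rough estimate under two assumptions that are not from the text above: traffic is split evenly across variants, and the runtime is rounded up to whole weeks so every weekday is represented equally. The function name and the example numbers are hypothetical.

```python
from math import ceil


def runtime_days(sample_per_variant: int, daily_visitors: int,
                 variants: int = 2) -> int:
    """Days needed to fill every arm, rounded up to whole weeks."""
    days = ceil(sample_per_variant * variants / daily_visitors)
    return ceil(days / 7) * 7


# Hypothetical: 40,000 sessions per variant at 5,000 visitors a day.
days_needed = runtime_days(40_000, 5_000)  # 16 days of traffic, so 21 days
```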

Where can I learn A/B testing?

Evan Miller’s free sample-size calculator and blog, CXL’s “Statistics for A/B Testing” course, the Optimizely and VWO knowledge bases, and Ronny Kohavi’s book Trustworthy Online Controlled Experiments.

Why is A/B testing important?

Because every business decision is a hypothesis, and A/B testing is the only practical way to falsify a marketing or product hypothesis with money on the line. A site running 30 tests a quarter at a 30% win rate compounds 9 wins per quarter. A site running 2 tests a quarter at the same win rate compounds 0.6.

FREE 99-RULE AUDIT

Want a 99-Rule audit of your A/B testing programme?

Free 15-minute call. We’ll look at the last 6 months of your test log, flag tests that called winners on peeking or under-power, and identify the 3 highest-priority test hypotheses for the next quarter — based on the 347 Method (Build Grow Scale’s research across 347 stores) and the 4-to-34 Gap framework.

Book my free A/B testing audit →

Nominated — Digital Doughnut Digital Marketing Agency of the Year 2021

From the GoGoChimp blog

The ICE framework is broken. Here's what to use instead for A/B test prioritisation

Why Most A/B Tests Find a Local Maximum (and How to Escape It)

A/B Testing Foundations: The Math, the History, and Message Match

How to Build a High-Converting Landing Page (the OperatorAI Build)