PILLAR

A/B Testing — GoGoChimp Blog

A/B testing is the operator’s version of opinion settlement. Two variants, statistical significance, and a winner the data picks — not the highest-paid person in the meeting.

Most teams fail at A/B testing in one of three ways: they run too few tests to find winners, they stop tests at the wrong moment, or they test on traffic volumes too low to produce meaningful results. The industry average is 2–3 tests per quarter. The testing programmes that transform businesses run 30+.

Every post walks through one part of that work: hypothesis prioritisation, sample-size maths, multi-variate vs A/B, reading interaction effects, and shipping winners without regressions.

DEFINITION

What is A/B testing?

A/B testing is a controlled experiment that splits one audience between two versions of the same page, email, ad, or product feature — Version A (the control) and Version B (the variant) — then uses a statistical-significance test to decide which version won. The winner is the version that produces more of the target outcome (purchases, signups, clicks, retained users) by an amount large enough that it could not plausibly be random.

In plain English: when you can’t agree which headline, button colour, pricing layout, or onboarding flow is better, you stop arguing and let the people who actually buy from you decide. A/B testing is the operator’s version of opinion settlement.

The term “A/B testing” is used interchangeably with split testing — they describe the same method. Multivariate testing (MVT) is a different beast: it tests many element combinations at once and needs roughly 5–10× the traffic to reach significance.

PROCESS

How A/B testing works (step-by-step)

The textbook version of A/B testing has six steps. The version that ships winners has eight.

  1. Form a hypothesis. Not “test this headline” — but “If we change the headline from X to Y, conversion will increase by ≥ N% because [specific user-research insight].” A hypothesis with no falsifiable prediction is a guess in a suit.
  2. Estimate sample size before launching. Use a sample-size calculator (Optimizely, AB Tasty, VWO, or Evan Miller’s free version). Inputs: baseline conversion rate, minimum detectable effect (MDE), statistical power (typically 80%), significance threshold (typically 95% — we use 99%; see The 99 Rule).
  3. Build the variant. Keep one variable per test. If you change the headline AND the button AND the image, you have a multivariate test, not an A/B test.
  4. Randomly split traffic 50/50 using the testing platform’s traffic-allocation engine. Visitors should be evenly cohorted and cookie-persisted so the same visitor sees the same variant.
  5. Let the test run. Do not stop early. Do not “peek” daily and trust the result. Peeking inflates false-positive rates dramatically.
  6. Reach significance and the pre-calculated sample size. Both conditions must be true. Hitting 95% confidence at 200 sessions is statistical noise, not a winner.
  7. Decide. Ship the winner, kill the loser, or, if the variant did not beat control, log the negative result. Negative results are data.
  8. Audit downstream. A winning variant still has to not break checkout, not regress mobile load times, and not cannibalise downstream metrics (lifetime value, refund rate, support tickets).
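
Step 4 can be sketched in a few lines. This is a minimal illustration, not how any particular testing platform implements allocation: the function name `assign_variant`, the experiment key, and the visitor ID format are all hypothetical. Hashing a stable visitor ID gives each visitor the same variant on every visit, which is the job the persistence cookie does in a hosted tool.

```python
import hashlib

def assign_variant(visitor_id: str, experiment: str) -> str:
    """Deterministically bucket a visitor into 'A' or 'B' (50/50 split)."""
    # Salt with the experiment name so buckets are independent across tests.
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest[:8], 16) % 100  # 0..99
    return "A" if bucket < 50 else "B"

# Hypothetical visitor IDs, for illustration only.
visitors = [f"visitor-{i}" for i in range(10_000)]
counts = {"A": 0, "B": 0}
for v in visitors:
    counts[assign_variant(v, "headline-test")] += 1
print(counts)  # the two counts should land close to an even split
```

Because the assignment depends only on the experiment name and the visitor ID, re-running it for a returning visitor always yields the same variant, with no lookup table needed.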

TYPES

A/B testing vs split testing vs multivariate testing

Method | What it tests | Traffic required | Best for
A/B test | One variable, two versions | Moderate (~1,000 conversions/variant) | Single-element changes: headlines, CTAs, pricing labels, hero images
Split test (URL split) | Two entirely different page templates served from different URLs | Same as A/B | Full-page redesigns where you can’t variant-swap a single element
Multivariate (MVT) | Multiple variables × multiple values, all combinations tested simultaneously | ~5–10× an A/B test | Mature programmes with high traffic that need to find element-interaction effects

A/B and split tests are the same statistical model — the only difference is where the variant lives (same URL with element swap vs different URL with whole-page swap). The decision is implementation, not method.

Multivariate testing is rarely the right choice for sub-£10M/year sites. The traffic requirement kills it before you reach significance. Most teams who think they need MVT actually need to run sequential A/B tests — find the winning headline, then find the winning button, then find the winning image — each as a separate, fully-powered experiment.

STATISTICAL SIGNIFICANCE

Sample size and statistical significance: The 99 Rule

The single most common A/B testing failure is calling winners on tests that never reached significance. The fix is two numbers, locked in before the test launches:

  • Confidence threshold: how sure you require the result to be before calling a winner. At 95% confidence, a change with no real effect still has a 5% chance of being declared a winner; at 99%, a 1% chance. Industry default is 95%. GoGoChimp uses 99%; we call this The 99 Rule. The maths is the same; we accept fewer false positives in exchange for slightly longer test cycles. On a £10M store, one false positive that ships costs more than two extra weeks of test runtime.
  • Sample size: the minimum number of sessions per variant needed to detect your minimum detectable effect (MDE) at the chosen confidence. For a baseline 3% conversion rate and a 10% relative MDE at 99% confidence, a standard two-sided calculation at 80% power gives roughly 79,000 sessions per variant. Halve the MDE and the required sample size roughly quadruples.
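
The sample-size arithmetic can be sketched as follows. This is a rough illustration using the common normal-approximation formula for comparing two proportions, assuming a two-sided test and 80% power. Those assumptions and the function name `sessions_per_variant` are ours, so the output may not match a given vendor calculator exactly.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sessions_per_variant(baseline: float, relative_mde: float,
                         confidence: float = 0.99, power: float = 0.80) -> int:
    """Sessions needed in each arm to detect a relative lift (two-sided test)."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# 3% baseline; compare a 10% relative MDE against a 5% relative MDE.
full = sessions_per_variant(0.03, 0.10)
half = sessions_per_variant(0.03, 0.05)
print(full, half, round(half / full, 2))
```

The ratio printed last shows the effect of halving the MDE: required sessions grow with the inverse square of the difference you want to detect, which is why small expected lifts are so expensive to test.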

The “peeking problem” (looking at a test daily and stopping the moment confidence crosses 95%) turns a 5% nominal false-positive rate into a 30%+ effective false-positive rate. The fix is sequential-testing methods (Always Valid Inference, used by Optimizely Stats Engine) or Bayesian inference (used by VWO SmartStats), both of which are designed for continuous monitoring.
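
A small simulation makes the inflation visible. This is a sketch under simplifying assumptions, not a model of any platform's statistics: it runs A/A tests (no real difference), treats each daily look as adding one equal-sized batch of data, and uses a normal approximation for the test statistic. The function name and the choice of 28 looks are illustrative, and the exact inflation depends on how many looks are taken.

```python
import random
from math import sqrt

def false_positive_rates(looks: int = 28, sims: int = 2000,
                         z_crit: float = 1.96, seed: int = 7):
    """Simulate A/A tests; return (rate with peeking, rate with one final look)."""
    rng = random.Random(seed)
    peeking_hits = 0
    final_hits = 0
    for _ in range(sims):
        total = 0.0
        crossed = False
        z = 0.0
        for look in range(1, looks + 1):
            # Under the null, each batch's standardised difference is a
            # standard normal draw; z is the statistic on all data so far.
            total += rng.gauss(0.0, 1.0)
            z = total / sqrt(look)
            if abs(z) > z_crit:
                crossed = True  # a peeker would have stopped and shipped here
        peeking_hits += crossed
        final_hits += abs(z) > z_crit
    return peeking_hits / sims, final_hits / sims

peeking_rate, fixed_rate = false_positive_rates()
print(f"stop at first crossing: {peeking_rate:.1%}, single final look: {fixed_rate:.1%}")
```

The single-look rate stays near the nominal 5%, while the stop-at-first-crossing rate is several times higher, even though nothing about the data changed.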

For most programmes, the simpler fix is: pre-calculate the sample size, set a calendar reminder for the runtime, and don’t open the dashboard until that date.

EXAMPLES

A/B testing examples — what category leaders actually test

Does Netflix use A/B testing? Yes — Netflix is one of the most aggressive A/B testers in software. They publish technical posts on the methodology under “Netflix TechBlog.” Public examples: artwork-personalisation tests where the title-card image varies by viewer history (different posters for different audience cohorts on the same show), autoplay-trailer tests, and the original “Skip Intro” button rollout — all decision-graded on retention and watch-time metrics, not just clicks.

Meta / Facebook Ads: Meta’s “A/B Test” feature inside Ads Manager runs split-traffic experiments across ad sets to find the lowest cost-per-result configuration. The same statistical engine powers Meta’s organic experiments: feed-ranking changes are A/B-tested on small user cohorts before global rollout.

GoGoChimp named-client A/B testing wins

  • Enzymedica UK: product-page A/B test sequence (hero copy, badge placement, urgency framing) lifted conversion from 3.4% → 16.9% — a 397% relative increase across the tested product line.
  • Super Area Rugs: category-page and PDP A/B tests delivered +216% conversion increase in 37 days.
  • Helix Binders: quote-request form A/B test (single-step vs multi-step) tripled qualified leads in 11 days.
  • Donate For Charity: donation-flow A/B test (form length + trust signals) lifted conversions +494% in 30 days.

The pattern: hypothesis-led, single-variable, 99%-confidence, no peeking. The 4-to-34 Gap framework explains why operator-led A/B testing programmes hit 28–34% lifts while self-serve AI tools cluster at 4–7%.

TOOLS

A/B testing tools and platforms in 2026

Platform | Best for | Pricing model | Notes
Optimizely | Enterprise, multi-product testing | Custom (~£50K+/year) | Stats Engine handles peeking-safe inference
VWO | Mid-market ecommerce + SaaS | From ~£300/month | SmartStats uses Bayesian inference; peeking-safe by default
Google Optimize | (Sunset Sept 2023) | n/a | Migrate to GA4 + a paid tool
Convert | Privacy-first, GDPR-strict | From ~£300/month | EU data residency, no third-party cookies required
AB Tasty | Enterprise + personalisation | Custom | Strong feature-flagging integration
PostHog / Statsig / GrowthBook | Product-led teams | From free | Code-deployed, feature-flag-native, engineer-friendly
Mailchimp / Klaviyo | Email-only A/B testing | Included | Subject-line + send-time tests built in
Meta Business Suite | Ad-creative + audience splits | Included | Native split-test inside Ads Manager

The right tool is the cheapest one your team will actually use. A £1,000/month platform that no one logs into produces zero tests. A free PostHog instance owned by an engineer who runs four tests a month produces a 28–34% lift after a year.

DISCIPLINES

A/B testing across disciplines

A/B testing in digital marketing: subject lines, ad creative, landing-page headlines, audience segments. The variable is messaging; the metric is conversion or cost-per-result.

A/B testing in email marketing: subject line, send time, sender name, body copy, CTA placement. Mailchimp and Klaviyo run these natively against open-rate or click-rate.

A/B testing in Facebook ads / Meta Ads: ad-set splits on creative, audience, placement, or bid strategy. Meta’s native A/B Test feature runs these statistically.

A/B testing in product management: feature rollouts. Show the new feature to 50% of users, measure the activation/retention impact, and decide based on the statistical comparison. Tools: PostHog, Statsig, LaunchDarkly.

A/B testing in data science: the underlying frequentist or Bayesian inference problem — two samples, one test statistic, one p-value or posterior. SQL-driven A/B tests pull conversion counts from event tables and compute significance in-database.
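
As a rough sketch of the frequentist version, here is a pooled two-proportion z-test on conversion counts. The counts are made up for illustration, and the function name is ours. In practice the four inputs would come from a query against your event tables.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z statistic, two-sided p-value) for B versus A."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate: the best estimate of the shared rate if A and B do not differ.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Made-up counts, for illustration only.
z, p = two_proportion_z_test(conv_a=1200, n_a=40_000, conv_b=1340, n_b=40_000)
print(f"z = {z:.2f}, p = {p:.4f}, significant at 99%: {p < 0.01}")
```

A p-value below 0.01 corresponds to clearing a 99% confidence threshold. It says nothing about whether the pre-calculated sample size was reached, so that still has to be checked separately.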

A/B testing in machine learning: champion/challenger model rollouts. Production traffic is split between a current-best model and a candidate, measuring against the business metric the model is meant to move (CTR, conversion, latency, downstream LTV).

A/B testing in QA: controlled rollout of code changes (often called canary deployments or shadow traffic) — different from marketing A/B testing in intent (find bugs vs find lifts) but mathematically identical (split traffic, compare outcomes).

A/B testing in social media: TikTok, YouTube, and Instagram all let creators test thumbnails, hooks, or titles. YouTube’s “Test & Compare” is its native A/B test on thumbnails. Test the hook, not the algorithm.

WHEN TO TEST

When to run A/B tests — and when not to

Run an A/B test when:

  • You have ≥ 1,000 conversions/month on the page or flow you’re testing (anything less takes too long to reach 99% confidence).
  • The change is reversible. Pricing tests, headline tests, layout tests — all reversible. Brand-positioning changes that touch every page are not.
  • You have a falsifiable hypothesis backed by qualitative data (heatmaps, session replays, user interviews) — not a guess.
  • The variable you’re testing moves a metric you can measure within the test window.

Skip the A/B test (just ship it) when:

  • The change is a bug fix or a known accessibility/legal requirement.
  • Traffic is too low — sub-1,000 monthly conversions means six-month test windows. Better to ship a sequence of qualitative-validated changes and measure pre/post.
  • You’re testing two variants you’d both happily ship — the test runtime cost exceeds the decision value.
  • The change is brand-coherence-driven, not conversion-driven (logo, voice, mission statement).

Why A/B testing matters: every business decision is a hypothesis. A/B testing is the only practical way to falsify a marketing-or-product hypothesis with money on the table. The alternative — HiPPO (Highest-Paid Person’s Opinion) — has roughly the predictive accuracy of a coin flip, except the coin flip is free.

FAILURE MODES

Common A/B testing failures (and the fix)

Failure | What it looks like | Fix
Peeking | Checking the dashboard daily, stopping the moment confidence hits 95% | Lock the runtime to the pre-calculated sample size; use Bayesian or sequential-testing inference if you must monitor continuously
Underpowered tests | “We tested it for two weeks but the result wasn’t significant” | Pre-calculate sample size; if you can’t reach it in 4 weeks, the change isn’t worth A/B testing. Ship a qualitative-validated variant instead
Testing too many variables | Hero swap that changes headline + image + button + colour all at once | One variable per test; sequence them
Ignoring negative results | Only logging the wins | Negative results are evidence: they tell you which lever this audience doesn’t move on. Log them; they compound into a research dataset
No hypothesis | “Let’s test this and see what happens” | Hypothesis must be falsifiable: “If we change X to Y, [metric] will increase by ≥ N% because [insight]”
Confounded traffic | Running a paid-search campaign concurrent with a homepage test | Hold confounders constant or use a paired-cohort analysis. Calendar lifts (Black Friday, school holidays) wreck tests that do not control for them
No downstream audit | Winning variant ships, then refund rate spikes | Audit downstream metrics for 60 days post-ship: LTV, refund rate, NPS, support volume

Industry average is 2–3 A/B tests per quarter. The programmes that produce 28–34% lifts run 30+ tests per quarter and accept a 70%+ negative-result rate — because the 30% that win compound.

FAQ

A/B testing FAQ

What is A/B testing in simple terms?

A/B testing is showing two versions of the same thing (a page, an email, an ad) to two random groups of people and using a statistical test to decide which version produced more of the result you care about — purchases, signups, clicks, retained users.

How is A/B testing done?

Form a hypothesis, calculate the sample size needed at your confidence threshold, build the variant, split traffic 50/50 randomly, let the test reach significance and sample size before deciding, then audit the winner downstream for 60 days.

What is the difference between A/B testing and split testing?

Nothing — they’re synonyms. Some use “split testing” specifically for URL-split tests (two entirely different page templates from different URLs) and “A/B testing” for same-URL element swaps, but the statistical method is identical.

A/B testing vs multivariate testing — which should I use?

A/B testing for almost everything. Multivariate testing only when you have over 100,000 monthly conversions and specifically need element-interaction effects. Most teams should run sequential A/B tests instead.

Does Netflix use A/B testing?

Yes, extensively. Netflix publishes their methodology on the Netflix TechBlog and runs tests on artwork personalisation, autoplay behaviour, UI changes, and recommendation algorithms — all decision-graded on retention and watch-time.

What are the best A/B testing tools in 2026?

Optimizely or AB Tasty for enterprise. VWO or Convert for mid-market. PostHog, Statsig, or GrowthBook for product-led teams. Mailchimp or Klaviyo for email. Google Optimize was sunset in September 2023.

Why do most A/B tests fail to find a winner?

Three reasons: the hypothesis wasn’t strong enough, the test was underpowered, or the test was stopped early by peeking. Fix the hypothesis with user research, fix the sample size with pre-launch calculation, fix peeking with a locked runtime.

How long should an A/B test run?

Until two conditions are both true: the pre-calculated sample size is reached for each variant, and statistical significance crosses your confidence threshold (we use 99%). For most sites with 1,000+ monthly conversions, that’s 14–28 days.

Where can I learn A/B testing?

Evan Miller’s free sample-size calculator and blog, CXL’s “Statistics for A/B Testing” course, the Optimizely and VWO knowledge bases, and Ronny Kohavi’s book Trustworthy Online Controlled Experiments.

Why is A/B testing important?

Because every business decision is a hypothesis, and A/B testing is the only practical way to falsify a marketing or product hypothesis with money on the line. A site running 30 tests a quarter at a 30% win rate compounds 9 wins per quarter. A site running 2 tests a quarter at the same win rate compounds 0.6.

FREE 99-RULE AUDIT

Want a 99-Rule audit of your A/B testing programme?

Free 15-minute call. We’ll look at the last 6 months of your test log, flag tests that called winners on peeking or under-power, and identify the 3 highest-priority test hypotheses for the next quarter — based on the 347 Method (Build Grow Scale’s research across 347 stores) and the 4-to-34 Gap framework.

Book my free A/B testing audit →

COMING SOON

A/B testing deep-dives landing shortly.

Hypothesis prioritisation frameworks, sample-size maths, multi-variate vs A/B, and reading interaction effects.

Book my free AI audit →
© 2026 GoGoChimp. All rights reserved. Call: 0141 463 6875 - Address: 8 Cheviot Drive, Newton Mearns, Glasgow, G77 5AS
Nominated — Digital Doughnut Digital Marketing Agency of the Year 2021

A/B testing FAQ

What is A/B testing in simple terms?

A/B testing splits your traffic between two versions of a page or element and measures which one converts better. Version A is the control. Version B is the variant. You run them in parallel against the same audience until the statistical engine confirms one beat the other. The discipline is in the maths, not the dashboard.

How is A/B testing done properly?

You set a hypothesis (if we change X for Y audience, the conversion rate moves because Z mechanism), compute minimum sample size before launch, run the test until that sample is reached, then read the result at 99% statistical significance. GoGoChimp calls this The Evidence Stack. Four layers: hypothesis priority, sample-size discipline, the stopping rule, failure-as-information.

What are the types of A/B testing?

Three formats cover almost every engagement. Standard A/B (two variants, one element). Multivariate (multiple elements changing at once, useful when traffic is high and you want to test combinations). Split-URL (two different pages on different URLs, used for radical redesigns). The first is most defensible. The third is the riskiest because it confounds layout and content changes.

What is the difference between A/B testing and split testing?

They are the same thing in practice. "Split testing" is the older term and still common in email marketing and direct-response copywriting circles. "A/B testing" is the term software platforms standardised on. The discipline is identical: two variants, one audience, a statistical engine declaring a winner. If anyone tells you they are different, ask them to define both.

Why does GoGoChimp test at 99% statistical significance instead of 95%?

At the 95% default, roughly one in every twenty tests of a change that does nothing will still be declared a winner. On a 30-test quarter that is up to 1.5 ghost wins deployed to production. Over a year, six. The 99 Rule cuts that to one in every 100. The trade-off is roughly 30-50% longer test duration. On any programme above 20 tests per quarter, the maths inverts in favour of 99%.
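
The arithmetic behind that trade-off can be checked with a short sketch. It assumes a two-sided test at 80% power and, for the false-win count, the worst case in which every tested change truly has no effect. Both assumptions and the function name `relative_runtime` are ours, not part of The 99 Rule itself.

```python
from statistics import NormalDist

def relative_runtime(conf_new: float, conf_old: float, power: float = 0.80) -> float:
    """Ratio of required sample sizes when only the confidence threshold changes."""
    z = NormalDist().inv_cdf

    def factor(conf: float) -> float:
        # Sample size scales with (z_alpha/2 + z_beta)^2, all else equal.
        return (z(1 - (1 - conf) / 2) + z(power)) ** 2

    return factor(conf_new) / factor(conf_old)

tests_per_quarter = 30
for conf in (0.95, 0.99):
    # Upper bound: assumes every tested change truly has no effect.
    ghost_wins = tests_per_quarter * (1 - conf)
    print(f"{conf:.0%}: up to {ghost_wins:.1f} false wins per quarter")

print(f"99% vs 95% runtime: {relative_runtime(0.99, 0.95):.2f}x")
```

With traffic held constant, the sample-size ratio is also the runtime ratio, so the last line is a direct estimate of how much longer a test takes at the stricter threshold.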

How much traffic do I need to run A/B testing?

The maths is brutal here. If you are running fewer than 1,000 sessions per variant per week, you do not have an A/B testing problem, you have a traffic problem. Minimum useful sample size depends on baseline conversion rate and minimum detectable effect, but a workable rule is 1,000 sessions per variant for surface-level tests and 5,000+ per variant for subtle copy or layout tests at 99% significance.

What tools does GoGoChimp use for A/B testing?

VWO, Convert, AB Tasty, and Optimizely across the client roster, plus Microsoft Clarity and Hotjar for behavioural overlay. Tool choice is downstream of methodology. OperatorAI (GoGoChimp's CRO methodology, distinct from OpenAI's Operator agent product) sets the hypothesis and the threshold; the platform executes the test. Every platform defaults to 95%; the override to 99% is manual on every test.

What is a real A/B testing result that proves the discipline works?

Enzymedica UK, December 2021. Baseline conversion rate 3.4%. Three compounded operator-set tests, all declared at 99% significance. Black Friday closed at 16.9% (a 4.97x lift) and December sustained at 11% during the worst month of the year for health supplements. Loom analytics review on file. That is what The Evidence Stack produces when you run it end-to-end on a real engagement.