Shopify A/B Test Sample Size Calculator | PMD

May 26, 2026

Two distributions, one decision. Sample size is what separates a real shift from statistical noise.

The biggest reason Shopify CRO tests fail isn't bad hypotheses. It is bad maths up front.

Brands launch a test, leave it running for two weeks, declare a winner because the lift hit 95% confidence on day eight, and ship a change that does nothing once it rolls out to 100% of traffic. The post-mortem always lands at the same place: there were never enough conversions to know either way.

A Shopify A/B test sample size calculator stops that from happening. It tells you, before you write a single line of code, how many sessions each variant needs to detect a real change at the lift you actually care about. Skip this maths and you are not running CRO; you are running a coin-flip and hoping it agrees with your gut.

Quick answer: a Shopify A/B test sample size calculator works out how many visitors each variant needs before the test can reliably show a true difference in conversion rate. It combines your current baseline conversion rate, the minimum detectable effect (MDE) you would call a win, statistical power (usually 80%) and significance level (usually 5%, which is the same thing as 95% confidence). For most £5–50M Shopify brands, detecting a 5% relative lift on a 2% baseline CR needs roughly 130,000 sessions per variant.

What a Shopify A/B Test Sample Size Calculator Actually Calculates

The calculator does one job. It takes your four inputs and tells you how big each variant needs to be, in sessions, in conversions, or in days at your current traffic, to detect a difference of a given size at a given confidence level.

It is not a magic box. It will not tell you which variant will win. It tells you the floor below which the test result is meaningless, no matter what number Convert, VWO or your in-house Shopify experiments tool throws up on day fourteen.

Three things every CRO-mature Shopify brand we work with (Cadence, Routine, MyoMaster) gets right. They calculate sample size before the test ships, not after. They commit to the number before any data is collected, not when impatience sets in. And they run for full business cycles even when interim results look good.

Most agencies skip this step. It is the cheapest, hardest-to-fake correction we make on every new audit.

The Inputs Every Sample Size Calculation Needs

Four inputs. Everything else is noise.

Baseline conversion rate. The current conversion rate of the page or flow you are testing. Pull this from Shopify analytics for the last 28–56 days, segmented to the same traffic source the test will see. If your homepage converts at 1.8% on cold paid traffic but 4.2% on returning organic, do not mix them. The calculator only works on one population at a time.

Minimum detectable effect (MDE). The smallest relative lift you are prepared to call a win. This is the input most teams get wrong. They set MDE at 2% because "a 2% lift would still be nice", then cannot understand why the test demands 400,000 sessions. The smaller the MDE, the bigger the sample required. The relationship is not linear: halving your MDE roughly quadruples the sample. Pick an MDE that reflects the smallest lift that is commercially worth shipping, not the smallest lift you would be flattered by.

Statistical power (1 - β). The probability the test detects a real lift when one exists. Industry default is 80%. Going to 90% increases your sample by roughly a third. We use 80% on everything except revenue-critical checkout tests, where we take the hit and run at 90%.

Significance level (α). The probability of a false positive. Default 5%, which gives you 95% confidence. Anyone telling you they run at 99% on a Shopify brand below £100M ARR is either running tests for a year each, or massaging the report.

Four inputs, one number out. The maths is settled. The discipline is in honouring it.
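For anyone who wants to see the arithmetic behind the four inputs, here is a minimal Python sketch. It assumes the textbook normal-approximation formula for a two-sided test on two conversion rates with unpooled variance. The function name `sessions_per_variant` and its parameter names are illustrative, not taken from any particular calculator. Calculators differ in the variant they use (pooled variance, one-tailed tests, continuity corrections), so the output can differ noticeably from the figure a specific tool or reference table quotes.

```python
from math import ceil
from statistics import NormalDist


def sessions_per_variant(baseline_cr, relative_mde, power=0.80, alpha=0.05):
    """Sessions per variant for a two-sided, two-proportion z-test.

    Assumption: normal approximation with unpooled variance.
    baseline_cr  -- current conversion rate as a fraction (0.02 for 2%)
    relative_mde -- smallest relative lift worth shipping (0.05 for 5%)
    """
    p1 = baseline_cr
    p2 = baseline_cr * (1 + relative_mde)           # rate if the lift is real
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance
    z_beta = NormalDist().inv_cdf(power)            # statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)        # summed per-arm variance
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


if __name__ == "__main__":
    for mde in (0.10, 0.05, 0.025):
        print(f"2% baseline, {mde:.1%} relative MDE:",
              sessions_per_variant(0.02, mde))
```

Whatever the absolute figure, the scaling holds: halving the MDE roughly quadruples the result, and moving from 80% to 90% power raises it by about a third.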

Plug those four numbers in and you get sessions per variant. Reference table below for binary CR tests at 80% power, 95% confidence, two-tailed. MDE columns are relative lifts; each cell is sessions per variant.

Baseline CR	MDE 3%	MDE 5%	MDE 8%	MDE 10%
1%	730,000	263,000	103,000	66,000
2%	362,000	130,000	51,000	33,000
3%	240,000	86,000	34,000	22,000
5%	141,000	51,000	20,000	13,000
10%	67,000	24,000	9,500	6,100

Read that table carefully. A brand with a 2% homepage CR who wants to detect a 3% relative lift needs 362,000 sessions per variant. At 80,000 monthly sessions that is a 9-month test, which usually means widening the MDE, narrowing scope or scaling traffic first.

CRO Obsessed

Half of all "winning" Shopify A/B tests we audit were called before they were statistically powered. The rate at which those wins fail to replicate is brutal.

PM Digital Design is a full-funnel Shopify CRO and profit-optimisation agency. We help subscription and high-LTV Shopify brands — including Cadence, Routine, Maelove, and others — fix the funnel between their ad spend and their profit.


Worked Example 1: A Supplement Brand Testing Subscription Opt-in

The scenario. An £8M ARR supplement brand running a pre-checkout subscription opt-in test, sitting in the same territory as the work we documented in our Shopify subscription optimisation breakdown. Current subscription take rate at checkout is 38%. They want to roll out a new opt-in card design, and they think it could push take rate towards what we have achieved with Cadence (a subscription brand we work with that moved their checkout subscription take rate to roughly 70% through full-funnel CRO).

Inputs:

Baseline conversion rate: 38% (subscription take rate at checkout)
MDE: 5% relative lift (they would ship anything that moves take rate from 38% to 39.9% or higher)
Power: 80%
Significance: 5% (95% confidence)

Output: approximately 5,200 sessions per variant. Total: 10,400 sessions across both variants.

At 25,000 monthly checkout sessions they reach this in roughly 13 days. The team's instinct was to call the test at day 7 because variant B was up 11% relative. The maths said wait: the confidence interval at day 7 still spanned a range that included zero lift. They waited. By day 14 the lift settled at 8.4% relative and held. They shipped.

The point is not the win. It is that without the sample size calculation, they would have shipped on day 7 with a result the maths said could plausibly be noise.

Worked Example 2: An Apparel Brand Testing PDP Layout

The scenario. A £12M apparel brand we audited testing a new PDP layout. Current PDP-to-cart rate is 6.8%. They wanted to test a new gallery design and below-fold layout.

Inputs:

Baseline: 6.8%
MDE: 8% relative lift (the team agreed they would not roll out anything below an 8% bump for a build of that size)
Power: 80%
Significance: 5% (95% confidence)

Output: approximately 32,000 sessions per variant. Total: 64,000 sessions.

The brand does 180,000 PDP sessions per month. That is a 10–11 day test if traffic is even. In practice traffic is rarely even (Mondays and Tuesdays are slow, weekends are heavy), so the actual run was 15 days to cover two full business cycles.

What nearly killed the original plan: the founder wanted to "see results faster" and proposed running it for 7 days. At 7 days the test would have collected ~40,000 sessions, below the powered threshold, and the interim significance flag would have triggered around day 4 with a spurious winner call.

The MDE assumption matters more than the baseline. If they had dropped MDE to 3% (because a 3% lift on a brand that size is still roughly £360k in ARR), the sample size jumps to roughly 220,000 sessions per variant, or 440,000 in total. At 180,000 PDP sessions a month that is a 10–11 week commitment. That is a different decision entirely, and worth having before you ship the design brief.

Worked Example 3: A High-AOV Brand Testing Cart Page

The scenario. A high-AOV homeware brand with £180 AOV testing a cart-page threshold mechanic, the same lever that produced our £25,535 cart-threshold split-test win for a different client. Current cart-to-checkout rate is 64%.

Inputs:

Baseline: 64%
MDE: 4% relative lift (the team modelled that anything above a 2.5% lift was commercially worth shipping; they set MDE at 4% to give themselves headroom)
Power: 80%
Significance: 5% (95% confidence)

Output: approximately 4,400 sessions per variant. Total: 8,800 sessions.

The brand does 9,000 monthly cart sessions. That is a 30-day test. That feels long, until you remember the alternative is running it for 14 days, calling it inconclusive, and never knowing whether you left £25k a month on the table.

When the baseline is already high (64% in this case), a given relative lift is a large change in absolute terms: a 4% relative lift here is about 2.6 percentage points. The variance of a yes/no outcome, meanwhile, can never exceed 0.25. That is why the sample is smaller than in the apparel example, even with a tighter MDE. High baselines move on smaller samples. Low baselines (under 2%) are brutal.
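One way to see this, as a rough sketch and not the exact formula any given calculator uses: if both variants are assumed to have roughly the baseline variance p(1 - p), the textbook approximation for sessions per variant reduces to the expression below. Here p is the baseline conversion rate and r is the relative MDE; both symbols are chosen for illustration.

```latex
n \approx \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^{2}\,(1 - p)}{p\,r^{2}}
```

The r² in the denominator is why halving the MDE roughly quadruples the sample. The (1 - p)/p factor is why a 64% baseline needs far fewer sessions than a 2% one at the same relative lift.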

The Mistakes That Wreck Your Sample Size Maths

Five mistakes we see in almost every audit cycle.

Calculating sample size in conversions instead of sessions. Some tools quote you "200 conversions per variant". That is fine if your baseline is stable, but if your CR drifts during the test, your conversion target drifts with it. Always work in sessions.

Treating MDE as a wish list. MDE is not the lift you want. It is the lift you would ship. If you would genuinely ship a 2% lift, set MDE at 2% and accept the sample size cost. If you would not, do not set it that low for vanity.

Running unrelated traffic sources through the same test. Paid social, paid search and organic convert differently. Mixing them inflates variance and demands a bigger sample than your calculator quoted. Always segment by traffic source up front.

Stopping early on a peeking-induced "win". If you commit to 10,000 sessions, run 10,000 sessions. Stopping on day 6 because the result looks good is the single most expensive habit in DTC CRO. We covered the full mechanics in our guide to A/B test duration; pair it with this article and your test calendar gets considerably more honest.

Sending tablet, mobile and desktop traffic into the same bucket. Device-level conversion rates differ by 40–60% in our experience. Treating them as one population works for sample size, but only if your variant has the same effect on all devices. It rarely does.
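The cost of stopping early on a peek can be checked with a small simulation. The sketch below is illustrative only: it runs A/A tests (both arms share the same true conversion rate, so every "winner" is a false positive) and compares checking significance once at the end with stopping at the first daily check that looks significant. The traffic figures, the 5% rate, the 14 daily looks and the function names are all assumptions made for the example.

```python
import random
from math import sqrt
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided test at 5% significance


def is_significant(conv_a, conv_b, n):
    """Pooled two-proportion z-test with n sessions in each arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return False
    return abs(conv_a - conv_b) / n / se > Z_CRIT


def simulate_aa_tests(runs=400, looks=14, sessions_per_look=200, cr=0.05, seed=1):
    """Return (false-positive rate with daily peeking, rate at the final look)."""
    rng = random.Random(seed)
    peek_hits = final_hits = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        flagged = False
        for look in range(1, looks + 1):
            # One more "day" of traffic per arm, same true rate in both arms.
            conv_a += sum(rng.random() < cr for _ in range(sessions_per_look))
            conv_b += sum(rng.random() < cr for _ in range(sessions_per_look))
            significant = is_significant(conv_a, conv_b, look * sessions_per_look)
            flagged = flagged or significant
        peek_hits += flagged        # any look was significant
        final_hits += significant   # only the last look counts
    return peek_hits / runs, final_hits / runs


if __name__ == "__main__":
    peeking, fixed = simulate_aa_tests()
    print(f"false positives when stopping at the first flag: {peeking:.1%}")
    print(f"false positives when checking once at the end:   {fixed:.1%}")
```

The fixed-horizon rate should sit near the nominal 5%, while the stop-at-first-flag rate comes out several times higher, even though nothing changed between the two arms.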

Your Monday-Morning Sample Size Checklist

Before any test ships, work through this list. If you cannot answer cleanly, the test is not ready.

Pull baseline CR from Shopify analytics for the test's specific traffic source, last 28–56 days.
Set MDE at the smallest relative lift you would actually ship, not the smallest you would celebrate.
Use 80% power and 95% confidence unless this test gates a revenue-critical decision, in which case go to 90% power.
Calculate sessions per variant; multiply by number of variants for the total.
Divide total by daily sessions to get duration; round up to a full week (cover two business cycles where possible).
Write the number into the test brief and your team Slack. Pre-commit.
Do not look at the dashboard until you hit the number. Peek only if a guardrail metric (revenue per session, page-error rate) goes sideways.
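The duration steps in the checklist can be sketched in a few lines of Python. The function name `planned_duration_days` is hypothetical, and the sketch assumes traffic is split evenly across variants and arrives at a steady daily rate.

```python
from math import ceil


def planned_duration_days(sessions_per_variant, variants, daily_sessions):
    """Days to run, rounded up to whole weeks.

    sessions_per_variant -- output of your sample size calculation
    variants             -- number of arms, including the control
    daily_sessions       -- sessions per day entering the test
    """
    total_sessions = sessions_per_variant * variants
    raw_days = ceil(total_sessions / daily_sessions)
    return ceil(raw_days / 7) * 7  # finish on a full business cycle


if __name__ == "__main__":
    # Apparel example: 32,000 per variant, 2 variants, 180,000 PDP
    # sessions a month, taken here as about 6,000 a day.
    print(planned_duration_days(32_000, 2, 6_000))
```

For the apparel example this returns 14: 64,000 sessions at 6,000 a day is just under 11 days, rounded up to two full weeks.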

That is it. No clever tricks, no proprietary maths. Most of the value in a Shopify CRO programme is built on this single act of restraint, run quarter after quarter, and the compounding edge over teams that don't honour it is large.

If you want a sense of where this thinking sits inside our wider engagement, our case studies page walks through what tests look like once the sample-size discipline is in place. For deeper foundational material on test design, the PMD CRO learning hub collects our video tutorials, podcasts, and longer-form CRO breakdowns. Worth bookmarking if you are building out an in-house testing programme.

FAQs

What sample size do I need for a Shopify A/B test if my conversion rate is 2%?

At 2% baseline CR, 5% relative MDE, 80% power and 95% confidence, you need roughly 130,000 sessions per variant. If you are below 100,000 monthly sessions on the page tested, you have three options: run a longer test, accept a bigger MDE, or accept that the test isn't worth running.

Can I use a free Shopify A/B test sample size calculator instead of paying for software?

Yes. Evan Miller's calculator is the de facto standard for binary tests, and AB Testguide and VWO both publish free versions. The free maths is identical to the paid maths. What you pay for in tools like Convert and VWO is the orchestration, traffic-splitting, and analytics, not the calculation itself.

What is the minimum traffic to run A/B tests on Shopify?

No absolute minimum, but for tests with 5% MDE and 2–3% baseline CR, expect roughly 86,000–130,000 sessions per variant. Sub-£2M ARR brands usually do not have the volume for true split testing. We push smaller brands towards qualitative research and conversion-blocker audits first, then build a testing programme as traffic scales.

How does sample size change if I test 3 variants instead of 2?

Same sample size per variant, plus a correction for multiple comparisons (Bonferroni or similar). In practice: multiply the total two-variant sample by 1.5–2x to be safe. We rarely run more than two variants per test on Shopify brands below £50M ARR.
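A rough check on that multiplier, as a sketch under stated assumptions: each treatment is compared only with the control, the significance level is split evenly across those comparisons (plain Bonferroni), and the same normal-approximation formula is used throughout. The function name `total_sample_multiplier` is illustrative.

```python
from statistics import NormalDist


def total_sample_multiplier(variants, power=0.80, alpha=0.05):
    """Total sample for `variants` arms relative to a two-arm test.

    Assumption: Bonferroni correction over (variants - 1) comparisons
    against the control, two-sided tests, normal approximation.
    """
    z = NormalDist().inv_cdf
    comparisons = variants - 1
    base = (z(1 - alpha / 2) + z(power)) ** 2
    adjusted = (z(1 - alpha / (2 * comparisons)) + z(power)) ** 2
    per_variant_inflation = adjusted / base   # stricter alpha costs sample
    return per_variant_inflation * variants / 2


if __name__ == "__main__":
    print(round(total_sample_multiplier(3), 2))
```

For three variants this lands at about 1.8x the two-variant total, inside the 1.5–2x rule of thumb: roughly 20% more sessions per variant for the stricter threshold, multiplied across three arms instead of two.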

Should I include returning visitors in my sample size calculation?

Only if your hypothesis applies to returning visitors. If you are testing a homepage redesign aimed at cold paid traffic, segment to new visitors only. Mixing populations inflates the sample you actually need and confounds the result.

Where can I learn more about CRO test design and methodology?

The PMD CRO learning hub is the right starting point. It collects video tutorials on test design, our podcast with operators inside £10–200M Shopify brands, and longer-form breakdowns on advertorials and landing pages. If you would rather walk through your testing roadmap directly, you can book a 30-minute call with Paddy McLarnon.
