How to Brief a CRO Test So It Moves the Number | PMD

[Figure: a one-page test brief with labelled sections for Hypothesis, Primary KPI, Guardrail, Sample Size, Decision Rule, and Owner.]
A test brief is a one-page contract between you, the data, and the next six weeks of work.

The day I lost faith in vague briefs was a Tuesday in February. We were six weeks into a PDP test for a £22M skincare brand. Long page versus short page, swapping hero copy, tightening the offer. The test had been live for forty-one days. We pulled the data, sat down on Zoom, and within ten minutes the room was at war.

The brand's CMO said the long page was clearly winning. The agency lead said the short page was statistically better on conversion to cart. The CFO, watching quietly, asked the question that ended the meeting: "What were we trying to prove again?" Nobody had a clean answer. The original brief said "improve PDP conversion". That was it. No primary metric. No decision rule. No owner.

We had spent six weeks of dev time, six weeks of traffic, and roughly £40k of paid spend running a test that could not be resolved because the brief had no spine. The "winner" got shipped on vibes, the team lost three weeks debating it, and the next test got rushed because we were now behind schedule. That's the cost of a bad brief. Not the test itself. The arguments after.

I run a CRO agency. I sit in roughly fifteen test reviews a month across our roster, and after a decade of this I can tell you with absolute certainty: most tests don't fail at execution. They fail at the brief. The execution is fine. The dev is fine. The traffic is fine. The brief was sloppy, and the data was always going to be inconclusive because nobody agreed in advance what conclusive meant.

Below are the five things I now refuse to start a test without. If your team is running experiments and arguing about results, this is almost always the gap. For a deeper read on the same problem from a different angle, see our piece on how long an A/B test should actually run, which pairs neatly with the brief mechanics below.

A proper CRO test brief starts with a hypothesis, not a guess

The first line of a good brief is not "we should test the hero image". That's a tactic. A hypothesis is a sentence with three parts: what you're changing, what you expect to happen, and why you believe it. The "why" is the bit that separates a serious operator from someone burning traffic.

Compare these two openings:

  • Weak: "We're going to test a new hero image to improve conversion."
  • Strong: "We believe replacing the lifestyle hero with a product-first hero will lift add-to-cart rate by 8% or more, because session recordings show cold-traffic visitors scroll past the lifestyle hero without engaging in 78% of sessions."

The strong version forces you to commit. You can't dodge afterwards. If add-to-cart moves less than 8%, your hypothesis was wrong and the change probably isn't worth shipping. If it moves more, you've learned something specific you can repeat. That's the point. Tests aren't there to make you feel busy. They exist to compound knowledge. A specific hypothesis is what makes the knowledge stick.

One primary KPI. One guardrail. Defined before launch.

This is where most briefs collapse. Teams pick four metrics, weight them differently, and then argue when two move up and two move down. The solution is brutal but it works: one primary KPI, one guardrail, written down before the test goes live.

The primary KPI is the number you're trying to move. The guardrail is the number you refuse to break in pursuit of it. For a PDP test, the primary might be add-to-cart rate; the guardrail might be average order value, because you don't want a "win" that came from people adding cheaper SKUs. For a checkout test, the primary might be checkout completion; the guardrail might be refund rate at day 30, because pushing people through faster sometimes pushes the wrong people through.

Once those two metrics are locked, every other movement becomes a side note. You stop debating six numbers. You debate two. And critically, you debate them against thresholds you wrote down when nobody knew which variant they preferred. That's the only honest way to read a test.

Weak brief versus strong brief:

  • Hypothesis. Weak: "Improve PDP conversion". Strong: product-first hero will lift ATC by 8%+.
  • Primary KPI. Weak: "Conversion rate". Strong: add-to-cart rate (cold traffic only).
  • Guardrail. Weak: not defined. Strong: AOV must not drop more than 3%.
  • Sample size. Weak: "Run for a few weeks". Strong: 38,000 sessions per variant / 21 days.
  • Decision rule. Weak: not defined. Strong: 95% sig + 8% lift = ship. Else iterate.
  • Owner. Weak: "The team". Strong: CMO decides. CRO lead recommends.

Same test idea, two briefs. Only one of them can be resolved with data.


Calculate sample size and duration before you launch, not after

I cannot count the number of times I've heard a founder say "let's just run it for two weeks and see". Two weeks is not a sample size. It's a calendar guess. And it's the single most expensive habit in DTC experimentation.

Before a test goes live, the brief must contain a sample-size calculation based on three inputs: your baseline conversion rate, the minimum detectable effect (MDE) you care about, and your tolerance for error, meaning both false positives (the confidence level) and false negatives (statistical power). If your baseline ATC rate is 6%, you want to detect an 8% relative lift, and you want 95% confidence with 80% power, you need roughly 38,000 sessions per variant. At your current traffic, that might be three weeks. Or it might be eight. The point is you know before the test starts.
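As a rough sketch of that calculation, here is the standard normal-approximation formula for comparing two proportions, written in Python with only the standard library. The function names and the 4,000 sessions-a-day traffic figure are my own illustrative assumptions, not part of any testing tool. Calculators use slightly different formulas, so expect the answer to land in the same region as the figure above (close to 40,000 per variant with this version) rather than match it to the session.

```python
from math import ceil
from statistics import NormalDist


def sessions_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate sessions needed per variant for a two-sided test
    comparing two conversion rates (normal approximation)."""
    p1 = baseline                       # control rate, e.g. 0.06
    p2 = baseline * (1 + relative_mde)  # rate if the hypothesised lift is real
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # false-positive tolerance
    z_power = NormalDist().inv_cdf(power)          # false-negative tolerance
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


def days_to_run(n_per_variant, daily_sessions, variants=2):
    """Calendar days needed, assuming traffic is split evenly across variants."""
    return ceil(n_per_variant * variants / daily_sessions)


# Inputs from the example above: 6% baseline ATC, 8% relative lift.
n = sessions_per_variant(baseline=0.06, relative_mde=0.08)
print(n)

# Hypothetical traffic of 4,000 eligible sessions a day.
print(days_to_run(n, daily_sessions=4_000))
```

Halve the MDE and the required sample roughly quadruples, which is why the lift you commit to in the hypothesis and the duration you lock in the brief have to be decided together.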

This single discipline eliminates two of the most damaging behaviours in CRO: peeking at data early and declaring fake winners, and stopping tests prematurely because someone is bored of waiting. If your brief locks duration to a number tied to power, those temptations die. We covered the maths in detail in our A/B test duration guide, which I recommend every founder reads once before running another experiment.
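If you want to see why peeking is so damaging, a small simulation makes the point. The sketch below runs A/A tests, where both arms convert at the same rate, so any "winner" is a false positive by construction. It compares checking significance at every look against checking once at the end. The batch sizes, number of looks, and function names are illustrative assumptions, and the significance check is a plain pooled two-proportion z-test rather than whatever your testing tool uses.

```python
import random
from statistics import NormalDist


def is_significant(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-sided, pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return False
    z = (conv_b / n_b - conv_a / n_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value < alpha


def false_positive_rates(rate=0.06, looks=10, batch=500, runs=200, seed=1):
    """Simulate A/A tests. Returns (share flagged at ANY look,
    share flagged at the single final look)."""
    rng = random.Random(seed)
    peeking_hits = final_hits = 0
    for _ in range(runs):
        conv_a = conv_b = n = 0
        flagged_at_any_look = False
        for _look in range(looks):
            # Each look adds another batch of sessions to both arms.
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            n += batch
            significant = is_significant(conv_a, n, conv_b, n)
            flagged_at_any_look = flagged_at_any_look or significant
        peeking_hits += flagged_at_any_look
        final_hits += significant  # result at the last look only
    return peeking_hits / runs, final_hits / runs


peeking, single_look = false_positive_rates()
print(f"winner declared at any look: {peeking:.0%}")
print(f"winner declared at final look only: {single_look:.0%}")
```

The single-look rate should hover around the nominal 5%, while the peeking rate typically comes out several times higher. Nothing about the variants changed. Only the number of chances you gave noise to look like a win.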

"A brief without a decision rule is a permission slip to argue. Write the rule, sign the brief, then go and run the test."

Write the decision rule before you see the data

This is the rule that most teams skip and the one I now consider non-negotiable. Before launch, write three sentences in the brief:

  1. If X happens, we ship the variant.
  2. If Y happens, we iterate and retest.
  3. If Z happens, we kill it and move on.

Concrete example: "If the variant beats control on add-to-cart rate by 8% or more at 95% significance, with no AOV drop greater than 3%, we ship. If the lift is positive but under 8%, we iterate on the hero copy and retest. If the variant loses or moves nothing, we kill the concept and reallocate the slot."

The reason this works is psychological, not statistical. By the time the data comes in, every stakeholder has a favourite. The CMO who briefed it wants it to win. The CFO who funded it wants a clean answer. The agency wants the case study. If you don't write the rule before the data lands, you will rationalise after it. I've watched it happen in rooms where the people involved were genuinely senior and smart. It's not a smarts problem. It's a structure problem. The same logic applies across the funnel in our CRO audit framework, where decision rules show up in every diagnostic.
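One way to make the rule hard to relitigate is to write it down as something a script could apply. The sketch below encodes the ship / iterate / kill example above. The function name and parameters are illustrative. One assumption is mine rather than the example's: it does not say what happens when the lift clears 8% but significance or the AOV guardrail fails, so this version treats that case as "iterate".

```python
def decide(lift, significant, aov_change, min_lift=0.08, max_aov_drop=0.03):
    """Apply the pre-agreed rule and return 'ship', 'iterate', or 'kill'.

    lift        relative change in add-to-cart rate (0.09 means +9%)
    significant True if the result cleared 95% significance
    aov_change  relative change in AOV (-0.02 means a 2% drop)
    """
    if lift <= 0:
        return "kill"  # variant lost or moved nothing
    guardrail_held = aov_change >= -max_aov_drop
    if lift >= min_lift and significant and guardrail_held:
        return "ship"
    # Positive but under threshold, or (assumption) a big lift that
    # failed significance or broke the guardrail.
    return "iterate"


print(decide(lift=0.09, significant=True, aov_change=-0.01))   # ship
print(decide(lift=0.04, significant=True, aov_change=0.00))    # iterate
print(decide(lift=-0.02, significant=False, aov_change=0.00))  # kill
```

Nobody needs to run this in production. The point is that if your rule can't be written this plainly, it isn't a rule yet.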

Name the owner who decides at the end

The final and most political part of a brief. Who decides? Not who runs the test. Not who builds the variant. Who, after the data lands and the decision rule is applied, makes the call to ship, iterate, or kill?

If the brief says "the team decides", the team will not decide. The CMO will defer to the CFO, the CFO will defer to the agency, the agency will defer to the CMO, and the test will sit in limbo for two weeks while the next sprint stalls. Every brief we sign at PM Digital Design names a single decision-maker by job title and human name. Usually it's the CMO or the head of growth. Sometimes it's the founder. It is never "the team".

The decision-maker's job is to apply the rule, not relitigate it. If the decision rule says "8% lift + 95% sig = ship", and the variant hits both, the decision-maker ships. They don't reopen the hypothesis. They don't ask for one more week. They ship. That single piece of discipline is what separates programmes that compound learning from programmes that thrash. You can see this discipline in action in our published case study on how a cart-threshold split test drove £25,535 of additional monthly revenue: the only reason that test got shipped on time was that the decision rule and owner were locked in the brief.

A live brief in action: the apparel case

To make this concrete, here's how a brief looked recently on a £14M apparel brand we worked with. The team had been running tests for nine months with no shipped wins. We didn't change their tooling. We changed their brief template.

The brief:

  • Hypothesis: "We believe replacing the four-step checkout with a single-page express layout will lift checkout completion rate by 6% or more, because mobile session recordings show 41% of users abandon between step two and step three."
  • Primary KPI: checkout completion rate, mobile only.
  • Guardrail: refund rate at day 30 must not rise above 2.1%.
  • Sample size: 22,000 sessions per variant, expected duration 18 days.
  • Decision rule: 6% lift + 95% sig + guardrail held = ship. 2-6% lift = iterate on form fields and retest. Below 2% = kill.
  • Owner: Head of Growth (Sarah).
  • Review meeting: day 21.

The test ran 19 days. Completion rate lifted 7.4%, significance was 96%, refund rate held. Sarah shipped it the same week. The whole programme started compounding from there. Nothing about the tooling, the developers, or the design changed. Only the brief. If you're running a similar setup and want a second pair of eyes on your brief template, grab a 30-minute slot with Paddy McLarnon and we'll walk through one of your live briefs together.
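For teams that keep briefs alongside their test code, the apparel brief above can be captured as a small data structure. This is a minimal sketch, not the PMD template: the class name, field names, and the idea of storing the thresholds as numbers are my illustrative assumptions. The values are the ones from the brief and the result as reported.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the brief can't be edited after sign-off
class CroBrief:
    hypothesis: str
    primary_kpi: str
    guardrail: str
    sessions_per_variant: int
    expected_days: int
    ship_lift: float   # relative lift needed to ship
    kill_lift: float   # below this relative lift, kill
    confidence: float  # significance required to ship
    owner: str

    def decide(self, lift, significance, guardrail_held):
        """Apply the decision rule exactly as written in the brief."""
        if lift < self.kill_lift:
            return "kill"
        if (lift >= self.ship_lift
                and significance >= self.confidence
                and guardrail_held):
            return "ship"
        return "iterate"


brief = CroBrief(
    hypothesis="Single-page express checkout lifts completion rate by 6%+",
    primary_kpi="checkout completion rate, mobile only",
    guardrail="refund rate at day 30 must not rise above 2.1%",
    sessions_per_variant=22_000,
    expected_days=18,
    ship_lift=0.06,
    kill_lift=0.02,
    confidence=0.95,
    owner="Head of Growth (Sarah)",
)

# Reported result: 7.4% lift, 96% significance, refund rate held.
print(brief.decide(lift=0.074, significance=0.96, guardrail_held=True))  # ship
```

Every field is required, so a brief with no guardrail or no owner fails before the test is even built. That is the behaviour you want from a template.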

What I tell every founder and marketer about briefs

The brief is not paperwork. It's the contract between you, the data, and the next six weeks. If it's vague, your six weeks will be vague. If it's specific, your six weeks will compound. Spend more time on the brief than on the test build. Genuinely. Two hours of brief work saves two weeks of arguing.

The teams I see win in CRO are not the teams with the fanciest tools. Convert, VWO, Shopify's native split-testing, Optimizely: it doesn't matter. The teams that win are the teams whose briefs are clean. One hypothesis. One primary KPI. One guardrail. One sample-size calculation. One decision rule. One owner. Six lines on a page. That's the whole job before launch. The rest is execution, and execution is the easy part.

If you want help auditing how your team writes briefs, or you'd like the PMD brief template we use on client programmes, our team handles this kind of work every day. You can see the broader methodology in our profit optimisation service and in the deeper learning material in our CRO learning hub. For brands ready to migrate or rebuild the storefront the testing programme runs on, our Shopify websites service covers the build side end to end.

FAQs

How long should a CRO test brief be?

One page. Six sections: hypothesis, primary KPI, guardrail, sample size with duration, decision rule, owner. If your brief is longer than a page, you're hiding behind detail. If it's shorter, you're missing a section. One page, signed by the owner, before any dev work starts.

Who should write the test brief — the agency or the in-house team?

Whoever runs the experimentation programme drafts it. Whoever owns the metric signs it. In most of our engagements, our CRO lead drafts the brief based on the diagnostic, and the client's head of growth or CMO signs it before we touch code. The signature matters more than the draft. Without sign-off, you're guessing about alignment.

What's the difference between a primary KPI and a guardrail metric?

The primary KPI is the number you're trying to move. The guardrail is the number you refuse to break while moving it. A 12% lift in conversion that quietly tanks AOV by 9% is not a win. The guardrail catches Pyrrhic victories before they ship.

Do I need to calculate sample size if I'm using Shopify's native testing tools?

Yes. Tooling does not absolve you of statistics. Most native tools will happily declare a winner on traffic that's far below what you need for confidence. Run the sample-size maths yourself, lock the duration in the brief, and ignore the tool's nudges until you hit your number.

What if the test result is ambiguous — neither clearly shipping nor clearly killing?

That's exactly what the "iterate" branch of your decision rule is for. Ambiguous results are normal. The brief should pre-define what ambiguous looks like (e.g. positive lift but under MDE) and what the next move is. Without that branch, every ambiguous test becomes a two-week debate.

How many tests should we run before we expect compounding learning?

If your briefs are clean, you'll start compounding from test one because every result, win or lose, teaches you something specific. If your briefs are vague, twenty tests will leave you no smarter than you started. The brief is the compounding mechanism, not the volume of tests. Speak to the PMD team if you want a brief audit on your last five experiments.
