A/B testing is a beautiful idea. Split your list, vary one thing, measure the response, pick the winner. For two decades, the "thing" for email was usually the subject line, and the response was almost always the open rate.
On 20 September 2021 that stopped working. This article walks through the statistics of why, shows what a defensible subject-line test looks like now, and proposes the modern replacement for the entire A/B workflow.
The maths of an A/B test, pre-MPP
Suppose in 2020 your list of 100,000 was split 50/50 between two subject lines. Baseline open rate around 25%. You observe:
- Variant A: 12,500 opens out of 50,000 sent (25.0%).
- Variant B: 13,200 opens out of 50,000 sent (26.4%).
The absolute difference is 1.4 percentage points. Using a standard proportion test, the p-value for this comparison is comfortably under 0.01 — you can declare variant B the winner with confidence, and the 1.4 points represents a real lift in human attention.
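That arithmetic is easy to verify. A minimal sketch of the two-proportion z-test using only the standard library, with the counts above:

```python
from math import erfc, sqrt

def two_proportion_test(opens_a, n_a, opens_b, n_b):
    """Two-sided two-proportion z-test with a pooled standard error."""
    p_a, p_b = opens_a / n_a, opens_b / n_b
    pooled = (opens_a + opens_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided, normal approximation
    return z, p_value

z, p = two_proportion_test(12_500, 50_000, 13_200, 50_000)
print(f"z = {z:.2f}, p = {p:.1e}")  # z is around 5; p is far below 0.01
```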
The same test, post-MPP
Fast-forward to 2027. The same list, same split, same test. Your reported open rate for both variants now sits around 60%, heavily inflated by Apple MPP pre-fetch. You observe:
- Variant A: 30,000 opens out of 50,000 sent (60.0%).
- Variant B: 30,800 opens out of 50,000 sent (61.6%).
The absolute difference is a comparable 1.6 percentage points, and the p-value again appears significant. But here is the catch: both numbers are dominated by a deterministic floor of pre-fetch opens that is independent of the subject line. Roughly 40 percentage points of both numbers come from Apple MPP firing the pixel whether or not a human read anything.
Strip out the pre-fetch floor. The "real human open" component is maybe 20% for A and 21.6% for B. That is a 1.6-point lift on a 20-point base — relatively a bigger difference, but measured on a quantity you cannot observe cleanly.
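The decomposition is simple arithmetic; a sketch, assuming the ~40-point pre-fetch floor described above (an illustrative figure, not something you can read off a report):

```python
prefetch_floor = 0.40  # assumed MPP pre-fetch contribution, per the text

reported = {"A": 0.600, "B": 0.616}
human = {k: round(v - prefetch_floor, 3) for k, v in reported.items()}

lift = human["B"] - human["A"]
print(human)  # estimated human-open components
print(f"relative lift on the human base:   {lift / human['A']:.0%}")
print(f"relative lift on the reported base: "
      f"{(reported['B'] - reported['A']) / reported['A']:.1%}")
```

The same 1.6-point absolute lift is an 8% relative effect on the human base but looks like a 2.7% effect on the reported base, and the human base itself is an estimate, not an observation.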
Why the signal-to-noise ratio collapses
The real problem is not the arithmetic. It is that the pre-fetch component is not a constant background — it is a source of noise that scales with list composition.
If variant A happens, by accident, to be sent to a slightly more Apple-heavy half of the list, variant A's open rate will be higher regardless of the subject line. Your ESP's random split is supposed to handle that, but with list skew, random sampling variation, and the sheer magnitude of the MPP contribution, the noise floor on the difference is much higher than it was in 2020.
A 2020-style A/B test that would have called a 1.5-point lift significant now requires a 4–6 point lift to be defensible, because the MPP noise floor is swallowing smaller effects. Most subject-line tests produce real effects of 1–3 points — which means most tests are now statistically invisible.
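One way to see the noise floor is a quick simulation. The sketch below assumes MPP recipients always fire the open pixel, gives both arms an identical 25% human open rate, and makes arm B three points more Apple-heavy than arm A (an invented imbalance, purely for illustration):

```python
import random

random.seed(7)

def simulated_open_rate(n, apple_share, human_open_rate):
    """Reported open = MPP pre-fetch pixel fires OR a real human open."""
    opens = sum(
        1
        for _ in range(n)
        if random.random() < apple_share or random.random() < human_open_rate
    )
    return opens / n

a = simulated_open_rate(50_000, 0.40, 0.25)  # arm A: 40% Apple share
b = simulated_open_rate(50_000, 0.43, 0.25)  # arm B: 43% Apple share
print(f"A: {a:.1%}  B: {b:.1%}  spurious lift: {(b - a) * 100:+.1f} pts")
```

Both arms saw the same subject line, yet arm B reports a lift of around two points — the size of a typical "winning" subject-line effect — driven entirely by list composition.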
What a defensible subject-line test looks like now
If you still want to test subject lines — and you should, because subject lines do matter — here is the modern approach.
1. Measure click-through, not open
The subject line affects whether the recipient opens the message, which affects whether they click a link inside. Click rate is therefore a downstream indicator of subject-line effectiveness and is much less polluted by pre-fetch.
The signal is weaker — click rates are lower, so you need more volume — but the noise floor is also much lower. Net-net, click-based subject-line tests have better statistical power than open-based ones in 2027.
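To size the volume question, the standard two-proportion sample-size approximation is enough. A sketch with illustrative rates — the 3% baseline click rate, 25% baseline open rate, and 10% relative lift are assumptions, not benchmarks:

```python
from math import ceil

def n_per_arm(p1, p2, z_alpha=1.96, z_beta=0.8416):
    """Approximate per-arm sample size for a two-proportion test
    (normal approximation, two-sided alpha = 0.05, power = 0.80)."""
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Detecting a 10% relative lift on a 3% click rate takes tens of
# thousands of recipients per arm...
print(n_per_arm(0.030, 0.033))
# ...while the same relative lift on a 25% open rate needed far fewer.
print(n_per_arm(0.250, 0.275))
```

The click-based test needs roughly ten times the volume, which is the trade the article describes: more recipients per arm, in exchange for a metric that is not swamped by pre-fetch.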
2. Stratify by provider
Send variant A and B in matched pairs within each provider segment. Measure the lift per provider. If variant B wins on Gmail and Outlook but not on Apple, that is still useful information — subject-line effects are partly provider-dependent because inbox algorithms vary.
3. Use reply rate for outbound
For cold and warm outbound, reply rate is the cleanest subject-line signal available. A reply requires a human to both open and respond. Sample sizes are smaller, so you need either bigger lifts or bigger lists, but the signal is uncontaminated.
4. Use a longer time window
Pre-fetch opens happen fast; real human opens and clicks accumulate over days. If you measure the 48-hour window rather than the 2-hour window, the human signal grows and the pre-fetch contribution becomes proportionally smaller.
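A sketch of why the window matters, using assumed figures (a fixed 40-point pre-fetch floor that fires within minutes, and human opens that keep accumulating over the window):

```python
prefetch = 0.40  # assumed MPP floor; lands almost immediately after send
human_opens_by_window = {"2h": 0.08, "24h": 0.16, "48h": 0.20}  # illustrative

for window, human in human_opens_by_window.items():
    reported = prefetch + human
    share = prefetch / reported
    print(f"{window:>3}: reported {reported:.0%}, pre-fetch share {share:.0%}")
```

Under these assumptions the pre-fetch share of reported opens falls from roughly 83% at two hours to roughly 67% at 48 hours — the human signal does not get louder, but the contamination gets proportionally smaller.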
The modern replacement: content testing, not subject-line testing
The deeper lesson is that subject-line A/B testing was always a proxy for "what gets attention", and attention now shows up in clicks, replies, and conversions rather than opens. Modern email teams are shifting towards testing the entire message (subject + preheader + content + CTA) against a downstream outcome (click, reply, conversion), using the subject line as one variable in a multivariate framework.
This is a better test in every dimension:
- It measures a real outcome, not a pixel event.
- It captures subject-line effects indirectly but cleanly — a bad subject suppresses downstream actions.
- It tests the thing you actually care about, which is whether the message worked.
Worked example: modern A/B/n test
Suppose you want to test three subject lines on a 300,000-recipient newsletter with a CTA to read a blog post. Modern design:
- Split the list into three equal cells of 100,000.
- Send each cell a different subject line, with identical body and CTA.
- Measure over 48 hours: verified clicks on the CTA link, and any reply to the sender address.
- Ignore open rate entirely.
- Report per-provider click rate per variant. Pick the variant that wins at Gmail and Outlook (the providers where clicks are cleanest).
You will get less pseudo-precision than the old open-rate report — the numbers are smaller, the confidence intervals are wider — but what you get will be trustworthy and actionable.
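The per-provider report in the last step reduces to a small aggregation. A sketch with invented counts — none of these are real campaign numbers, and the provider segmentation is assumed to come from your ESP:

```python
def provider_report(results):
    """results: {provider: {variant: (clicks, delivered)}} -> print rates + winner."""
    for provider, variants in results.items():
        rates = {v: clicks / delivered for v, (clicks, delivered) in variants.items()}
        winner = max(rates, key=rates.get)
        pretty = ", ".join(f"{v}: {r:.2%}" for v, r in sorted(rates.items()))
        print(f"{provider:<8} {pretty}  -> {winner}")

results = {  # hypothetical 48-hour verified-click counts per cell
    "gmail":   {"A": (1_540, 42_000), "B": (1_720, 42_000), "C": (1_480, 42_000)},
    "outlook": {"A": (480, 18_000),   "B": (530, 18_000),   "C": (465, 18_000)},
    "apple":   {"A": (610, 30_000),   "B": (640, 30_000),   "C": (605, 30_000)},
}
provider_report(results)
```

In this invented data, variant B wins at both Gmail and Outlook, so it would be declared the winner even if the Apple segment is too noisy to call.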
Subject-line tests assume the mail reached the recipient. If your inbox placement drops, no subject line will save the campaign. Inbox Check runs seed-based placement tests so you can separate a subject-line problem from a deliverability problem. Free test at the homepage.
What to stop doing
- Stop running 10%/10%/80% champion-challenger splits based on 2-hour open rate. You are almost always picking pre-fetch noise as the winner.
- Stop reporting "open rate lift" as evidence a subject line worked. At best it is weak evidence; at worst it is spurious.
- Stop letting ESPs auto-declare subject-line winners for you. The automation is doing arithmetic on a metric that no longer means what the arithmetic assumes.