Metrics · 8 min read

Why iOS 15 killed subject-line A/B testing

A/B testing subject lines on open rate is still the default in most ESPs. Since Apple's Mail Privacy Protection (MPP) arrived with iOS 15, it has been statistically meaningless — the noise floor swallows any real effect. The maths is worth understanding before you run the next test.

A/B testing is a beautiful idea. Split your list, vary one thing, measure the response, pick the winner. For two decades, the "thing" for email was usually the subject line, and the response was almost always the open rate.

On 20 September 2021 that stopped working. This article walks through the statistics of why, shows what a defensible subject-line test looks like now, and proposes the modern replacement for the entire A/B workflow.

The maths of an A/B test, pre-MPP

Suppose in 2020 your list of 100,000 was split 50/50 between two subject lines. Baseline open rate around 25%. You observe:

  • Variant A: 12,500 opens out of 50,000 sent (25.0%).
  • Variant B: 13,200 opens out of 50,000 sent (26.4%).

The absolute difference is 1.4 percentage points. Using a standard two-proportion z-test, the p-value for this comparison is comfortably under 0.01 — you can declare variant B the winner with confidence, and the 1.4 points represent a real lift in human attention.
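The claim is easy to check. A minimal two-proportion z-test in Python (stdlib only), using the counts above:

```python
from math import sqrt, erfc

def two_proportion_z(x_a, n_a, x_b, n_b):
    """Two-sided two-proportion z-test on raw open counts."""
    p_a, p_b = x_a / n_a, x_b / n_b
    pooled = (x_a + x_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided, via the normal tail
    return z, p_value

z, p = two_proportion_z(12_500, 50_000, 13_200, 50_000)
print(f"z = {z:.2f}, p = {p:.1e}")  # z ≈ 5.1 — far beyond the 0.01 threshold
```

A z-score above 5 corresponds to a p-value on the order of 10⁻⁷, which is why a 1.4-point gap on 50,000 per cell was an easy call in 2020.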

The same test, post-MPP

Fast-forward to 2027. The same list, same split, same test. Your reported open rate for both variants now sits around 60%, heavily inflated by Apple MPP pre-fetch. You observe:

  • Variant A: 30,000 opens out of 50,000 sent (60.0%).
  • Variant B: 30,800 opens out of 50,000 sent (61.6%).

The absolute difference is now 1.6 percentage points, and the p-value still appears significant. But here is the catch: both numbers are dominated by a deterministic floor of pre-fetch opens that is independent of the subject line. Roughly 40 percentage points of each figure come from Apple MPP firing the tracking pixel whether or not a human read anything.

Strip out the pre-fetch floor and the "real human open" component is maybe 20% for A and 21.6% for B. That is a 1.6-point lift on a 20-point base — an 8% relative difference, larger than before, but measured on a quantity you cannot observe cleanly.
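To make the decomposition concrete, here is that arithmetic as a sketch. The 40-point floor is an assumption for illustration — in practice it is unobservable and varies with list composition:

```python
ASSUMED_PREFETCH_FLOOR = 0.40  # illustrative; not directly measurable

def human_open_rate(observed_rate):
    """Back out the hypothetical human open rate under the assumed floor."""
    return observed_rate - ASSUMED_PREFETCH_FLOOR

a_human = human_open_rate(0.600)  # ~20%
b_human = human_open_rate(0.616)  # ~21.6%
relative_lift = (b_human - a_human) / a_human
print(f"relative lift on the human base: {relative_lift:.1%}")  # 8.0%
```

The subtraction only works if the floor really is constant across both cells — which, as the next section shows, it is not.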

Why the signal-to-noise ratio collapses

The real problem is not the arithmetic. It is that the pre-fetch component is not a constant background — it is a source of noise that scales with list composition.

If variant A happens to be sent to a slightly Apple-heavier half of the list by accident, variant A's open rate will be higher regardless of the subject line. Your ESP's random split is supposed to handle that, but with list skew, random sampling variation, and the sheer magnitude of the MPP contribution, the noise floor on the difference is much higher than it was in 2020.
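A back-of-envelope calculation shows the size of the problem. Both open probabilities below are assumed for illustration (Apple recipients register a pre-fetch "open" almost regardless of content):

```python
# Assumed open probabilities: Apple recipients mostly "open" via pre-fetch;
# everyone else opens at a human baseline. Both figures are illustrative.
P_APPLE, P_OTHER = 0.95, 0.25

def observed_open_rate(apple_share):
    return apple_share * P_APPLE + (1 - apple_share) * P_OTHER

# Identical subject lines, but cell B landed a 1-point heavier Apple share.
spurious_gap = observed_open_rate(0.56) - observed_open_rate(0.55)
print(f"spurious 'lift' from a 1-point Apple skew: {spurious_gap:.2%}")  # 0.70%
```

Under these assumptions, a 1-point composition skew manufactures a 0.7-point "lift" with zero subject-line effect — nearly half the gap the pre-MPP example called significant.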

The practical consequence

A 2020-style A/B test that would have called a 1.5-point lift significant now requires a 4–6 point lift to be defensible, because the MPP noise floor is swallowing smaller effects. Most subject-line tests produce real effects of 1–3 points — which means most tests are now statistically invisible.

What a defensible subject-line test looks like now

If you still want to test subject lines — and you should, because subject lines do matter — here is the modern approach.

1. Measure click-through, not open

The subject line affects whether the recipient opens the message, which affects whether they click a link inside. Click rate is therefore a downstream indicator of subject-line effectiveness and is much less polluted by pre-fetch.

The signal is weaker — click rates are lower, so you need more volume — but the noise floor is also much lower. Net-net, click-based subject-line tests have better statistical power than open-based ones in 2027.

2. Stratify by provider

Send variant A and B in matched pairs within each provider segment. Measure the lift per provider. If variant B wins on Gmail and Outlook but not on Apple, that is still useful information — subject-line effects are partly provider-dependent because inbox algorithms vary.

3. Use reply rate for outbound

For cold and warm outbound, reply rate is the cleanest subject-line signal available. A reply requires a human to both open and respond. Sample sizes are smaller, so you need either bigger lifts or bigger lists, but the signal is uncontaminated.

4. Use a longer time window

Pre-fetch opens happen fast; real human opens and clicks accumulate over days. If you measure the 48-hour window rather than the 2-hour window, the human signal grows and the pre-fetch contribution becomes proportionally smaller.
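The effect of widening the window is easy to see with assumed counts (all figures illustrative — pre-fetch volume flattens after the first hours while human opens keep accumulating):

```python
# Illustrative counts: pre-fetch opens land almost entirely in the first
# hours and then flatten; human opens keep accumulating over ~48 hours.
prefetch_opens = 20_000
human_opens = {"2h": 4_000, "48h": 11_000}

for window, human in human_opens.items():
    prefetch_share = prefetch_opens / (prefetch_opens + human)
    print(f"{window} window: pre-fetch is {prefetch_share:.0%} of opens")
```

Under these assumed counts the pre-fetch share falls from roughly 83% of measured opens at 2 hours to roughly 65% at 48 hours — still dominant, but meaningfully diluted.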

The modern replacement: content testing, not subject-line testing

The deeper lesson is that subject-line A/B testing was always a proxy for "what gets attention", and attention now shows up in clicks, replies, and conversions rather than opens. Modern email teams are shifting towards testing the entire message (subject + preheader + content + CTA) against a downstream outcome (click, reply, conversion), using the subject line as one variable in a multivariate framework.

This is a better test in every dimension:

  • It measures a real outcome, not a pixel event.
  • It captures subject-line effects indirectly but cleanly — a bad subject suppresses downstream actions.
  • It tests the thing you actually care about, which is whether the message worked.

Worked example: modern A/B/n test

Suppose you want to test three subject lines on a 300,000-recipient newsletter with a CTA to read a blog post. Modern design:

  1. Split the list into three equal cells of 100,000.
  2. Send each cell a different subject line, with identical body and CTA.
  3. Measure over 48 hours: verified clicks on the CTA link, and any reply to the sender address.
  4. Ignore open rate entirely.
  5. Report per-provider click rate per variant. Pick the variant that wins at Gmail and Outlook (the providers where clicks are cleanest).
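The per-provider reporting in step 5 can be sketched as follows. Every count here is hypothetical, and the provider keys and variant labels are made up for illustration:

```python
# Hypothetical (clicks, sent) counts per provider and subject-line variant.
clicks = {
    "gmail":   {"A": (1_240, 42_000), "B": (1_390, 42_000), "C": (1_305, 42_000)},
    "outlook": {"A": (610, 21_000),   "B": (688, 21_000),   "C": (640, 21_000)},
    "apple":   {"A": (820, 30_000),   "B": (830, 30_000),   "C": (845, 30_000)},
}

CLEAN_PROVIDERS = ("gmail", "outlook")  # where click tracking is least polluted

def clean_click_rate(variant):
    c = sum(clicks[p][variant][0] for p in CLEAN_PROVIDERS)
    n = sum(clicks[p][variant][1] for p in CLEAN_PROVIDERS)
    return c / n

for v in "ABC":
    print(f"{v}: {clean_click_rate(v):.2%} click rate on clean providers")
winner = max("ABC", key=clean_click_rate)
print(f"winner: {winner}")
```

Note that the Apple cell is tallied but excluded from the decision — exactly the discipline step 5 asks for.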

You will get less pseudo-precision than the old open-rate report — the numbers are smaller, the confidence intervals are wider — but what you get will be trustworthy and actionable.

Measure the delivery side, too

Subject-line tests assume the mail reached the recipient. If your inbox placement drops, no subject line will save the campaign. Inbox Check runs seed-based placement tests so you can separate a subject-line problem from a deliverability problem. Free test at the homepage.

What to stop doing

  • Stop running 10%/10%/80% champion-challenger splits based on 2-hour open rate. You are almost always picking pre-fetch noise as the winner.
  • Stop reporting "open rate lift" as evidence a subject line worked. At best it is weak evidence; at worst it is spurious.
  • Stop letting ESPs auto-declare subject-line winners for you. The automation is doing arithmetic on a metric that no longer means what the arithmetic assumes.

FAQ

How much bigger does a list need to be now to detect a subject-line effect?

Roughly 3–5x. If a 2020 test needed 20,000 per cell to detect a 2-point lift, a 2027 click-based test needs 60,000–100,000 per cell for comparable power.
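That multiplier falls out of the standard sample-size formula for a two-proportion test. A hedged sketch (normal approximation, 80% power, two-sided α = 0.05 — the exact numbers shift with those choices and with your assumed base rates):

```python
from math import ceil

def n_per_cell(base_rate, relative_lift, z_alpha=1.96, z_power=0.84):
    """Normal-approximation sample size per cell for a two-proportion test."""
    p1 = base_rate
    p2 = base_rate * (1 + relative_lift)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

# Open-based test at a 25% base, 8% relative (2-point absolute) lift:
print(n_per_cell(0.25, 0.08))  # a few thousand per cell
# Click-based test at a 3% base, 10% relative lift:
print(n_per_cell(0.03, 0.10))  # tens of thousands per cell
```

The driver is the denominator: the absolute gap between two click rates is tiny even when the relative lift is healthy, so the required n scales up sharply.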

Can I still use ESP-reported open rate if I only care about relative changes over time?

Longitudinally, on a stable list, yes — a 20-point drop in open rate over a month is almost certainly real. For one-off A/B comparisons the metric is too noisy.

What about preheader tests?

Same logic as subject lines. Measure downstream — click, reply, conversion — not the open event itself.

Do any ESPs handle this correctly out of the box?

A handful support click-based champion-challenger testing. Most still default to open-based. Check the test configuration carefully before you set a recurring automation.

Check your deliverability across 20+ providers

Gmail, Outlook, Yahoo, Mail.ru, Yandex, GMX, ProtonMail and more. Real inbox screenshots, SPF/DKIM/DMARC, spam engine verdicts. Free, no signup.

Run Free Test →

Unlimited tests · 20+ seed mailboxes · Live results · No account required