Email Degradation: Spot It Before Crisis

The conventional failure story is dramatic: a campaign triggers a spam complaint spike, reputation tanks, placement drops from 90% to 30% in 48 hours, the on-call channel lights up. It happens, but it is the exception.

The common story is boring: placement at Gmail drifts from 88% to 84% to 79% to 72% over eight weeks. No alert fires because each drop is inside your tolerance, and nobody notices until open rates have halved. By the time someone investigates, three ESP migrations have blurred the signal and the original cause is forgotten.

The shape of degradation

Slow degradation is most commonly driven by list hygiene drift (accumulating unengaged recipients), content drift (a template change six weeks ago picked up a subtle trigger), or volume drift (gradual send-volume creep past the capacity your reputation can support). Each is invisible to a point-in-time alert.

Why static thresholds miss degradation

Static thresholds fire on level, not slope. If your threshold is "alert below 75%" and your rate drifts from 88% to 77% over six weeks, you never alert. When the rate finally crosses 75%, the alert is a late signal: you have already been in the degraded state for weeks.

Rolling thresholds (compare today to last 7 days) catch sudden drops but also miss slow drift, because a 1-point drop per week stays inside the week-over-week noise floor.
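To make the blind spot concrete, the sketch below replays the drift described above (88% to 77% over six weeks) against both alert styles. The 75% floor comes from the text; the 3-point week-over-week tolerance is an assumed noise floor, and the series is synthetic and noise-free.

```python
# Synthetic daily placement: linear drift from 88% to 77% over 42 days.
DAYS = 42
rates = [88.0 - 11.0 * d / (DAYS - 1) for d in range(DAYS)]

STATIC_FLOOR = 75.0   # "alert below 75%" from the text
WOW_TOLERANCE = 3.0   # assumed week-over-week noise floor, in percentage points

# Static threshold: fires only when the level itself is below the floor.
static_alerts = [d for d, r in enumerate(rates) if r < STATIC_FLOOR]

# Rolling threshold: compare today with the mean of the previous 7 days.
rolling_alerts = []
for d in range(7, DAYS):
    baseline = sum(rates[d - 7:d]) / 7
    if baseline - rates[d] > WOW_TOLERANCE:
        rolling_alerts.append(d)

total_drop = rates[0] - rates[-1]
print(f"total drop: {total_drop:.1f}pp, static alerts: {len(static_alerts)}, "
      f"rolling alerts: {len(rolling_alerts)}")
```

Both alert lists stay empty even though the rate has lost 11 points, which is the gap a slope-based check is meant to close.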

Detecting the slope, not the level

The trick is to fit a linear regression to the last N days and alert on the slope. If the 30-day trend line has a negative slope steeper than -0.5 percentage points per week, that is a degradation signal even if the current rate is still above your static floor.

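# Python sketch (requires SciPy for scipy.stats.linregress)
import sqlite3
import time

from scipy.stats import linregress

DAY_MS = 86_400_000
domain = 'acme.io'  # example sending domain

def alert(message):
    # Placeholder: replace with your paging or chat integration.
    print(message)

db = sqlite3.connect('monitor.db')
# One row per day for the last 30 days; ts is assumed to be epoch milliseconds.
cur = db.execute("""
  SELECT (ts / 86400000) AS day_idx, AVG(100.0 * inbox / total) AS rate
  FROM placements
  WHERE domain = ? AND ts > ? AND total > 0
  GROUP BY day_idx
  ORDER BY day_idx
""", (domain, int(time.time() * 1000) - 30 * DAY_MS))

rows = cur.fetchall()
days = [r[0] for r in rows]
rates = [r[1] for r in rows]

# linregress needs at least two distinct days; fewer cannot define a trend.
if len(rows) >= 2:
    slope, intercept, r, p, se = linregress(days, rates)

    # slope is in percentage points per day;
    # -0.07 pp/day is roughly -0.5 pp/week, the alert threshold
    if slope < -0.07 and p < 0.05:
        alert(f"{domain} degrading at {slope*7:.1f}pp/week (p={p:.3f})")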

Or: CUSUM, the no-fuss alternative

If regression feels heavy, CUSUM (cumulative sum) is a simpler change-point detector that works in plain SQL. It accumulates deviations from a reference mean; when the cumulative sum exceeds a threshold, you have a change point.

-- CUSUM in pure SQLite (window functions need SQLite 3.25 or later)
WITH reference AS (
  SELECT AVG(100.0 * inbox/total) AS mu
  FROM placements
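  -- reference window: 60 to 30 days ago (stacked modifiers add up, so
  -- '-60 days','-30 days' would have meant 90 days ago)
  WHERE domain = 'acme.io' AND ts > strftime('%s','now','-60 days') * 1000
                          AND ts < strftime('%s','now','-30 days') * 1000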
),
recent AS (
  SELECT ts, (100.0 * inbox / total) AS rate
  FROM placements
  WHERE domain = 'acme.io' AND ts > strftime('%s','now','-30 days') * 1000
  ORDER BY ts
)
SELECT
  datetime(ts/1000, 'unixepoch') AS observed_at,  -- WHEN is a reserved word in SQLite
  rate,
  SUM(rate - (SELECT mu FROM reference)) OVER (ORDER BY ts) AS cusum
FROM recent;

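A CUSUM that trends negative for two weeks and falls below -30 (summed percentage points) indicates degradation. Because the query adds one deviation per row, the right threshold depends on how many placement measurements you record per day as well as on your baseline variance; tune it over a month of data.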
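The SQL above is a plain running sum of deviations. The textbook one-sided (tabular) CUSUM adds two refinements: a slack value so ordinary noise does not accumulate, and a reset to zero so earlier good weeks cannot mask new bad ones. The following is a minimal Python sketch of that variant, not a translation of the query; the function name, the slack of 1 point, and the threshold of 30 are illustrative assumptions rather than tuned recommendations.

```python
def lower_cusum(rates, mu, slack=1.0, threshold=30.0):
    """One-sided tabular CUSUM for downward shifts.

    rates: observations in percent, oldest first.
    mu: reference mean from a known-good window.
    slack: shortfall (pp) tolerated per observation before accumulating.
    threshold: cumulative shortfall (pp) that counts as a change point.
    Returns (index of first alarm or None, list of CUSUM values).
    """
    s = 0.0
    history = []
    alarm_at = None
    for i, rate in enumerate(rates):
        # Accumulate only the shortfall beyond the slack; never go below zero.
        s = max(0.0, s + (mu - rate - slack))
        history.append(s)
        if alarm_at is None and s > threshold:
            alarm_at = i
    return alarm_at, history


# Synthetic example: 10 stable days at 88%, then a loss of 0.5pp per day.
series = [88.0] * 10 + [88.0 - 0.5 * d for d in range(1, 21)]
alarm_at, history = lower_cusum(series, mu=88.0)
print("first alarm at index:", alarm_at)
```

Note the sign convention: this version accumulates the shortfall as a positive number, so the alarm condition is "above 30" rather than "below -30".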

Four signals that predict degradation

1. Rising complaint rate

Gmail Postmaster Tools reports user-reported spam as a percentage of delivered mail. Anything above 0.1% is concerning; above 0.3% is already hurting you. A rising complaint rate typically leads the placement drop by 2 to 4 weeks.
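If you also track complaints from your own feedback-loop counts, the two bands translate directly into code. This is a small sketch under that assumption; the function name and labels are hypothetical, and Postmaster Tools itself reports the percentage rather than raw counts.

```python
def complaint_band(complaints, delivered):
    """Classify a complaint rate using the 0.1% / 0.3% bands from the text."""
    if delivered <= 0:
        raise ValueError("delivered must be positive")
    rate_pct = 100.0 * complaints / delivered
    if rate_pct > 0.3:
        return rate_pct, "hurting"
    if rate_pct > 0.1:
        return rate_pct, "concerning"
    return rate_pct, "ok"


print(complaint_band(40, 50_000))   # 0.08% of delivered mail
print(complaint_band(90, 50_000))   # 0.18% of delivered mail
print(complaint_band(200, 50_000))  # 0.40% of delivered mail
```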

2. Falling engagement

Open rate decay on a steady sending pattern means your list is drifting toward unengaged recipients. Mailbox providers infer this from recipient behavior and start routing your mail to spam. Engagement decay precedes placement decay by about a month.

3. Specific-provider divergence

When Gmail is holding steady at 88% but Outlook is drifting from 82% down to 70% over six weeks, that is a Microsoft-specific signal, usually visible in SNDS (Smart Network Data Services) reputation data or in JMRP (Junk Mail Reporting Program) complaints. Monitor per-provider, not just aggregate.

4. Bounce rate creep

A slow rise in hard bounces indicates list hygiene is drifting. Even before it affects placement directly, it costs you reputation in the aggregate.

Run the regression weekly, not nightly

Degradation moves on a multi-week timescale. A nightly regression check introduces noise without signal. Run it every Sunday at midnight and alert if the slope crossed the threshold that week. Your on-call rotation will thank you.
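One reading of "alert if the slope crossed the threshold that week" is edge-triggered alerting: page on the run where the slope first drops below the threshold, not on every run while it stays there. The sketch below assumes you store last week's slope somewhere; the function name and the sample slopes are hypothetical.

```python
SLOPE_ALERT_PP_PER_WEEK = -0.5  # threshold from the text


def crossed_this_week(previous_slope, current_slope,
                      threshold=SLOPE_ALERT_PP_PER_WEEK):
    """True only on the run where the slope first drops below the threshold.

    previous_slope: last week's stored slope in pp/week, or None on first run.
    current_slope: this week's slope in pp/week.
    """
    was_below = previous_slope is not None and previous_slope < threshold
    is_below = current_slope < threshold
    return is_below and not was_below


# Four consecutive weekly runs: only the third one should page anyone.
weekly_slopes = [-0.1, -0.3, -0.7, -0.9]
previous = None
fired = []
for week, slope in enumerate(weekly_slopes, start=1):
    if crossed_this_week(previous, slope):
        fired.append(week)
    previous = slope
print("alerted in weeks:", fired)
```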

Intervention playbook

When the degradation alert fires, here is the investigation order:

1. Cohort the placement by template. If one template dominates the drift, it is a content issue.
2. Cohort by audience segment. If one segment (say, users who signed up more than 18 months ago and have not opened in 90 days) is driving the drop, it is a list hygiene issue.
3. Check volume trajectory. If send volume grew 40% over the same window, you may have exceeded the reputation ceiling of your current setup.
4. Check DNS and ESP changes. Git log or change log for the last 8 weeks. Anything touched on the sending path?
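The two cohort checks are the same operation with a different grouping key. A minimal sketch, assuming each placement row can be tagged with a cohort label (template name or segment name): it compares each cohort's placement in the first and second half of the window and sorts the worst decliner first. The function name, row layout, and sample data are all hypothetical.

```python
from collections import defaultdict


def drop_by_cohort(rows):
    """Placement change per cohort between the first and second half of a window.

    rows: iterable of (cohort, day_idx, inbox, total) tuples.
    Returns {cohort: change in placement, in percentage points}, worst first.
    """
    rows = list(rows)
    midpoint = (min(r[1] for r in rows) + max(r[1] for r in rows)) / 2
    # counts[cohort][half] = [inbox, total]; half 0 is early, 1 is late
    counts = defaultdict(lambda: [[0, 0], [0, 0]])
    for cohort, day_idx, inbox, total in rows:
        half = 1 if day_idx > midpoint else 0
        counts[cohort][half][0] += inbox
        counts[cohort][half][1] += total
    changes = {}
    for cohort, (early, late) in counts.items():
        if early[1] and late[1]:  # skip cohorts missing from either half
            changes[cohort] = 100.0 * (late[0] / late[1] - early[0] / early[1])
    return dict(sorted(changes.items(), key=lambda kv: kv[1]))


# Synthetic rows over 8 weeks: "promo" decays, "receipt" holds steady.
rows = []
for day in range(56):
    rows.append(("receipt", day, 90, 100))
    rows.append(("promo", day, 88 - day // 4, 100))

changes = drop_by_cohort(rows)
print(changes)
```

If one cohort carries nearly all of the decline, that points to the content or hygiene branch of the playbook; if every cohort declines together, move on to the volume and infrastructure checks.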

FAQ

How much historical data do I need before slope detection works?

30 days minimum, 60 days comfortable. Less than that and the regression is too noisy to trust. Start collecting placement data before you need it.

Can I use Grafana's built-in alerting for this?

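Yes, with one caveat. Grafana 10+ supports expression-based alerting, but its reduce, math, and threshold expressions do not fit a regression themselves, so the 30-day slope has to come from the data source query: for example PromQL's deriv(), which fits a least-squares line over the range, or a SQL aggregate such as PostgreSQL's regr_slope(). Alert when that value drops below your threshold. Less code than the Python example but harder to debug.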

What if my baseline rate has always been ~70%?

Slope detection works on any baseline. The threshold is the rate of change, not the absolute value. A 70% sender decaying to 60% trips the same regression alert as a 90% sender decaying to 80%.

Is this overkill for small senders?

For fewer than 10k messages per month, probably. For anything at scale (100k+ monthly, or anything transactional where placement matters), slope detection pays for itself the first time it catches a regression a month early.
