The conventional failure story is dramatic: a campaign triggers a spam complaint spike, reputation tanks, placement drops from 90% to 30% in 48 hours, the on-call channel lights up. It happens, but it is the exception.
The common story is boring: placement at Gmail drifts from 88% to 84% to 79% to 72% over eight weeks, alerts never fire (each drop is inside your tolerance), nobody notices until open rates halve. By the time someone investigates, three ESP migrations have blurred the signal and the original cause is forgotten.
Slow degradation is most commonly driven by list hygiene drift (accumulating unengaged recipients), content drift (a template change six weeks ago picked up a subtle trigger), or volume drift (gradual send-volume creep past the capacity your reputation can support). Each is invisible to a point-in-time alert.
Why static thresholds miss degradation
Static thresholds fire on level, not slope. If your threshold is "alert below 75%" and your rate drifts from 88% to 77% over six weeks, you never alert. When it finally crosses 75% it is a late signal — you are already in the degraded state and have been for weeks.
Rolling thresholds (compare today to last 7 days) catch sudden drops but also miss slow drift, because a 1-point drop per week stays inside the week-over-week noise floor.
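A quick simulation makes the blind spot concrete. The 75% floor comes from the text; the 3-point week-over-week noise floor is an assumed tolerance for the rolling check.

```python
# Simulated daily placement: 88% declining 1 pp/week for six weeks
rates = [88.0 - week * 1.0 for week in range(7) for _ in range(7)]

# Static check: "alert below 75%"
static_fired = any(r < 75.0 for r in rates)

# Rolling check: compare each day to the mean of the prior 7 days,
# alert on a drop of more than 3 points (assumed noise floor)
rolling_fired = any(
    sum(rates[i - 7:i]) / 7 - rates[i] > 3.0
    for i in range(7, len(rates))
)

print(static_fired, rolling_fired)  # → False False: neither alert ever fires
```

Six weeks in, placement is down six points and both monitors are silent; the drift never gets near 75% and never moves more than a point against any 7-day window.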
Detecting the slope, not the level
The trick is to fit a linear regression to the last N days and alert on the slope. If the 30-day trend line has a negative slope steeper than -0.5 percentage points per week, that is a degradation signal even if the current rate is still above your static floor.
# Python, with scipy.stats.linregress
from scipy.stats import linregress
import sqlite3, time

domain = 'acme.io'  # the sending domain to check
db = sqlite3.connect('monitor.db')
cur = db.execute("""
SELECT (ts/86400000) AS day_idx, AVG(100.0 * inbox/total) AS rate
FROM placements
WHERE domain = ? AND ts > ?
GROUP BY day_idx
ORDER BY day_idx
""", (domain, int(time.time() * 1000) - 30 * 86400 * 1000))
rows = cur.fetchall()
days = [r[0] for r in rows]
rates = [r[1] for r in rows]
slope, intercept, r, p, se = linregress(days, rates)
# slope is in "percentage points per day"
# -0.07 pp/day = -0.5 pp/week = alert
if slope < -0.07 and p < 0.05:
    alert(f"{domain} degrading at {slope*7:.1f}pp/week (p={p:.3f})")  # alert() = your notification hook

Or: CUSUM, the no-fuss alternative
If regression feels heavy, CUSUM (cumulative sum) is a simpler change-point detector that works in plain SQL. It accumulates deviations from a reference mean; when the cumulative sum exceeds a threshold, you have a change point.
-- CUSUM in pure sqlite
WITH reference AS (
SELECT AVG(100.0 * inbox/total) AS mu
FROM placements
WHERE domain = 'acme.io' AND ts > strftime('%s','now','-60 days') * 1000
AND ts < strftime('%s','now','-30 days') * 1000
),
recent AS (
SELECT ts, (100.0 * inbox / total) AS rate
FROM placements
WHERE domain = 'acme.io' AND ts > strftime('%s','now','-30 days') * 1000
ORDER BY ts
)
SELECT
datetime(ts/1000, 'unixepoch') AS "when",
rate,
SUM(rate - (SELECT mu FROM reference)) OVER (ORDER BY ts) AS cusum
FROM recent;

A CUSUM that trends negative for two weeks and accumulates below -30 indicates degradation. Exact thresholds depend on your baseline variance; tune them over a month of data.
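The same accumulation can be sketched in plain Python on synthetic data to see how the trip threshold behaves. The -30 threshold is the illustrative value from the text, not a universal constant; the 1 pp/week drift is a made-up example series.

```python
def cusum(rates, mu):
    """Cumulative sum of deviations from the reference mean mu."""
    total, path = 0.0, []
    for r in rates:
        total += r - mu
        path.append(total)
    return path

baseline = [88.0] * 30                       # days 60..30 ago: stable baseline
mu = sum(baseline) / len(baseline)           # reference mean = 88.0
recent = [88.0 - d / 7 for d in range(30)]   # last 30 days: drifting ~1 pp/week
track = cusum(recent, mu)
tripped = min(track) < -30                   # change point: sum crossed -30
print(tripped)  # → True
```

Note that each daily deviation is small (well under a point), but the cumulative sum crosses -30 partway through the month; that accumulation is what makes CUSUM sensitive to slow drift.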
Four signals that predict degradation
1. Rising complaint rate
Gmail Postmaster Tools reports user-reported spam as a percentage of delivered mail. Anything above 0.1% is concerning; above 0.3% is already hurting you. Complaint rate typically leads a placement drop by two to four weeks.
2. Falling engagement
Open rate decay on a steady sending pattern means your list is drifting toward unengaged recipients. MTAs infer this and start routing to spam. Engagement decay precedes placement decay by about a month.
3. Specific-provider divergence
When Gmail is holding steady at 88% but Outlook is drifting from 82% down to 70% over six weeks, that is a Microsoft-specific signal — usually a reputation or complaint problem visible in Smart Network Data Services (SNDS). Monitor per-provider, not just aggregate.
4. Bounce rate creep
A slow rise in hard bounces indicates list hygiene is drifting. Even before it affects placement directly, it costs you reputation in the aggregate.
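The four signals above can be folded into a single weekly pre-check. A minimal sketch follows; the complaint-rate thresholds (0.1% and 0.3%) come from the text, while the slope and divergence cutoffs are illustrative placeholders you would tune to your own baseline.

```python
def leading_signals(complaint_rate, open_rate_slope_pp_wk,
                    provider_rates, hard_bounce_slope_pp_wk):
    """Return human-readable warnings for the four leading indicators.

    complaint_rate: user-reported spam, percent of delivered mail
    open_rate_slope_pp_wk: open-rate trend, percentage points per week
    provider_rates: {provider: current placement %}
    hard_bounce_slope_pp_wk: hard-bounce trend, percentage points per week
    """
    warnings = []
    if complaint_rate > 0.3:
        warnings.append('complaint rate already harmful (>0.3%)')
    elif complaint_rate > 0.1:
        warnings.append('complaint rate concerning (>0.1%)')
    if open_rate_slope_pp_wk < -0.5:          # illustrative cutoff
        warnings.append('engagement decaying')
    if max(provider_rates.values()) - min(provider_rates.values()) > 10:
        warnings.append('per-provider divergence')
    if hard_bounce_slope_pp_wk > 0.1:         # illustrative cutoff
        warnings.append('bounce rate creeping up')
    return warnings

# Example: Outlook lagging Gmail by 18 points, complaints at 0.15%
flags = leading_signals(0.15, -0.2, {'gmail': 88, 'outlook': 70}, 0.0)
print(flags)  # → ['complaint rate concerning (>0.1%)', 'per-provider divergence']
```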
How often to run the check
Degradation moves on a multi-week timescale. A nightly regression check introduces noise without signal. Run it every Sunday at midnight and alert if the slope crossed the threshold that week. Your on-call rotation will thank you.
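As a crontab entry, the weekly cadence looks like this (the script path is a placeholder for wherever the regression check above lives):

```shell
# Sunday 00:00, server local time: run the weekly trend check
0 0 * * 0  /usr/bin/python3 /opt/monitor/trend_check.py
```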
Intervention playbook
When the degradation alert fires, here is the investigation order:
- Cohort the placement by template. If one template dominates the drift, it is a content issue.
- Cohort by audience segment. If one segment (say, users who signed up more than 18 months ago and have not opened in 90 days) is driving the drop, it is a list hygiene issue.
- Check volume trajectory. If send volume grew 40% over the same window, you may have exceeded the reputation ceiling of your current setup.
- Check DNS and ESP changes. Review the git log or change log for the last 8 weeks: was anything on the sending path touched?
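The first playbook step, cohorting placement by template, is a one-query job if your placements table carries a template identifier. A sketch, using an in-memory database with a hypothetical `template_id` column and made-up rows (adjust to your actual schema):

```python
import sqlite3, time

# Toy schema mirroring the placements table used earlier,
# plus an assumed template_id column
db = sqlite3.connect(':memory:')
db.execute("""CREATE TABLE placements
              (ts INTEGER, domain TEXT, template_id TEXT,
               inbox INTEGER, total INTEGER)""")
now = int(time.time() * 1000)
db.executemany("INSERT INTO placements VALUES (?,?,?,?,?)", [
    (now, 'acme.io', 'welcome', 90, 100),
    (now, 'acme.io', 'digest', 55, 100),   # this template is dragging placement down
])

# Worst template first: if one cohort dominates the drift, it is a content issue
rows = db.execute("""
    SELECT template_id, AVG(100.0 * inbox / total) AS rate
    FROM placements
    WHERE domain = 'acme.io' AND ts > ?
    GROUP BY template_id
    ORDER BY rate ASC
""", (now - 30 * 86400 * 1000,)).fetchall()
print(rows)  # → [('digest', 55.0), ('welcome', 90.0)]
```

The same shape of query handles the second step: swap `template_id` for a segment column (signup cohort, last-open bucket) to separate content issues from list hygiene issues.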