A deliverability tool whose own operators do not use it on their own mail is a dubious thing. We use ours. Every piece of mail this company sends goes through an automated placement check, every day, and the results feed a small set of dashboards that are the second thing we look at in the morning after the status page. This is how that works, in enough detail that you can copy the pattern.
The mail we send ourselves
The outbound traffic we care about falls into four categories:
- Transactional notifications. Password resets, incident alerts, API-key rotation reminders, test-complete emails for customers who request notify-me-when-done. Low volume, high expectation: if these land in Spam the customer assumes the product is broken.
- The monthly digest. Every subscriber gets a once-a-month summary of placement trends they care about. Medium volume, bulk-sender rules apply.
- Product announcements. A few a year. Medium volume, spikier pattern.
- Operational mail to ourselves. Canary outputs, cron failure alerts, weekly operational summaries to the team. Very low volume, but if this mail stops arriving, the absence is itself a signal we care about.
Daily automated placement tests
Every morning at 06:00 UTC a scheduled job runs a placement test for each of the categories above using real representative content. The test is fired through our production sending infrastructure and seeded into our own seed pool, the same way any customer's test is. The result is written to a dedicated "dogfood" tenant and becomes a time series in Grafana.
Running the test through production (rather than through a staging path) is important. Production is where the domain reputation lives; a staging send would give us a clean but unrepresentative signal.
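The scheduled job itself is deliberately boring: a loop over the four categories. A minimal sketch, where `run_placement_test` and `record_result` are hypothetical stand-ins for the real production send path and the dogfood tenant's result store:

```python
# Sketch of the 06:00 UTC dogfood job. run_placement_test and
# record_result are hypothetical stand-ins, not our real API.

CATEGORIES = ["transactional", "digest", "announcement", "operational"]

def run_placement_test(category):
    # In production: send real representative content through the
    # live sending infrastructure into the seed pool. Stubbed here.
    return {"category": category, "inbox_rate": 0.99}

def record_result(result):
    # In production: write to the dogfood tenant, which Grafana reads.
    print(result)

def daily_dogfood_run():
    results = {}
    for category in CATEGORIES:
        result = run_placement_test(category)
        record_result(result)
        results[category] = result
    return results
```

The one non-obvious decision is hidden inside `run_placement_test`: it must call the production send path, for the reason given above.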
Thresholds we alert on
Three thresholds drive our alerting. Each one triggers a different severity.
Inbox rate
Per sender category, across the seed pool:
- Transactional: below 95% Inbox across the top 5 providers triggers sev2. Below 85% triggers sev1.
- Monthly digest: below 80% Inbox on the run immediately before a planned send triggers sev2 and pauses the send until an operator clears it.
- Announcements: same as digest.
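These thresholds reduce to a small lookup. A sketch of the classification, with the numbers above hard-coded (the function name is ours for illustration):

```python
def classify_inbox_rate(category, inbox_rate):
    """Map a measured inbox rate (0.0-1.0) to an alert severity.

    Thresholds mirror the ones above: transactional alerts below
    95% (sev2) and 85% (sev1); digest and announcement runs alert
    below 80% (sev2) in the pre-send check, which also pauses the
    scheduled send. Returns None when no alert is warranted.
    """
    if category == "transactional":
        if inbox_rate < 0.85:
            return "sev1"
        if inbox_rate < 0.95:
            return "sev2"
    elif category in ("digest", "announcement"):
        if inbox_rate < 0.80:
            return "sev2"  # also blocks the scheduled send
    return None
```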
DMARC fail rate
We publish a strict p=reject DMARC policy and collect aggregate reports hourly. Any non-zero fail count that is not our own known canary-sender is a potential misconfiguration or a spoof attempt. Above 1% fail rate on aggregate volume is an immediate sev1 — we stop sending until we understand why.
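The hourly check boils down to counting failing volume while excluding the known canary sender. A sketch, assuming the aggregate-report rows are already parsed into dicts (the field names and the canary IP are illustrative, not the RUA XML schema):

```python
def dmarc_fail_rate(rows, canary_ip="192.0.2.10"):
    """Compute the DMARC fail rate from parsed aggregate-report rows.

    Each row: {"source_ip": str, "count": int, "dmarc_pass": bool}.
    Rows from the known canary sender are excluded; canary_ip is an
    illustrative placeholder.
    """
    total = 0
    failed = 0
    for row in rows:
        if row["source_ip"] == canary_ip:
            continue
        total += row["count"]
        if not row["dmarc_pass"]:
            failed += row["count"]
    return failed / total if total else 0.0

def is_sev1(rows):
    # Any fail rate above 1% stops sending until understood.
    return dmarc_fail_rate(rows) > 0.01
```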
DNSBL listing
We check our sending IPs and domain against Spamhaus ZEN, Spamhaus DBL, SURBL and a handful of regional lists every 15 minutes. A single listing anywhere is a sev1. We have been listed exactly once in two years and it was resolved inside an hour; see the status-page article.
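Mechanically, an IP-based DNSBL check is a plain DNS lookup: reverse the address's octets, append the list's zone, and an A record in 127.0.0.0/8 means listed (NXDOMAIN means clean). A sketch of the query construction, with the actual DNS resolution omitted so the example stays self-contained:

```python
import ipaddress

def dnsbl_query_name(ip, zone):
    """Build the DNSBL lookup name for an IPv4 sending address,
    e.g. 192.0.2.1 against zen.spamhaus.org becomes
    1.2.0.192.zen.spamhaus.org."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone

def is_listed_response(answer):
    # Listed addresses resolve to 127.0.0.x; the specific last octet
    # encodes which sub-list matched. No answer (NXDOMAIN) is clean.
    return answer is not None and answer.startswith("127.")
```

Domain lists such as Spamhaus DBL are queried the same way but with the domain name itself prepended to the zone rather than reversed octets.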
How alerts route
Alerts go to Slack for everything, and to PagerDuty for sev1. The rule is simple: if a placement problem is severe enough that waiting until tomorrow would make it worse (domain reputation is time-sensitive), a human gets woken up. Otherwise the alert sits in Slack with a thread and an SLA of "next business morning".
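The routing rule is mechanical enough to write down. A sketch (channel names and structure are illustrative):

```python
# Routing table: every severity goes to Slack; only sev1 pages.
ROUTES = {
    "sev1": {"channels": ("slack", "pagerduty"),
             "ack_target": "15 minutes, immediate"},
    "sev2": {"channels": ("slack",),
             "ack_target": "4 hours in business hours"},
    "sev3": {"channels": ("slack",),
             "ack_target": "next business morning"},
}

def route_alert(severity):
    """Everything goes to Slack; only sev1 wakes a human."""
    return ROUTES[severity]
```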
Severity matrix

| Severity | Triggers | Route and ack target |
| --- | --- | --- |
| sev1 | DNSBL listing; DMARC fail rate > 1%; transactional inbox rate < 85% | pager, immediate, ack target 15 minutes |
| sev2 | transactional inbox rate < 95%; digest inbox rate < 80% in pre-send check; any seed provider marked degraded | Slack, ack target 4 hours in business hours |
| sev3 | SPF record changed upstream; single-provider drop of 5+ percentage points | Slack, triage at next business morning |

The runbook when placement drops
- Confirm it is real. Re-run the placement test immediately. One-off dips happen; a second confirmed drop is a signal.
- Check authentication. Has SPF changed? Is DKIM still signing with the right selector? DMARC still aligning? An upstream change from a sending vendor accounts for more of our real incidents than anything else.
- Check DNSBL listings. If we are listed, file the delisting request immediately and stop sending non-essential mail until it clears.
- Check content. Did marketing ship a new template last night? Did an incident email go out with a suspicious-looking link? A content change is the second most common real cause.
- Check provider-side. Is Gmail Postmaster showing reduced domain reputation? Is Microsoft SNDS showing the IP as yellow? If one provider's reputation moved, that is the story.
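Once step one has confirmed the drop is real, the remaining checks run in a fixed order and the first failure is the lead suspect. A sketch of that driver, with the check functions stubbed (in the real runbook they would verify SPF/DKIM/DMARC, query DNSBLs, diff recent templates, and read the provider dashboards):

```python
def triage(checks):
    """Run the runbook checks in order; return the first failing one.

    checks is an ordered list of (name, fn) pairs, where fn() returns
    True when that layer looks healthy. Returns None when everything
    is healthy, i.e. the dip was a one-off.
    """
    for name, fn in checks:
        if not fn():
            return name  # first failing layer is the lead suspect
    return None

# Hypothetical usage: the authentication layer is the one that fails.
checks = [
    ("authentication", lambda: False),  # e.g. DKIM selector mismatch
    ("dnsbl", lambda: True),
    ("content", lambda: True),
    ("provider-reputation", lambda: True),
]
```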
When placement is degraded, non-essential outbound is paused immediately. Sending more mail into a reputation problem is the fastest way to turn a bad day into a bad quarter.
Incident one: the transactional dip
Six weeks ago, transactional inbox rate on Gmail dropped from 99% to 83% overnight. Sev1 fired at 06:04 UTC. An engineer was on the keyboard at 06:18.
Root cause: our transactional ESP had quietly rotated a shared IP range, and one of the new IPs had been used by a spammer previously. Our SPF include covered the new range, so authentication still passed, but Gmail's IP-level reputation was bad.
Fix: asked the ESP to move us to a different IP; paused non-essential transactional mail for 36 hours; sent only to the most engaged recipients during a controlled warmup. Recovery to 99% took nine days. The full postmortem is on the status page.
Incident two: the SendGrid SPF lookup-limit breach
Earlier this year our aggregate DMARC reports showed a spike of SPF failures — around 4% of volume, well over the 1% threshold. Sev1 fired.
Root cause: we had drifted over SPF's 10-DNS-lookup limit by adding a new sending vendor without pruning an old one. SPF evaluated as permerror at some receivers, so those messages could not produce an aligned SPF pass and were counted as failures in the aggregate reports.
Fix: consolidated the SPF record to stay under 10 lookups; deprecated the old vendor's include. DMARC fail rate returned to baseline within a day. Lesson: we now alert on SPF lookup count as part of our DNS health monitor, not just on the downstream DMARC symptom.
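The lookup-count alert is a simple parse of the published SPF record: the mechanisms `include`, `a`, `mx`, `ptr`, `exists` and the `redirect` modifier each consume one of the 10 permitted DNS lookups (RFC 7208), while `ip4`/`ip6`/`all` cost nothing. A sketch (it does not recurse into included records, so it undercounts nested setups, which is still enough to alert before the limit):

```python
# Terms that each consume one of SPF's 10 DNS lookups (RFC 7208).
LOOKUP_TERMS = ("include", "a", "mx", "ptr", "exists", "redirect")

def spf_lookup_count(record):
    """Count DNS-lookup-consuming terms in a flat SPF record string."""
    count = 0
    for term in record.split():
        term = term.lstrip("+-~?")  # strip the optional qualifier
        # Reduce "include:_spf.x.com", "redirect=x.com", "a/24"
        # down to the bare mechanism/modifier name.
        name = term.split(":", 1)[0].split("=", 1)[0].split("/", 1)[0]
        if name in LOOKUP_TERMS:
            count += 1
    return count

def spf_over_limit(record, limit=10):
    return spf_lookup_count(record) > limit
```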
What we learned about our own product
Using the product on our own mail surfaced things our customers had been politely asking for but we had not prioritised:
- Alerts on per-provider drops, not just aggregates. A 5-point drop on Gmail is signal; averaged with Outlook it vanishes.
- Historical trend lines per provider, at least 90 days back, so a slow drift is visible.
- Pre-send checks that actually block a scheduled campaign when the pre-send placement is below threshold, rather than just warning.
All three are in the product now because our own incidents demanded them.
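The aggregation problem behind the first bullet is easy to demonstrate numerically: a 6-point Gmail drop can all but disappear inside a volume-weighted average. A sketch (all numbers are illustrative):

```python
def per_provider_drops(before, after, threshold=5.0):
    """Return providers whose inbox rate fell by >= threshold points."""
    return [p for p in before
            if before[p] - after.get(p, 0.0) >= threshold]

def weighted_average(rates, volumes):
    total = sum(volumes.values())
    return sum(rates[p] * volumes[p] for p in rates) / total

# Illustrative numbers: Gmail drops 6 points, Outlook is flat,
# and Gmail carries 30% of the volume.
before = {"gmail": 99.0, "outlook": 97.0}
after = {"gmail": 93.0, "outlook": 97.0}
volumes = {"gmail": 30, "outlook": 70}
```

With these numbers the per-provider check flags Gmail, while the aggregate moves by only 1.8 points and would never cross a 5-point threshold.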