A deliverability tool whose own operators do not use it on their own mail is a dubious thing. We use ours. Every piece of mail this company sends goes through an automated placement check, every day, and the results feed a small set of dashboards that are the second thing we look at in the morning after the status page. This is how that works, in enough detail that you can copy the pattern.
The mail we send ourselves
The outbound traffic we care about falls into four categories:
- Transactional notifications. Password resets, incident alerts, API-key rotation reminders, test-complete emails for customers who request notify-me-when-done. Low volume, high expectation: if these land in Spam the customer assumes the product is broken.
- The monthly digest. Every subscriber gets a once-a-month summary of placement trends they care about. Medium volume, bulk-sender rules apply.
- Product announcements. A few a year. Medium volume, spikier pattern.
- Operational mail to ourselves. Canary outputs, cron failure alerts, weekly operational summaries to the team. Very low volume, but if this mail stops arriving, the absence is itself a signal we care about.
Daily automated placement tests
Every morning at 06:00 UTC a scheduled job runs a placement test for each of the categories above using real representative content. The test is fired through our production sending infrastructure and seeded into our own seed pool, the same way any customer's test is. The result is written to a dedicated "dogfood" tenant and becomes a time series in Grafana.
Running the test through production (rather than through a staging path) is important. Production is where the domain reputation lives; a staging send would give us a clean but unrepresentative signal.
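The scheduled job itself is deliberately boring: a loop over the four categories. A minimal sketch, where `run_placement_test` and `record_result` are hypothetical stand-ins for the real production send path and the dogfood tenant's result store:

```python
# Sketch of the 06:00 UTC dogfood job. run_placement_test and
# record_result are hypothetical stand-ins, not our real API.

CATEGORIES = ["transactional", "digest", "announcement", "operational"]

def run_placement_test(category):
    # In production: send real representative content through the
    # live sending infrastructure into the seed pool. Stubbed here.
    return {"category": category, "inbox_rate": 0.99}

def record_result(result):
    # In production: write to the dogfood tenant, which Grafana reads.
    print(result)

def daily_dogfood_run():
    results = {}
    for category in CATEGORIES:
        result = run_placement_test(category)
        record_result(result)
        results[category] = result
    return results
```

The one non-obvious decision is hidden inside `run_placement_test`: it must call the production send path, for the reason given above.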
Thresholds we alert on
Three thresholds drive our alerting. Each one triggers a different severity.
Inbox rate
Per sender category, across the seed pool:
- Transactional: below 95% Inbox across the top 5 providers triggers sev2. Below 85% triggers sev1.
- Monthly digest: below 80% Inbox on the run immediately before a planned send triggers sev2 and pauses the send until an operator clears it.
- Announcements: same as digest.
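These thresholds reduce to a small lookup. A sketch of the classification, with the numbers above hard-coded (the function name is ours for illustration):

```python
def classify_inbox_rate(category, inbox_rate):
    """Map a measured inbox rate (0.0-1.0) to an alert severity.

    Thresholds mirror the ones above: transactional alerts below
    95% (sev2) and 85% (sev1); digest and announcement runs alert
    below 80% (sev2) in the pre-send check, which also pauses the
    scheduled send. Returns None when no alert is warranted.
    """
    if category == "transactional":
        if inbox_rate < 0.85:
            return "sev1"
        if inbox_rate < 0.95:
            return "sev2"
    elif category in ("digest", "announcement"):
        if inbox_rate < 0.80:
            return "sev2"  # also blocks the scheduled send
    return None
```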
DMARC fail rate
We publish a strict p=reject DMARC policy and collect aggregate reports hourly. Any non-zero fail count that is not our own known canary-sender is a potential misconfiguration or a spoof attempt. Above 1% fail rate on aggregate volume is an immediate sev1 — we stop sending until we understand why.
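The hourly check boils down to counting failing volume while excluding the known canary sender. A sketch, assuming the aggregate-report rows are already parsed into dicts (the field names and the canary IP are illustrative, not the RUA XML schema):

```python
def dmarc_fail_rate(rows, canary_ip="192.0.2.10"):
    """Compute the DMARC fail rate from parsed aggregate-report rows.

    Each row: {"source_ip": str, "count": int, "dmarc_pass": bool}.
    Rows from the known canary sender are excluded; canary_ip is an
    illustrative placeholder.
    """
    total = 0
    failed = 0
    for row in rows:
        if row["source_ip"] == canary_ip:
            continue
        total += row["count"]
        if not row["dmarc_pass"]:
            failed += row["count"]
    return failed / total if total else 0.0

def is_sev1(rows):
    # Any fail rate above 1% stops sending until understood.
    return dmarc_fail_rate(rows) > 0.01
```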
DNSBL listing
We check our sending IPs and domain against Spamhaus ZEN, Spamhaus DBL, SURBL and a handful of regional lists every 15 minutes. A single listing anywhere is a sev1. We have been listed exactly once in two years and it was resolved inside an hour; see the status-page article.
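Mechanically, an IP-based DNSBL check is a plain DNS lookup: reverse the address's octets, append the list's zone, and an A record in 127.0.0.0/8 means listed (NXDOMAIN means clean). A sketch of the query construction, with the actual DNS resolution omitted so the example stays self-contained:

```python
import ipaddress

def dnsbl_query_name(ip, zone):
    """Build the DNSBL lookup name for an IPv4 sending address,
    e.g. 192.0.2.1 against zen.spamhaus.org becomes
    1.2.0.192.zen.spamhaus.org."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone

def is_listed_response(answer):
    # Listed addresses resolve to 127.0.0.x; the specific last octet
    # encodes which sub-list matched. No answer (NXDOMAIN) is clean.
    return answer is not None and answer.startswith("127.")
```

Domain lists such as Spamhaus DBL are queried the same way but with the domain name itself prepended to the zone rather than reversed octets.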
How alerts route
Alerts go to Slack for everything, and to PagerDuty for sev1. The rule is simple: if a placement problem is severe enough that waiting until tomorrow would make it worse (domain reputation is time-sensitive), a human gets woken up. Otherwise the alert sits in Slack with a thread and an SLA of "next business morning".
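The routing rule is mechanical enough to write down. A sketch (channel names and structure are illustrative):

```python
# Routing table: every severity goes to Slack; only sev1 pages.
ROUTES = {
    "sev1": {"channels": ("slack", "pagerduty"),
             "ack_target": "15 minutes, immediate"},
    "sev2": {"channels": ("slack",),
             "ack_target": "4 hours in business hours"},
    "sev3": {"channels": ("slack",),
             "ack_target": "next business morning"},
}

def route_alert(severity):
    """Everything goes to Slack; only sev1 wakes a human."""
    return ROUTES[severity]
```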
Severity matrix

| Severity | Triggers | Route and ack target |
| --- | --- | --- |
| sev1 | DNSBL listing; DMARC fail rate > 1%; transactional inbox rate < 85% | pager, immediate, ack target 15 minutes |
| sev2 | transactional inbox rate < 95%; digest inbox rate < 80% in pre-send check; any seed provider marked degraded | Slack, ack target 4 hours in business hours |
| sev3 | SPF record changed upstream; single-provider drop of 5+ percentage points | Slack, triage at next business morning |

The runbook when placement drops
- Confirm it is real. Re-run the placement test immediately. One-off dips happen; a second confirmed drop is a signal.
- Check authentication. Has SPF changed? Is DKIM still signing with the right selector? DMARC still aligning? An upstream change from a sending vendor accounts for more of our real incidents than anything else.
- Check DNSBL listings. If we are listed, file the delisting request immediately and stop sending non-essential mail until it clears.
- Check content. Did marketing ship a new template last night? Did an incident email go out with a suspicious-looking link? A content change is the second most common real cause.
- Check provider-side. Is Gmail Postmaster showing reduced domain reputation? Is Microsoft SNDS showing the IP as yellow? If one provider's reputation moved, that is the story.
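Once step one has confirmed the drop is real, the remaining checks run in a fixed order and the first failure is the lead suspect. A sketch of that driver, with the check functions stubbed (in the real runbook they would verify SPF/DKIM/DMARC, query DNSBLs, diff recent templates, and read the provider dashboards):

```python
def triage(checks):
    """Run the runbook checks in order; return the first failing one.

    checks is an ordered list of (name, fn) pairs, where fn() returns
    True when that layer looks healthy. Returns None when everything
    is healthy, i.e. the dip was a one-off.
    """
    for name, fn in checks:
        if not fn():
            return name  # first failing layer is the lead suspect
    return None

# Hypothetical usage: the authentication layer is the one that fails.
checks = [
    ("authentication", lambda: False),  # e.g. DKIM selector mismatch
    ("dnsbl", lambda: True),
    ("content", lambda: True),
    ("provider-reputation", lambda: True),
]
```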
When placement is degraded, non-essential outbound is paused immediately. Sending more mail into a reputation problem is the fastest way to turn a bad day into a bad quarter.
Incident one: the transactional dip
Six weeks ago, transactional inbox rate on Gmail dropped from 99% to 83% overnight. Sev1 fired at 06:04 UTC. An engineer was on the keyboard at 06:18.
Root cause: our transactional ESP had quietly rotated a shared IP range, and one of the new IPs had been used by a spammer previously. Our SPF include covered the new range, so authentication still passed, but Gmail's IP-level reputation was bad.
Fix: asked the ESP to move us to a different IP; paused non-essential transactional mail for 36 hours; sent only to the most engaged recipients during a controlled warmup. Recovery to 99% took nine days. The full postmortem is on the status page.
Incident two: the SendGrid SPF lookup-limit breach
Earlier this year our aggregate DMARC reports showed a spike of SPF failures — around 4% of volume, well over the 1% threshold. Sev1 fired.
Root cause: we had drifted over SPF's 10-DNS-lookup limit by adding a new sending vendor without pruning an old one. SPF evaluated as permerror at some receivers, so those messages could not produce an aligned SPF pass and were counted as failures in the aggregate reports.
Fix: consolidated the SPF record to stay under 10 lookups; deprecated the old vendor's include. DMARC fail rate returned to baseline within a day. Lesson: we now alert on SPF lookup count as part of our DNS health monitor, not just on the downstream DMARC symptom.
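The lookup-count alert is a simple parse of the published SPF record: the mechanisms `include`, `a`, `mx`, `ptr`, `exists` and the `redirect` modifier each consume one of the 10 permitted DNS lookups (RFC 7208), while `ip4`/`ip6`/`all` cost nothing. A sketch (it does not recurse into included records, so it undercounts nested setups, which is still enough to alert before the limit):

```python
# Terms that each consume one of SPF's 10 DNS lookups (RFC 7208).
LOOKUP_TERMS = ("include", "a", "mx", "ptr", "exists", "redirect")

def spf_lookup_count(record):
    """Count DNS-lookup-consuming terms in a flat SPF record string."""
    count = 0
    for term in record.split():
        term = term.lstrip("+-~?")  # strip the optional qualifier
        # Reduce "include:_spf.x.com", "redirect=x.com", "a/24"
        # down to the bare mechanism/modifier name.
        name = term.split(":", 1)[0].split("=", 1)[0].split("/", 1)[0]
        if name in LOOKUP_TERMS:
            count += 1
    return count

def spf_over_limit(record, limit=10):
    return spf_lookup_count(record) > limit
```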
What we learned about our own product
Using the product on our own mail surfaced things our customers had been politely asking for but we had not prioritised:
- Alerts on per-provider drops, not just aggregates. A 5-point drop on Gmail is signal; averaged with Outlook it vanishes.
- Historical trend lines per provider, at least 90 days back, so a slow drift is visible.
- Pre-send checks that actually block a scheduled campaign when the pre-send placement is below threshold, rather than just warning.
All three are in the product now because our own incidents demanded them.
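The aggregation problem behind the first bullet is easy to demonstrate numerically: a 6-point Gmail drop can all but disappear inside a volume-weighted average. A sketch (all numbers are illustrative):

```python
def per_provider_drops(before, after, threshold=5.0):
    """Return providers whose inbox rate fell by >= threshold points."""
    return [p for p in before
            if before[p] - after.get(p, 0.0) >= threshold]

def weighted_average(rates, volumes):
    total = sum(volumes.values())
    return sum(rates[p] * volumes[p] for p in rates) / total

# Illustrative numbers: Gmail drops 6 points, Outlook is flat,
# and Gmail carries 30% of the volume.
before = {"gmail": 99.0, "outlook": 97.0}
after = {"gmail": 93.0, "outlook": 97.0}
volumes = {"gmail": 30, "outlook": 70}
```

With these numbers the per-provider check flags Gmail, while the aggregate moves by only 1.8 points and would never cross a 5-point threshold.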