Uptime monitoring is a solved problem. Pingdom in 2006, StatusCake and UptimeRobot in 2012, BetterStack and Cronitor now. You pay a few dollars a month, point it at your site, and page on-call when the green dot goes red.
Email has no equivalent. Your ESP dashboard shows "delivered" when the destination MTA accepts the message. "Delivered" does not mean inbox. "Delivered" includes spam folder. Your status page stays green while half your customers never see the password reset mail.
Website uptime is about reachability. Email "uptime" is about placement. The two require completely different instrumentation. A 200 OK on /health tells you nothing about whether Gmail trusts your domain today.
What does "email up" even mean?
Four things have to hold for a recipient to see your mail in their inbox:
1. Your SMTP outbound works. Your application server (or your ESP) can connect to the internet and deliver to destination MTAs.
2. Authentication passes. SPF, DKIM, and DMARC all evaluate to pass at the recipient's MTA.
3. The MTA accepts the message. Not rejected for bad reputation, bad content, volume anomaly, or trigger-word filters.
4. The mail lands in the inbox, not the spam/junk/promotions folder.
Standard monitoring catches (1) and sometimes (3). It never catches (2) as a real-time signal, and it never, ever catches (4). Your ESP dashboard also only catches (3).
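Of the four, the authentication condition is the cheapest to spot-check from the outside. A minimal sketch, assuming `dig` is installed and using a placeholder domain; it only verifies that the SPF and DMARC records exist (checking DKIM also needs the selector name), not that they evaluate to pass at a real MTA:

```shell
#!/usr/bin/env bash
# Placeholder domain; substitute your real sending domain.
DOMAIN="${1:-mail.acme.io}"

# fetch_txt NAME -> concatenated TXT payload, empty if the lookup fails
fetch_txt() { dig +short +time=2 +tries=1 TXT "$1" 2>/dev/null | tr -d '"' || true; }

# has_tag TEXT TAG -> success if TAG appears anywhere in TEXT
has_tag() { case "$1" in *"$2"*) return 0 ;; *) return 1 ;; esac; }

spf=$(fetch_txt "$DOMAIN")
dmarc=$(fetch_txt "_dmarc.$DOMAIN")

has_tag "$spf" "v=spf1"     && echo "SPF: present"   || echo "SPF: MISSING"
has_tag "$dmarc" "v=DMARC1" && echo "DMARC: present" || echo "DMARC: MISSING"
```

This catches the most common authentication failure (a record deleted or mangled during a DNS change) within one check interval, long before reputation damage shows up in placement numbers.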
The minimum viable email-up system
Two components: a periodic synthetic sender, and a placement probe.
- Synthetic sender. A cron or scheduled worker that sends a real message through your real production sending path every 15 minutes (or hourly for low-volume senders). Not a health check, a real message.
- Placement probe. A seed mailbox panel at Gmail, Outlook, Yahoo, and so on. Something reads those mailboxes and reports inbox/spam/missing per provider.
The probe is what you would otherwise build yourself. It is a lot of infrastructure — 20+ mailboxes, provider auth, scraping without breaking terms, rotation — which is why most teams either buy or skip it. Skipping it means flying blind.
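For teams who do build it, the core of the probe is small: search each seed mailbox for the synthetic message and classify the result per provider. A sketch; the classification logic is the real part, while the commented fetch (via curl's IMAP support) uses placeholder hosts, credentials, and folder names, and providers like Gmail require an app password:

```shell
#!/usr/bin/env bash
# classify INBOX_HITS SPAM_HITS -> where the synthetic message landed
classify() {
  if   [ "$1" -gt 0 ]; then echo "inbox"
  elif [ "$2" -gt 0 ]; then echo "spam"
  else                      echo "missing"
  fi
}

# Hypothetical probe of one seed mailbox, matching the synthetic sender's
# subject line (host, creds, and folder names are placeholders):
#   hits() { curl -s --url "imaps://imap.example.com/$1" -u "$SEED_USER:$SEED_PASS" \
#              -X "SEARCH SUBJECT \"Uptime check\"" | grep -c '[0-9]'; }
#   classify "$(hits INBOX)" "$(hits Junk)"
```

The expensive part is everything around this loop: 20+ mailboxes, per-provider auth quirks, and rotation, which is the infrastructure the hosted probes sell.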
A $0 version you can run today
Use the Inbox Check free API as the placement probe. One cron, one shell script, 15 minutes of setup.
```bash
#!/usr/bin/env bash
# /usr/local/bin/email_uptime_check.sh
set -euo pipefail

API="https://check.live-direct-marketing.online/api"
KEY="${INBOX_CHECK_API_KEY}"
DOMAIN="mail.acme.io"

# Create a synthetic test via our real sending infrastructure
RESP=$(curl -s -X POST "$API/check" \
  -H "Authorization: Bearer $KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"senderDomain\": \"$DOMAIN\",
    \"subject\": \"Uptime check\",
    \"html\": \"<p>ok</p>\"
  }")
TEST_ID=$(echo "$RESP" | jq -r '.id')

# Poll for the result (tests take ~2-5 minutes); give up after 5 minutes
STATUS=""
for _ in {1..30}; do
  STATUS=$(curl -s "$API/check/$TEST_ID" -H "Authorization: Bearer $KEY" | jq -r '.status')
  [[ "$STATUS" == "complete" ]] && break
  sleep 10
done
[[ "$STATUS" == "complete" ]] || { echo "check $TEST_ID never completed" >&2; exit 1; }

RATE=$(curl -s "$API/check/$TEST_ID" -H "Authorization: Bearer $KEY" | jq -r '.summary.inboxRate')

# Write to a status file your status page can read
echo "{\"domain\":\"$DOMAIN\",\"rate\":$RATE,\"ts\":\"$(date -Iseconds)\"}" \
  > /var/www/status/email-latest.json
```

```
# /etc/cron.d/email_uptime
*/15 * * * * monitor /usr/local/bin/email_uptime_check.sh
```

Surfacing it on your status page
Status-page tools (Statuspage.io, Instatus, Cachet) accept external components via webhooks or API. Post the rate as a metric, and configure a component status rule:
- Operational — inbox rate 90%+
- Degraded performance — inbox rate 75–89%
- Partial outage — inbox rate 50–74%
- Major outage — inbox rate below 50%
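Those thresholds translate directly into a component-update step. A sketch assuming Atlassian Statuspage's component API (other tools differ; check your provider's docs), with the page ID, component ID, token, and rate-file path all as placeholders:

```shell
#!/usr/bin/env bash
# Map an inbox rate (0-100, integer) to a Statuspage component status,
# mirroring the thresholds above.
component_status() {
  local rate=$1
  if   [ "$rate" -ge 90 ]; then echo "operational"
  elif [ "$rate" -ge 75 ]; then echo "degraded_performance"
  elif [ "$rate" -ge 50 ]; then echo "partial_outage"
  else                          echo "major_outage"
  fi
}

# Push the status only when credentials are configured (env vars and the
# rate file written by the check script are placeholders).
if [ -n "${STATUSPAGE_TOKEN:-}" ]; then
  RATE=$(jq -r '.rate' /var/www/status/email-latest.json)
  curl -s -X PATCH \
    "https://api.statuspage.io/v1/pages/${STATUSPAGE_PAGE_ID}/components/${STATUSPAGE_COMPONENT_ID}" \
    -H "Authorization: OAuth ${STATUSPAGE_TOKEN}" \
    -H "Content-Type: application/json" \
    -d "{\"component\": {\"status\": \"$(component_status "$RATE")\"}}"
fi
```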
Give the component a distinct name — "Transactional email placement," not just "Email" — so users understand that the failure mode is "delivered, but to spam," not mail failing outright.
Inbox placement is a lagging indicator. A domain can look fine at 10:00 and be degraded at 12:00 without any step-change; the reputation at Gmail and Outlook moves on the order of hours, not seconds. Set incident thresholds conservatively (30+ minutes of sustained low rate), or you will alert-fatigue yourself out of taking real signals seriously.
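One way to encode that conservatism is a rolling sample file (one integer inbox rate per line, newest appended last, which the cron job can maintain) plus an alert gate that requires the whole window to be low. A sketch:

```shell
#!/usr/bin/env bash
# sustained_low FILE THRESHOLD WINDOW
# Succeeds only if the last WINDOW samples all sit below THRESHOLD,
# i.e. the degradation has persisted rather than blipped.
sustained_low() {
  local file=$1 threshold=$2 window=$3
  local recent
  recent=$(tail -n "$window" "$file")
  [ -z "$recent" ] && return 1
  # Refuse to alert before a full window of samples exists
  [ "$(printf '%s\n' "$recent" | wc -l)" -lt "$window" ] && return 1
  # Any single healthy sample resets the incident clock
  while read -r rate; do
    [ "$rate" -ge "$threshold" ] && return 1
  done <<< "$recent"
  return 0
}
```

With 15-minute checks, a window of 3 means roughly 30-45 minutes of sustained degradation before anything fires.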
What an incident actually looks like
When email placement drops hard, here is what you will typically find in order of decreasing frequency:
- DNS record regression. SPF, DKIM, or DMARC was changed recently by somebody who did not know what they were doing.
- Content regression. A new template has trigger words, bad HTML, or a suspicious redirect.
- IP reputation hit. Shared IP pool took on a noisy neighbour; dedicated IP got listed somewhere.
- Sending volume anomaly. Batch job pumped 10x normal volume overnight; the MTAs throttled.
- Recipient feedback loop. A wave of spam complaints from a recent campaign tanked reputation for downstream mail.
A runbook with these five items, in order, answers 90% of incidents.
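The volume-anomaly item is cheap to pre-script too. A sketch; the send-log format in the comment is hypothetical, and 3x is an arbitrary alarm ratio to tune for your traffic shape:

```shell
#!/usr/bin/env bash
# volume_anomaly LAST_HOUR TRAILING_HOURLY_AVG
# Succeeds when the last hour sent at least 3x the trailing hourly average.
volume_anomaly() {
  local last=$1 avg=$2
  [ "$avg" -gt 0 ] || return 1
  [ "$last" -ge $((avg * 3)) ]
}

# Hypothetical usage against a send log with one ISO-8601-prefixed line
# per message:
#   last_hour=$(grep -c "^$(date -u +%Y-%m-%dT%H)" /var/log/app/sends.log)
#   day_total=$(grep -c "^$(date -u +%Y-%m-%d)" /var/log/app/sends.log)
#   volume_anomaly "$last_hour" $((day_total / 24)) && echo "volume spike"
```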