Monitoring · 9 min read

Alert on "emails going to spam" from Nagios, Zabbix, or Prometheus

You already run Nagios, Zabbix, or Prometheus. Add a check that pages on-call the moment inbox placement falls off a cliff. Here are working configs for all three.

The incremental cost of adding email placement to a monitoring stack you already operate is nearly zero: one more check, one more metric, one more alert rule. Your on-call rotation already knows what to do when things page; adding email is a config change, not a new skill.

This guide shows three working configurations — pick the one that matches your stack. All three call the same underlying API. The only difference is how each platform consumes the result.

Common building block

Every integration builds on one shell script that calls the API, reads the inbox rate from JSON, and exits with a Nagios-style code (0=OK, 1=WARN, 2=CRIT). Nagios consumes that script directly; Zabbix and Prometheus use thin variants of the same pattern.

The shared check script

Save as /usr/local/bin/check_inbox_placement and make it executable. Nagios plugin convention: print a one-line status, exit 0/1/2/3.

#!/usr/bin/env bash
# check_inbox_placement - Nagios plugin for email inbox placement
# Usage: check_inbox_placement <sender_domain> <warn_pct> <crit_pct>

set -euo pipefail

DOMAIN="$1"
WARN="$2"
CRIT="$3"
API="https://check.live-direct-marketing.online/api"
KEY="${INBOX_CHECK_API_KEY:?api key not set}"

# Fetch the most recent completed test for this domain.
# Handle curl failure explicitly: with set -e, an unhandled failure
# would kill the script with an arbitrary exit code instead of UNKNOWN.
RESP=$(curl -s --max-time 15 \
  "$API/check/latest?domain=$DOMAIN" \
  -H "Authorization: Bearer $KEY") || {
  echo "UNKNOWN - API unreachable for $DOMAIN"
  exit 3
}

RATE=$(echo "$RESP" | jq -r '.summary.inboxRate // empty')

if [[ -z "$RATE" ]]; then
  echo "UNKNOWN - no recent test for $DOMAIN"
  exit 3
fi

RATE_INT=$(printf "%.0f" "$RATE")

if (( RATE_INT < CRIT )); then
  echo "CRITICAL - $DOMAIN inbox rate $RATE_INT% (< $CRIT%) | rate=$RATE_INT%;$WARN;$CRIT;0;100"
  exit 2
elif (( RATE_INT < WARN )); then
  echo "WARNING - $DOMAIN inbox rate $RATE_INT% (< $WARN%) | rate=$RATE_INT%;$WARN;$CRIT;0;100"
  exit 1
else
  echo "OK - $DOMAIN inbox rate $RATE_INT% | rate=$RATE_INT%;$WARN;$CRIT;0;100"
  exit 0
fi

Note the trailing | rate=.... That is Nagios performance data format (label=value;warn;crit;min;max); graphing add-ons parse it, and anything that does not understand it simply ignores it.
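To make the perfdata layout concrete, here is a sketch of how a status line decomposes — the sample line and its values are illustrative, not real test output:

```shell
# A status line as the check would print it (values illustrative)
line='OK - mail.acme.io inbox rate 92% | rate=92%;80;65;0;100'

# Everything after "| " is perfdata: label=value;warn;crit;min;max
perf="${line#*| }"
IFS=';' read -r val warn crit min max <<< "$perf"

echo "${val#rate=}"   # value with unit: 92%
echo "$warn $crit"    # warn/crit thresholds: 80 65
echo "$min $max"      # gauge bounds: 0 100
```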

Nagios / Icinga

Register the command in commands.cfg and a service in your host config:

# /etc/nagios-plugins/config/inbox_placement.cfg

define command {
    command_name    check_inbox_placement
    command_line    /usr/local/bin/check_inbox_placement $ARG1$ $ARG2$ $ARG3$
}

# /etc/nagios4/conf.d/email.cfg

define service {
    use                     generic-service
    host_name               mail.acme.io
    service_description     Inbox Placement
    check_command           check_inbox_placement!mail.acme.io!80!65
    check_interval          360          ; every 6 hours
    retry_interval          60
    max_check_attempts      2
    notification_interval   1440
}

The API key goes in /etc/default/nagios4 or wherever your distro sets the Nagios environment. Never bake it into the command line — it will end up in check logs.
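On Debian/Ubuntu that environment file is typically the one below (path and mechanism vary by distro; under systemd, confirm the unit actually sources this file or add an EnvironmentFile= line to the service):

```
# /etc/default/nagios4 — sourced into the Nagios daemon environment
INBOX_CHECK_API_KEY=ic_live_xxx
```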

Zabbix

Zabbix does not use the Nagios plugin convention. Instead, it expects a raw value. Write a wrapper that outputs just the number:

# /etc/zabbix/zabbix_agentd.d/inbox_placement.conf

UserParameter=inbox.rate[*],/usr/local/bin/inbox_rate_value "$1"

#!/usr/bin/env bash
# /usr/local/bin/inbox_rate_value - print raw inbox rate for Zabbix
set -euo pipefail

DOMAIN="$1"
API="https://check.live-direct-marketing.online/api"
KEY="${INBOX_CHECK_API_KEY:?api key not set}"

curl -s --max-time 15 "$API/check/latest?domain=$DOMAIN" \
  -H "Authorization: Bearer $KEY" \
  | jq -r '.summary.inboxRate // -1'

In the Zabbix UI: create an item inbox.rate[mail.acme.io], type "Zabbix agent," numeric (float), update interval 6h. Then create a trigger:

{Template Email:inbox.rate[mail.acme.io].last()}<80 and {Template Email:inbox.rate[mail.acme.io].last()}>=0

Name: Inbox placement below 80% for mail.acme.io
Severity: Average

The >= 0 guard avoids paging on a temporary API failure where the wrapper printed -1.
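On Zabbix 5.4 and later the trigger expression syntax changed to function-first form; an equivalent trigger, assuming the same template name and item key:

```
last(/Template Email/inbox.rate[mail.acme.io])<80 and last(/Template Email/inbox.rate[mail.acme.io])>=0
```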

Prometheus

Prometheus wants a metrics endpoint, not a script call. Two options: a textfile collector for node_exporter (simplest), or a dedicated exporter (cleaner, slightly more setup).

Option A: textfile collector

Have cron write a file every 15 minutes that node_exporter picks up.

#!/usr/bin/env bash
# /usr/local/bin/write_inbox_metrics - emit Prometheus format to textfile dir
set -euo pipefail

OUT="/var/lib/node_exporter/textfile/inbox_placement.prom"
TMP="${OUT}.$$"
API="https://check.live-direct-marketing.online/api"
KEY="${INBOX_CHECK_API_KEY}"

{
  echo "# HELP inbox_placement_rate Inbox placement percent per domain"
  echo "# TYPE inbox_placement_rate gauge"

  for DOMAIN in mail.acme.io marketing.acme.io notify.acme.io; do
    RATE=$(curl -s --max-time 15 "$API/check/latest?domain=$DOMAIN" \
      -H "Authorization: Bearer $KEY" \
      | jq -r '.summary.inboxRate // empty')
    [[ -n "$RATE" ]] && echo "inbox_placement_rate{domain=\"$DOMAIN\"} $RATE"
  done
} > "$TMP" && mv "$TMP" "$OUT"

# /etc/cron.d/inbox_metrics
*/15 * * * * prometheus INBOX_CHECK_API_KEY=ic_live_xxx /usr/local/bin/write_inbox_metrics
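After a successful run, the textfile contains standard Prometheus exposition format (rates here are illustrative), and node_exporter re-reads it on every scrape:

```
# HELP inbox_placement_rate Inbox placement percent per domain
# TYPE inbox_placement_rate gauge
inbox_placement_rate{domain="mail.acme.io"} 92
inbox_placement_rate{domain="marketing.acme.io"} 88
inbox_placement_rate{domain="notify.acme.io"} 95
```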

Prometheus alerting rules

groups:
  - name: email_deliverability
    rules:
      - alert: InboxPlacementCritical
        expr: inbox_placement_rate < 65
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "Inbox placement critical: {{ $labels.domain }} at {{ $value }}%"
          runbook: "https://wiki.acme.io/runbooks/inbox-placement"

      - alert: InboxPlacementWarning
        expr: inbox_placement_rate < 80
        for: 6h
        labels:
          severity: warning
        annotations:
          summary: "Inbox placement degraded: {{ $labels.domain }} at {{ $value }}%"

Why the for: clause matters

Without for: 30m, a single noisy test result (one seed mailbox timing out) pages on-call. The for: clause requires the condition to hold for a window, which suppresses single-test blips. Critical alerts get 30 minutes; warnings get six hours.
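Before deploying, the rules file can be validated offline with promtool, which ships with Prometheus (the file path here is an assumption — use wherever your rule_files glob points):

```
promtool check rules /etc/prometheus/rules/email_deliverability.yml
```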

Common caveats

  • API key in monitoring config is a secret. Put it in Ansible Vault, Zabbix macros, or Kubernetes secrets — not in version control.
  • Monitor the monitor. If the check script itself fails (network timeout, DNS failure), Nagios returns UNKNOWN. Configure UNKNOWN to alert after two consecutive failures, not after one.
  • Do not check too often. Every 15 minutes for the metric write, every 6 hours for actual test generation. You do not need a new seed test every quarter-hour; the placement does not change that fast.

FAQ

Which platform should I pick if I am starting from scratch?

If you already monitor anything else, use that. If you are starting fresh in 2026, Prometheus + Alertmanager + Grafana is the default for most ops teams. Nagios is showing its age but still runs reliably on tiny boxes. Zabbix sits between them.

Can I use Datadog / New Relic / Grafana Cloud?

Yes. Any tool that ingests custom metrics via HTTP POST or statsd works. The shell script pattern is the same; just send the value to your vendor's ingest endpoint instead of writing it to a textfile collector.

Is a 6-hour check interval too slow?

No. Placement is a trailing signal; real regressions take hours to manifest because MTAs learn about domain reputation over a rolling window. Sub-hourly checks pay for more noise than signal.

What about dependency alerts? E.g. SPF/DKIM DNS records breaking?

Separate check, same pattern. See the DNS records monitoring article for the DNS-side checks.