
Autonomous deliverability agent — A2A + MCP demo

An autonomous agent that discovers Inbox Check via A2A, calls our MCP tools, interprets the placement result, and proposes DNS fixes — all without a human in the loop. Here is the working demo.

Most "AI agent" demos are thin wrappers around a single LLM call. This one is not. The agent below discovers a third-party service it has never seen before, negotiates capabilities, runs a multi-step tool chain, and produces a structured diagnosis of why an email campaign is failing — without anyone reviewing intermediate steps. The entire trace runs in about 90 seconds.

Below is the stack, the five steps the agent takes, a real trace from a run this week, and an honest list of the places where autonomy still breaks.

What 'autonomous' means here

No human approves intermediate steps. The operator types one sentence ("check deliverability for acme.com") and the agent drives the entire loop — discover tools, run tests, interpret results, propose fixes. The only human review is on the final output. This is sometimes called a level-3 agent.

The stack

  • Model: Claude Opus 4.6. Reasoning quality matters more than latency for this loop.
  • A2A client: a small TypeScript library that fetches .well-known/agent.json, validates the card, and negotiates transport.
  • MCP adapter: ldm-inbox-check-mcp running as a local stdio subprocess. The A2A client chose this over plain JSON-RPC because the agent card advertised it as the preferred transport.
  • Orchestration: LangGraph is optional. A plain while loop over tool-use messages works for a task this small. We used LangGraph because we want the same harness for multi-agent handoffs later.
  • Output schema: a Zod schema the model must fill — findings, severity, recommended DNS changes, placement summary.
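
The output schema can be sketched as follows. The production harness uses Zod; to keep this sketch dependency-free it is shown as a plain TypeScript type with a minimal hand-rolled validator, and the field names are illustrative rather than the exact production schema:

```typescript
// Hypothetical shape of the diagnosis object the model must fill.
// In production this is a Zod schema; a manual check is shown here
// so the sketch runs with no third-party dependencies.
type Severity = "low" | "medium" | "high";

interface Diagnosis {
  root_cause: string;
  severity: Severity;
  fixes: string[];                       // smallest sufficient set, not every idea
  per_provider: Record<string, string>;  // e.g. { gmail: "mixed ..." }
}

function validateDiagnosis(raw: unknown): Diagnosis | null {
  if (typeof raw !== "object" || raw === null) return null;
  const d = raw as Partial<Diagnosis>;
  const okSeverity =
    d.severity === "low" || d.severity === "medium" || d.severity === "high";
  const okFixes =
    Array.isArray(d.fixes) && d.fixes.every(f => typeof f === "string");
  const okProviders =
    typeof d.per_provider === "object" && d.per_provider !== null;
  if (typeof d.root_cause === "string" && okSeverity && okFixes && okProviders) {
    return d as Diagnosis;
  }
  return null; // the loop retries the model when validation fails
}
```

If the model's reply does not validate, the harness re-prompts with the validation error instead of accepting free-form prose.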

Step 1: discover the Inbox Check card

The operator provided one hostname: check.live-direct-marketing.online. The agent's first call is to fetch the well-known card. The A2A client returns a parsed object with capabilities, transport options and auth. At this point the agent knows what Inbox Check can do — without any pre-registered integration.
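
A minimal sketch of that first call, assuming the card layout described in this post (version, capabilities, transports, optional auth); the exact JSON field names are an assumption, not a formal A2A specification:

```typescript
// Fetch and shape-check a .well-known agent card.
// Uses the global fetch available in Node 18+.
interface AgentCard {
  version: string;
  capabilities: string[];
  transports: string[];
  auth?: { scheme: string };
}

function parseAgentCard(json: unknown): AgentCard {
  const c = json as Partial<AgentCard>;
  if (typeof c?.version !== "string" ||
      !Array.isArray(c.capabilities) ||
      !Array.isArray(c.transports)) {
    throw new Error("malformed agent card");
  }
  return c as AgentCard;
}

async function fetchAgentCard(
  host: string,
  fetchImpl: typeof fetch = fetch, // injectable for testing
): Promise<AgentCard> {
  const res = await fetchImpl(`https://${host}/.well-known/agent.json`);
  if (!res.ok) throw new Error(`card fetch failed: ${res.status}`);
  return parseAgentCard(await res.json());
}
```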

Step 2: capability negotiation

The card lists both jsonrpc and mcp transports. The client picks MCP because it has the adapter available. It reads the auth section, finds an api_key scheme, and pulls the key from the local env. The capabilities relevant to this run are start_test, get_test, check_auth, and check_blacklist.
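
The negotiation logic is small enough to sketch in full. Function names and the `INBOX_CHECK_API_KEY` environment variable are illustrative assumptions, not a formal A2A client API:

```typescript
// Prefer MCP when the local adapter is installed; otherwise fall
// back to plain JSON-RPC, as described above.
function pickTransport(advertised: string[], mcpAdapterInstalled: boolean): string {
  if (advertised.includes("mcp") && mcpAdapterInstalled) return "mcp";
  if (advertised.includes("jsonrpc")) return "jsonrpc";
  throw new Error(`no supported transport among: ${advertised.join(", ")}`);
}

// Resolve credentials from the card's auth section plus the local env.
// The env var name is a hypothetical convention for this demo.
function resolveApiKey(
  authScheme: string,
  env: Record<string, string | undefined>,
): string {
  if (authScheme !== "api_key") {
    throw new Error(`unsupported auth scheme: ${authScheme}`);
  }
  const key = env["INBOX_CHECK_API_KEY"];
  if (!key) throw new Error("INBOX_CHECK_API_KEY not set");
  return key;
}
```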

Step 3: MCP tool-call loop

With tools registered, the model starts a loop:

  1. Call check_auth on the domain — see if SPF/DKIM/DMARC are already sane.
  2. Call start_test with a realistic sample HTML (the operator pastes one in, or the agent fetches the latest template from a provided URL).
  3. Poll get_test every 15 seconds until status is complete.
  4. If the per-provider placement shows a cliff at one provider, call check_blacklist on the sending IP.
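
The core of step 2 and step 3 above can be sketched as a polling loop. `callTool` stands in for the MCP client's tool invocation and its return shapes are assumptions based on the trace below; the 15-second interval matches the description above:

```typescript
// Start a placement test, then poll get_test every 15 seconds until
// it completes, with a hard cap so a stuck test cannot loop forever.
type ToolCall = (name: string, args: Record<string, unknown>) => Promise<any>;

async function runPlacementTest(
  callTool: ToolCall,
  from: string,
  html: string,
  sleep: (ms: number) => Promise<void> = ms => new Promise(r => setTimeout(r, ms)),
): Promise<any> {
  const { test_id } = await callTool("start_test", { from, subject: "Spring news", html });
  for (let i = 0; i < 20; i++) {            // cap polling at ~5 minutes
    const result = await callTool("get_test", { id: test_id });
    if (result.status === "complete") return result;
    await sleep(15_000);
  }
  throw new Error("placement test timed out");
}
```

Injecting `sleep` keeps the loop testable without waiting out real 15-second intervals.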

Step 4: diagnosis reasoning

Once every tool has returned, the model sees a structured result: counts of inbox/spam/missing per provider, an auth verdict, optional DNSBL hits, a SpamAssassin score. The prompt asks it to produce the most likely root cause and the smallest sufficient fix — not every possible improvement. The Zod schema constrains the shape of the answer, so the model cannot hand-wave.

Step 5: structured report and proposed fixes

The final object is emitted as JSON and rendered into a short human summary. If the agent is wired into a ticketing system, this is the payload for a new ticket; if into a chat, it is the message.
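
The rendering step is a straightforward projection of the validated JSON. A minimal sketch, with field names matching the trace below and purely illustrative formatting:

```typescript
// Turn the validated diagnosis into the short human summary that
// becomes a ticket body or chat message.
interface Report {
  root_cause: string;
  severity: string;
  fixes: string[];
}

function renderSummary(r: Report): string {
  const lines = [
    `[${r.severity.toUpperCase()}] ${r.root_cause}`,
    ...r.fixes.map((f, i) => `  ${i + 1}. ${f}`),
  ];
  return lines.join("\n");
}
```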

Real trace from a run

Below is an abbreviated log from a run against a production cold-outreach sender. Domain and keys redacted.

[00:00] operator> check deliverability for acme.io, here is the HTML (attached)
[00:01] a2a.fetch_card  host=check.live-direct-marketing.online
        → 200 OK, card v1.2.0, 5 capabilities, transports=[jsonrpc, mcp]
[00:01] a2a.pick_transport  → mcp (ldm-inbox-check-mcp installed locally)
[00:02] mcp.call check_auth {domain:"acme.io"}
        → spf:softfail (include missing for SendGrid)
          dkim:pass (selector s1)
          dmarc:present p=none rua=report@acme.io
[00:03] mcp.call start_test {from:"news@acme.io", subject:"Spring news", html:"..."}
        → test_id=t_01j8q3x7n8...
[00:18] mcp.call get_test {id:"t_01j8q3x7n8..."}
        → status=running  inbox=3/20  spam=1/20  missing=16/20
[00:33] mcp.call get_test {id:"t_01j8q3x7n8..."}
        → status=complete
          inbox=11/20  spam=6/20  missing=3/20
          gmail: 2/4 inbox, 2/4 spam
          outlook: 1/4 inbox, 3/4 spam
          yahoo: 4/4 inbox
          mail.ru: 0/4 (all dropped — suspicious)
          spamassassin=3.8
[00:34] mcp.call check_blacklist {ip:"149.72.x.x"}
        → spamhaus_zen:listed (PBL — residential-like assignment)
          sorbs:clean
          barracuda:clean
[00:36] model.diagnose  →
          root_cause: "SPF softfail due to missing SendGrid include + PBL listing of sending IP"
          severity: "high"
          fixes:
            - "Publish SPF: v=spf1 include:sendgrid.net include:_spf.google.com -all"
            - "Request delisting of 149.72.x.x from Spamhaus PBL via sender.spamhaus.org"
            - "Move DMARC from p=none to p=quarantine once SPF passes for 14 days"
          per_provider:
            gmail: "mixed — SPF alignment is the likely driver"
            outlook: "worst placement — SmartScreen reacts to SPF softfail + PBL"
            mail.ru: "all dropped — PBL listings hit Mail.ru hardest"
[00:36] agent→operator  rendered summary + ticket draft

What the agent did not do

It did not rewrite the HTML, touch DNS, or change any account setting. Everything it touched was read-only (check_* and get_*) or a test send against seed mailboxes. The only outputs are a report and a proposed DNS change. This is the safety boundary by design.

Safety constraints

  • Read-only MCP surface. No delete_*, publish_* or billing capabilities exposed to the agent.
  • Output schema gate. The model's reply must validate against the Zod schema or the loop retries.
  • Rate-limit awareness. The MCP client respects HTTP 429 and backs off. A runaway loop hits the rate limit before it hits the bill.
  • Human review on high-severity. If severity=high, the agent pauses before the final action (file the ticket) and waits for a human nod.
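
The rate-limit constraint above amounts to a retry wrapper. A minimal sketch, assuming a simple exponential schedule; `doCall` and the status shape are illustrative, not the actual MCP client API:

```typescript
// Retry on HTTP 429 with exponential backoff so a runaway loop
// slows down instead of hammering the server.
async function callWithBackoff<T>(
  doCall: () => Promise<{ status: number; body?: T }>,
  maxRetries = 5,
  sleep: (ms: number) => Promise<void> = ms => new Promise(r => setTimeout(r, ms)),
): Promise<T> {
  let delay = 1000;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const res = await doCall();
    if (res.status !== 429) return res.body as T;
    await sleep(delay);
    delay *= 2; // 1s, 2s, 4s, ...
  }
  throw new Error("rate limit: retries exhausted");
}
```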

Where this breaks today

A few honest limitations we hit in practice:

  • Long tool timeouts. Placement tests take 30–120 seconds. Most MCP clients are tuned for sub-second tools. Polling works but is fragile.
  • Multi-provider comparison. The model is bad at spotting a single-provider cliff unless you prompt it explicitly. We add a heuristic in the post-processing step.
  • Agent card drift. If we bump the capability schema without bumping the card version, clients cache stale shapes. Happens once a quarter.
  • Cost. Full run is ~40k tokens in + 4k out. Fine for a developer tool, expensive as an always-on monitor.
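
One workable version of the single-provider-cliff heuristic mentioned above: flag any provider whose inbox rate sits far below the average of its peers. The 0.5 threshold is an illustrative choice, not the value we ship:

```typescript
// Given per-provider inbox rates in [0, 1], return providers whose
// rate is more than 50 points below the mean of all other providers.
function findCliffProviders(inboxRates: Record<string, number>): string[] {
  const entries = Object.entries(inboxRates);
  return entries
    .filter(([name, rate]) => {
      const others = entries.filter(([n]) => n !== name).map(([, r]) => r);
      if (others.length === 0) return false;
      const mean = others.reduce((a, b) => a + b, 0) / others.length;
      return mean - rate > 0.5; // cliff: far below peer average
    })
    .map(([name]) => name);
}
```

On the trace above this flags mail.ru (0/4) while leaving Outlook's weaker-but-not-anomalous placement alone.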

Frequently asked questions

Is the agent code open source?

The MCP server (ldm-inbox-check-mcp) is. The orchestration harness used for this demo is a thin wrapper — the interesting code is 80 lines of TypeScript around a LangGraph state machine.

Can this replace a human deliverability consultant?

For routine diagnoses — SPF typos, DMARC cliffs, shared-IP blacklist hits — yes, it matches a junior consultant. For strategic decisions (domain strategy, IP warming, ESP migration) it is not close.

Why MCP instead of calling the REST API directly?

Discovery and composability. MCP lets the same harness plug into a DNS-management agent, a ticket system agent, and a mailbox provider agent in one conversation. REST would mean hand-writing orchestration for every combination.

Does the agent have my email content after the run?

The agent sees the HTML while the tool call is in-flight. Nothing is retained on Anthropic's side beyond the API's standard retention policy. The seed mailboxes delete message bodies after 30 days.

Check your deliverability across 20+ providers

Gmail, Outlook, Yahoo, Mail.ru, Yandex, GMX, ProtonMail and more. Real inbox screenshots, SPF/DKIM/DMARC, spam engine verdicts. Free, no signup.

Run Free Test →

Unlimited tests · 20+ seed mailboxes · Live results · No account required