AI agents · 10 min read

Build a deliverability agent with LangChain + MCP

An agent that runs a placement test, reads the authentication verdict, proposes DNS fixes, and opens a pull request against your repo. Built on LangChain + MCP + Inbox Check. Here is the code.

Claude Desktop is fine for one-off debugging, but the moment you want an agent that wakes up on a schedule, opens a pull request, or runs as part of a larger pipeline, you need a programmatic framework. LangChain is the obvious choice: it has first-class MCP support through langchain-mcp-adapters, a stable tool calling loop, and a thriving ecosystem. This article walks through building a working agent end-to-end.

What we are building

An agent that takes an HTML template and a sender domain, runs a placement test through the Inbox Check MCP server, reads the verdict, inspects your DNS (via another MCP server or direct API), proposes a fix, and opens a pull request against your infrastructure repo. End-to-end, in about 200 lines of Python.

The stack

  • LangChain — orchestration, tool loop, LLM abstractions.
  • langchain-mcp-adapters — bridges MCP servers into LangChain Tool objects.
  • Inbox Check MCP server (ldm-inbox-check-mcp) — exposes start_test, get_test, list_providers, list_test.
  • GitHub API — for the PR step (or your favourite VCS).
  • Any tool-calling LLM. We will use Claude Sonnet 4 for this walk-through, but GPT-4o or Gemini Flash both work.

Bootstrapping the project

Start fresh, pin versions, and use a virtualenv. The adapter library moves quickly — pinning saves pain.

mkdir deliverability-agent && cd deliverability-agent
python3 -m venv .venv
source .venv/bin/activate

pip install \
  "langchain>=0.3" \
  "langchain-anthropic>=0.3" \
  "langchain-mcp-adapters>=0.1" \
  "langgraph>=0.2" \
  "PyGithub>=2.3" \
  "python-dotenv>=1.0"

Create a .env file with your keys. Never commit it.

ANTHROPIC_API_KEY=sk-ant-...
INBOX_CHECK_API_KEY=ic_live_...
GITHUB_TOKEN=ghp_...
GITHUB_REPO=yourorg/infra

Connecting the MCP server from Python

langchain-mcp-adapters spawns the MCP server as a subprocess and wraps each tool as a LangChain-compatible StructuredTool. The agent then calls it exactly like any other tool.

# agent.py
import asyncio
import os
from dotenv import load_dotenv
from langchain_anthropic import ChatAnthropic
from langchain_mcp_adapters.client import MultiServerMCPClient
from langgraph.prebuilt import create_react_agent

load_dotenv()

async def build_agent():
    client = MultiServerMCPClient({
        "inbox_check": {
            "command": "npx",
            "args": ["-y", "ldm-inbox-check-mcp"],
            "transport": "stdio",
            "env": {
                "INBOX_CHECK_API_KEY": os.environ["INBOX_CHECK_API_KEY"],
            },
        },
    })

    tools = await client.get_tools()
    llm = ChatAnthropic(
        model="claude-sonnet-4-20250514",
        temperature=0,
        max_tokens=4096,
    )

    agent = create_react_agent(
        llm,
        tools,
        prompt=DELIVERABILITY_SYSTEM_PROMPT,
    )
    return agent

The create_react_agent helper from LangGraph gives you a standard ReAct loop: think, call tool, observe, repeat. For this use case that is the right abstraction — we do not need a custom graph.

The system prompt

The system prompt is the single biggest lever. Give the agent specific instructions about what to do, in what order, and what the output should look like. Vague prompts produce vague agents.

DELIVERABILITY_SYSTEM_PROMPT = """
You are an email deliverability engineer. When the user gives you an
HTML template and a sender domain, you will:

1. Call start_test with the template HTML and sender domain.
2. Poll get_test every 10 seconds until status == "complete".
3. Read the verdict: inbox rate, spam rate, auth pass/fail,
   per-provider placement, SpamAssassin score.
4. If authentication failed, explain which record is wrong and
   produce a concrete diff: the exact DNS record to add or change.
5. If SpamAssassin score > 5, list the triggered rules and propose
   content edits.
6. Produce a short summary (max 8 lines) suitable for a PR description.

Do NOT invent providers or DNS records. Only use data from tool calls.
If a tool call fails, report the error verbatim to the user.

Final output format (strict):
---
VERDICT: <PASS|FAIL>
INBOX: <n>/<total>
AUTH: SPF=<pass|fail>, DKIM=<pass|fail>, DMARC=<pass|fail>
TOP_ISSUES:
- <issue 1>
- <issue 2>
PROPOSED_FIX:
<exact DNS record or content change>
---
"""
Why the strict output format matters

If downstream code parses the agent's reply (to open a PR, post to Slack, file a ticket), free-form prose is a reliability disaster. A strict delimited block with named fields is cheap to parse with a regex and saves hours of flaky post-processing.

The tool calling loop

LangGraph handles the loop, but it helps to understand what is happening. Each iteration: the LLM either emits a final answer or a tool call, the adapter forwards the tool call over JSON-RPC to the MCP server, the server runs it, the result goes back to the LLM. Repeat until final answer or a step limit.

async def run_agent(html: str, domain: str):
    agent = await build_agent()

    user_message = (
        f"Run a placement test for sender domain {domain} "
        f"with this HTML template. Report verdict and propose a fix:\n\n"
        f"{html}"
    )

    result = await agent.ainvoke(
        {"messages": [("user", user_message)]},
        config={"recursion_limit": 25},
    )

    final = result["messages"][-1].content
    return final

if __name__ == "__main__":
    with open("template.html") as f:
        html = f.read()
    domain = "news.mybrand.com"
    print(asyncio.run(run_agent(html, domain)))

The recursion_limit caps tool iterations. For this workflow 25 is generous — a typical run uses 6–8 steps (start, poll x 3–5, read verdict, respond).
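If you would rather not spend LLM turns (and tokens) on polling, you can poll from plain Python and hand the agent only the finished verdict. A sketch, where `get_test` is any callable returning a dict with a `status` key — a stand-in for invoking the MCP tool of the same name:

```python
import time

def poll_until_complete(get_test, test_id, interval=10.0, timeout=300.0):
    # Poll a get_test-style callable until the test reports completion.
    # `get_test` here stands in for the MCP tool call of the same name.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = get_test(test_id)
        if result.get("status") == "complete":
            return result
        time.sleep(interval)
    raise TimeoutError(f"test {test_id} did not complete in {timeout}s")
```

This trades a few agent steps for determinism — useful once the workflow runs on a schedule rather than interactively.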

A sample run

Here is a lightly redacted transcript from a real run against a broken SPF record:

> Tool call: start_test
  args: { domain: "news.mybrand.com", html: "..." }
< { test_id: "t_01HXAB...", status: "queued" }

> Tool call: get_test
  args: { test_id: "t_01HXAB..." }
< { status: "running", progress: 0.3 }

> Tool call: get_test
  args: { test_id: "t_01HXAB..." }
< { status: "complete", inbox: 6, spam: 11, missing: 3,
    auth: { spf: "fail", dkim: "pass", dmarc: "fail" },
    spamassassin: 3.8, providers: { gmail: "spam", ... } }

> Final answer:
---
VERDICT: FAIL
INBOX: 6/20
AUTH: SPF=fail, DKIM=pass, DMARC=fail
TOP_ISSUES:
- SPF does not include your ESP sender IPs (SendGrid).
- DMARC fails because SPF fails and the from-domain does not align.
PROPOSED_FIX:
Update TXT record at news.mybrand.com:
  v=spf1 include:_spf.google.com include:sendgrid.net ~all
---

Error handling

Three classes of errors you should handle explicitly:

  1. MCP server crashed. The subprocess died. Respawn, or propagate the error. Do not silently retry forever.
  2. API rate limit (429). The Inbox Check tool layer returns a structured error. Backoff 10s, retry up to 3 times.
  3. Agent step budget exhausted. The recursion limit fired. Log the partial state and give the user a partial answer.

from langgraph.errors import GraphRecursionError

try:
    result = await agent.ainvoke(
        {"messages": [("user", user_message)]},
        config={"recursion_limit": 25},
    )
except GraphRecursionError:
    # Return whatever we have + a flag so the caller can decide.
    return {"status": "incomplete", "reason": "step budget exhausted"}

Wiring the PR step

The agent output above ends with a fenced block containing the proposed fix. Parse that block, turn it into a file edit, and open a PR.

import os
import re
import time

from github import Github, InputGitAuthor

FIX_RE = re.compile(
    r"PROPOSED_FIX:\s*(.*?)(?=\n---|$)", re.DOTALL
)

def open_pr(agent_output: str, verdict_summary: str):
    m = FIX_RE.search(agent_output)
    if not m:
        return None

    fix_text = m.group(1).strip()

    gh = Github(os.environ["GITHUB_TOKEN"])
    repo = gh.get_repo(os.environ["GITHUB_REPO"])

    branch = f"deliverability-fix/{int(time.time())}"
    base = repo.get_branch("main")
    repo.create_git_ref(ref=f"refs/heads/{branch}", sha=base.commit.sha)

    # Append the fix as a TODO in a tracked issues file.
    path = "dns/PENDING_CHANGES.md"
    contents = repo.get_contents(path, ref=branch)
    new_body = contents.decoded_content.decode() + \
        f"\n\n## Proposed by deliverability agent\n{fix_text}\n"

    repo.update_file(
        path=path,
        message="deliverability: propose DNS fix",
        content=new_body,
        sha=contents.sha,
        branch=branch,
        author=InputGitAuthor("deliverability-bot", "bot@example.com"),
    )

    pr = repo.create_pull(
        title="Deliverability: proposed fix",
        body=verdict_summary + "\n\n" + fix_text,
        head=branch,
        base="main",
    )
    return pr.html_url

Notice the agent does not edit Terraform or BIND zone files directly. It writes the proposal into a markdown file that a human reviews. That is deliberate — the next section explains.

Safety: dry-run vs apply

Never let an LLM make live DNS changes without a human review step. DNS mistakes are recoverable but disruptive — a bad SPF rewrite can nuke your inbox rate for a week. The right pattern:

  • Dry run (default): agent proposes, human reviews, CI runs a syntax validator, a human merges.
  • Apply: only after the PR is merged does the change propagate (via your standard infra pipeline, e.g. Atlantis or a GitHub Actions workflow).
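One simple way to encode the dry-run default is an explicit flag in front of the PR step. The `DRY_RUN` variable name is our convention for this sketch, not part of any SDK:

```python
import os

def maybe_open_pr(open_pr_fn, *args, **kwargs):
    # DRY_RUN defaults to on: the agent proposes, a human applies.
    # Only an explicit DRY_RUN=false lets the PR step run.
    if os.environ.get("DRY_RUN", "true").lower() != "false":
        return {"status": "dry_run", "note": "proposal not submitted"}
    return open_pr_fn(*args, **kwargs)
```

Defaulting to the safe path means a misconfigured cron job proposes instead of acts.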
Keep the agent read-only at the tool layer

The Inbox Check MCP server exposes read-only placement tools. It cannot change DNS, cannot send real mail, cannot touch production. That is by design. If you add more MCP servers to this agent (Cloudflare MCP, AWS MCP), wire them through a capability filter so the agent can only read — writes go through a human-approved PR.
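A minimal sketch of such a capability filter: pass the agent only an allowlist of tool names. The allowlist below contains the Inbox Check tools listed earlier; any write-capable tool from another MCP server simply never reaches the agent:

```python
# Tools the agent is allowed to call; everything else is dropped.
# These names match the Inbox Check MCP server's tool list above.
READ_ONLY_TOOLS = {"start_test", "get_test", "list_providers", "list_test"}

def filter_read_only(tools, allowed=READ_ONLY_TOOLS):
    # Works on any object with a .name attribute, which LangChain
    # StructuredTool instances have.
    return [t for t in tools if t.name in allowed]
```

Call it between `client.get_tools()` and `create_react_agent` so the unfiltered tool list never exists inside the agent.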

Next steps

The base agent is 200 lines. Things you can add:

  • Schedule it — cron + a long-lived sender domain, verdict snapshot in a database, alert if inbox rate drops below 85%.
  • Multi-campaign mode — accept a list of templates, run all in parallel (respect rate limits), compare.
  • Plug in a DNS-reading MCP (Cloudflare or Route53) so the agent can cross-reference the actual record with the proposed fix.
  • Add Slack or Linear as an output channel in addition to PRs.

Frequently asked questions

Can I use OpenAI or Gemini instead of Claude?

Yes. Replace ChatAnthropic with ChatOpenAI or ChatGoogleGenerativeAI. Tool calling works the same. Claude tends to handle strict output formats more reliably; GPT-4o is faster; Gemini Flash is cheaper.

How do I test the agent without burning API credits?

Record a few MCP tool calls with a fixture, then use LangChain's FakeListChatModel to replay a canned LLM response. For end-to-end tests, use the Inbox Check sandbox mode which returns deterministic responses.

Why LangGraph and not plain LangChain AgentExecutor?

AgentExecutor is being deprecated in favour of LangGraph. LangGraph has clearer state handling, better retries, and explicit step budgets. For a new project in 2026 there is no reason to use the old AgentExecutor.

Can the agent run without an LLM — just the MCP tool calls from Python?

Yes. Skip LangChain entirely and use the mcp Python client directly. That is the right choice for deterministic cron jobs. LangChain only adds value when you want natural-language input or reasoning between tool calls.