The free tier of Inbox Check runs on a single dedicated server. Not a cluster, not an autoscaling group, not Kubernetes. One box. It handles roughly 1,000 inbox placement tests a day, returns results in under two minutes, and costs us less than a Netflix subscription in compute. This is the architecture that makes that possible.
The constraints
Every architecture decision is downstream of three hard constraints we set at the start:
- Free tier must stay free. If we cannot serve a reasonable volume on a single server we cannot keep the free tier. That forces efficiency in every layer.
- Sub-two-minute latency. A test starts when a user clicks Run. Results need to be useful before the user gets bored. Two minutes is the upper bound; one minute is the target.
- Twenty-plus provider coverage. Every test seeds every provider in the pool. We do not sample. A test either reports all providers or it reports none.
High-level shape
user (browser)
|
| POST /api/test (SMTP creds or one-click send address)
v
+---------------------------------------------+
| Next.js API (app router, node runtime) |
+---------------------------------------------+
|
| enqueue job
v
+---------------------------------------------+
| BullMQ (Redis) |
| test-queue --> worker(s) |
| seed-poll --> worker(s) |
| screenshot --> worker(s) |
+---------------------------------------------+
| | |
| | |
v v v
+--------------+ +--------------+ +--------------+
| seed pollers | | browser | | SMTP tx |
| (IMAP/API) | | queue | | (to seeds)|
| | | (Puppeteer) | | |
+--------------+ +--------------+ +--------------+
| | |
+--------+-------+--------------+
|
v
+------------------+
| Postgres |
| (state, logs, |
| results) |
+------------------+
|
| server-sent events
v
user (browser)
live result stream
The browser queue: Puppeteer with a bounded pool
The heaviest part of a placement test is reading seed mailboxes on providers that do not expose a usable API (Mail.ru, Yandex, some French providers, ProtonMail). For those we keep a logged-in Puppeteer session and navigate the web UI.
Puppeteer is memory-heavy. A naive implementation leaks Chrome processes and OOMs the server within hours. We solved this with a bounded browser queue:
- At most N (currently 6) concurrent Chrome instances across the server. A seventh request waits in a FIFO queue.
- Each Chrome instance is recycled every 50 jobs or 30 minutes, whichever comes first. No long-lived sessions.
- Session cookies for each seed are persisted to disk and restored on Chrome start, so we do not log in every recycle.
- A watchdog kills any Chrome that exceeds 800MB RSS or 60s on a single navigation. The job is retried on a fresh browser.
Bounded pool plus aggressive recycling is the whole trick. The server stays stable for weeks between planned restarts.
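The acquire/recycle mechanics can be sketched as a small pool class. This is a simplified model, not our production code: `Browser` stands in for a Puppeteer instance, `launch`/`close` stand in for `puppeteer.launch()` and `browser.close()`, and the RSS/navigation watchdog and cookie restore are omitted.

```typescript
type Browser = { id: number; jobsDone: number };

class BoundedPool {
  private idle: Browser[] = [];
  private waiters: ((b: Browser) => void)[] = [];
  private live = 0;
  private nextId = 0;

  constructor(
    private max: number,            // hard cap on concurrent browsers
    private recycleAfter: number,   // recycle after this many jobs
    private launch: (id: number) => Browser,
    private close: (b: Browser) => void,
  ) {}

  // Acquire a browser, waiting FIFO if the pool is at capacity.
  private acquire(): Promise<Browser> {
    const b = this.idle.pop();
    if (b) return Promise.resolve(b);
    if (this.live < this.max) {
      this.live++;
      return Promise.resolve(this.launch(this.nextId++));
    }
    return new Promise((resolve) => this.waiters.push(resolve));
  }

  // Return a browser, recycling it once it has done enough jobs.
  private release(b: Browser): void {
    b.jobsDone++;
    if (b.jobsDone >= this.recycleAfter) {
      this.close(b);
      b = this.launch(this.nextId++); // fresh instance, same pool slot
    }
    const waiter = this.waiters.shift();
    if (waiter) waiter(b);
    else this.idle.push(b);
  }

  // Run one job on a pooled browser.
  async run<T>(job: (b: Browser) => Promise<T>): Promise<T> {
    const b = await this.acquire();
    try {
      return await job(b);
    } finally {
      this.release(b);
    }
  }
}
```

Production layers the 30-minute timer and the 800MB/60s watchdog on top of the same acquire/release path; a watchdog kill is just an early `close` plus a retry through `run`.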
Seed-mailbox pollers
Providers that offer a usable API (Gmail, Google Workspace, Microsoft 365, Outlook, Zoho, FastMail) get a much lighter path. Each of those seeds has a poller that watches for new mail and reports folder placement to the main job.
Gmail uses the Gmail API with watch (push) notifications over a Pub/Sub-style endpoint we host ourselves. Microsoft uses Graph with change notifications. IMAP-only providers get IDLE connections that wake on new mail. The result is that for API-friendly providers, a test message's arrival is observed within a few seconds of delivery; for web-UI providers, we poll the UI every 10–15 seconds during a test's active window.
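For the web-UI providers, the per-seed loop amounts to the sketch below. `checkSeed` is a hypothetical stand-in for the Puppeteer navigation that inspects the mailbox; the interval and deadline come from the test's active window.

```typescript
type Placement = { folder: string } | null;

// Poll a seed until the test message is observed or the window expires.
async function pollSeed(
  checkSeed: () => Promise<Placement>, // resolves with a folder once the message lands
  intervalMs: number,                  // 10-15s in production
  windowMs: number,                    // the test's active window
): Promise<Placement> {
  const deadline = Date.now() + windowMs;
  while (Date.now() < deadline) {
    const placement = await checkSeed();
    if (placement) return placement;            // observed: report the folder
    await new Promise((r) => setTimeout(r, intervalMs));
  }
  return null; // window expired without observing the message
}
```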
Postgres schema overview
The data model is small. Four tables do most of the work:
tests
id uuid primary key
created_at timestamptz
sender_domain text
message_id text
status enum(queued, sending, observing, done, failed)
summary jsonb
seed_results
id uuid primary key
test_id uuid references tests
provider text -- gmail_consumer, outlook_consumer, ...
folder text -- inbox | spam | promotions | updates | unknown
observed_at timestamptz
screenshot_url text null
seed_mailboxes
id uuid primary key
provider text
account_id text -- opaque identifier, not exposed
state enum(active, soaking, retiring, archived, suspect)
canary_score jsonb
incidents
id uuid primary key
started_at timestamptz
resolved_at timestamptz null
severity enum(sev1, sev2, sev3)
component text
postmortem_url text null
A test's result is the join of tests and every seed_results row that carries its id. We use a partial index on status IN ('queued', 'sending', 'observing') to keep the hot path small.
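The partial index is a one-liner. This is hypothetical DDL (the indexed column is illustrative); the predicate matches the status values above:

```sql
-- Only in-flight tests are indexed, so the index stays tiny no matter
-- how large the tests table grows.
CREATE INDEX tests_active_idx
  ON tests (created_at)
  WHERE status IN ('queued', 'sending', 'observing');
```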
BullMQ for job orchestration
BullMQ on top of Redis is the backbone. Three logical queues: test-queue (orchestrates a run), seed-poll (checks a seed for the expected message), screenshot (captures the folder screenshot once placement is known). Each has its own concurrency cap tuned to the server's CPU and memory budget.
One thing we rely on that is not obvious from BullMQ's documentation: the repeatable-job feature drives the canary campaign. Every hour a single scheduled job fans out into one known-good and one known-bad send per seed. No cron, no systemd timer, no second system to operate.
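The fan-out step is simple enough to sketch. The parent job is registered once with BullMQ's repeat option (`{ repeat: { every: 3_600_000 } }`); when it fires, it expands to one known-good and one known-bad send per seed. Seed names and job shape here are illustrative, not our real identifiers:

```typescript
type CanaryJob = { seed: string; kind: "known-good" | "known-bad" };

// Called by the hourly repeatable job: produce the full canary batch.
function fanOutCanary(seeds: string[]): CanaryJob[] {
  return seeds.flatMap((seed) => [
    { seed, kind: "known-good" as const }, // should land in the inbox
    { seed, kind: "known-bad" as const },  // should land in spam
  ]);
}
```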
SSE for live results
The user's browser holds a server-sent events connection to /api/tests/:id/stream for the duration of the run. As each seed reports placement, a row is written to seed_results and the SSE handler pushes an update. SSE is perfect for this: one-way, survives proxies, trivially resumable if the connection drops mid-run.
We considered WebSockets and rejected them — bidirectional is not needed, and SSE is cheaper on the server.
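A sketch of the frame the stream handler writes per seed result (field names illustrative): the `id` field is what makes resumption work, since the browser replays it as `Last-Event-ID` on reconnect.

```typescript
// Per the SSE wire format, a frame is newline-delimited fields ending in a
// blank line. JSON.stringify keeps `data` to a single line, which the
// format requires.
function sseFrame(id: string, event: string, data: unknown): string {
  return `id: ${id}\nevent: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}
```

The handler sets `Content-Type: text/event-stream`, disables response buffering, and calls `res.write(sseFrame(...))` as each seed_results row lands.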
Process layout
In production the box runs three Node processes under PM2: the Next.js web app, the BullMQ worker pool, and the Puppeteer browser-queue worker. Postgres and Redis are local, served over the loopback interface. Total resident memory under full load is about 6GB.
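An illustrative PM2 ecosystem file for that layout; the entry-point paths are assumptions, not our real ones:

```javascript
// ecosystem.config.js: one entry per long-lived process.
module.exports = {
  apps: [
    { name: "web", script: "node_modules/.bin/next", args: "start" },
    { name: "workers", script: "dist/workers.js" },       // BullMQ worker pool
    { name: "browser-queue", script: "dist/browser.js" }, // Puppeteer pool
  ],
};
```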
Cost breakdown
- Dedicated server: roughly $90/month for an 8-core / 32GB box.
- Residential proxies for a few providers: $40/month.
- Workspace / paid seed accounts: $150–200/month.
- Domain, DNS, email-sending ESP for transactional: ~$30/month.
- Object storage for screenshots: ~$5/month at current volume.
Total monthly run cost is a few hundred dollars. At 1,000 tests/day (roughly 30,000 a month), that works out to about a cent per test, which is why we can keep the free tier unmetered.
Failure modes we hit (and fixed)
- Chrome memory leak. Solved with the 50-job / 30-minute recycle policy described above.
- Redis eviction on memory pressure. We were running Redis with the default maxmemory-policy, which silently evicted queued jobs. We switched to noeviction and sized maxmemory deliberately.
- Postgres autovacuum freeze on the results table. Heavy insert traffic plus jsonb summaries produced a high bloat ratio. Partitioning by month and running a scheduled VACUUM FULL on partitions older than 60 days sorted it out.
- Provider rate limits. Gmail will throttle the Gmail API if you hit it too hard across a single project. We split traffic across two Google Cloud projects and put a token bucket in front of the client.
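The token bucket in front of the Gmail client can be sketched as follows. Capacity and refill rate here are illustrative; the real numbers are tuned to Google's per-project quotas.

```typescript
class TokenBucket {
  private tokens: number;
  private last: number;

  constructor(
    private capacity: number,     // burst size
    private refillPerSec: number, // sustained request rate
    now = Date.now(),
  ) {
    this.tokens = capacity;
    this.last = now;
  }

  // Returns true if a request may proceed now; false means back off.
  // `now` is injectable so behavior is deterministic under test.
  tryRemove(now = Date.now()): boolean {
    const elapsedSec = (now - this.last) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.last = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}
```

A `false` from `tryRemove` maps to the job being delayed and retried rather than hitting Google and burning quota on a 429.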
What will break at 10,000/day
We are honest about where the current design runs out:
- Browser queue becomes a bottleneck. 6 concurrent Chrome instances cannot service 10,000 tests/day with 20+ providers unless we grow the queue, which means more RAM, which means a second server.
- Postgres writes. At 10,000 tests/day the seed_results table adds ~200,000 rows a day. Partitioning covers it for a year; after that we will want to push older partitions to cold storage.
- Seed pool load. A single seed mailbox receiving 10,000 test messages a day on a consumer provider is a red flag. We would add a second seed per provider and round-robin between them.
None of these is a rewrite. Each is a routine scaling step. The current box comfortably handles 2,000–3,000 tests/day on a busy afternoon, which is plenty of headroom for the free tier.