Engineering Monty: Scaling an AI Interviewer
Every nine seconds, someone starts an interview with Monty, our AI interviewer. That someone might be a software engineer, a banker, a farmer — and none of them are talking to a person. Around 10,000 of these conversations happen every day, each one fifteen minutes long, across hundreds of job categories. Three engineering problems make this hard: keeping every session alive, making turn-taking feel natural, and making sure every candidate gets an interview that's right for them.
Keeping every session alive
Every interview has to work. A candidate who gets a broken session doesn't get a second chance at that role. Each session runs in its own container on Modal — spun up on demand, torn down when the call ends. A crash in one container affects one interview and nothing else. In 2025, we scaled from a few hundred interviews each week to over ten thousand a day without changing the hosting model. What did change was how we handle cold starts.
When a candidate clicks “Start Interview,” the session should begin immediately. A fresh container takes several seconds to boot and spin up a video room — long enough that a candidate would notice they're waiting. We handle this with a warm pool: Modal keeps about 30 containers pre-booted at the compute level, and a background job running every five minutes keeps about 10 fully initialized — room URL registered in Redis, ready to go. When a session starts, it grabs one in well under 200ms. The harder problem is calibrating pool size: too few and cold starts leak through at peak; too many wastes compute. We track demand by hour and size the pool ahead of it.
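The warm-pool mechanics can be sketched in a few lines. This is illustrative, not our production code: the real pool lives in Redis and boots containers on Modal, while here an in-memory class and a fake `boot_session` callable stand in for both.

```python
import time
from collections import deque

WARM_TARGET = 10  # fully initialized sessions kept ready (figure from the text)

class WarmPool:
    """Minimal sketch of the warm pool: a queue of pre-initialized sessions."""

    def __init__(self, target: int):
        self.target = target
        self._ready: deque[str] = deque()

    def refill(self, boot_session) -> None:
        """Background job, run every few minutes: top the pool back up."""
        while len(self._ready) < self.target:
            self._ready.append(boot_session())  # room URL registered, ready to go

    def acquire(self, boot_session):
        """Called when a candidate clicks Start: a warm hit is near-instant."""
        if self._ready:
            return self._ready.popleft(), "warm"
        return boot_session(), "cold"  # pool exhausted: pay the boot cost

pool = WarmPool(target=WARM_TARGET)
pool.refill(lambda: f"room-{time.time_ns()}")
room_url, path = pool.acquire(lambda: "cold-room")
```

The interesting part is the second argument to `acquire`: even with a well-sized pool, the cold path has to exist, because a demand spike can drain the pool faster than the refill job runs.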
The same logic applies everywhere else in the stack. Audio runs on Daily — Pipecat's WebRTC layer — handling peer connections, media routing, and cloud recording so recordings land in S3. Speech recognition, the LLM, and text-to-speech each run across a mix of commercial APIs and open-source models, with automatic failover at each stage — any individual outage is invisible to the candidate. Whenever we make a config change, we blue-green it over a week, meaning a bad prompt update gets caught well before it ruins a thousand interviews.
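The failover pattern at each stage is simple: an ordered list of providers, fall through on any error. A hedged sketch — the provider names and `call` stubs below are stand-ins, not our actual vendor integrations:

```python
from typing import Callable, Sequence

def with_failover(providers: Sequence[tuple[str, Callable[[str], str]]],
                  text: str) -> str:
    """Try each provider in order; the first healthy one wins."""
    last_err: Exception | None = None
    for name, call in providers:
        try:
            return call(text)
        except Exception as err:   # timeout, rate limit, outage...
            last_err = err         # invisible to the candidate; try the next
    raise RuntimeError("all providers failed") from last_err

# usage sketch: a flaky primary falls through to a healthy backup
def flaky_primary(_: str) -> str:
    raise TimeoutError("primary TTS down")

audio = with_failover(
    [("primary", flaky_primary), ("backup", lambda t: f"audio:{t}")],
    "Hello",
)
```

In the real pipeline each of speech recognition, the LLM, and text-to-speech gets its own provider list, so an outage at one stage never takes down the others.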
Getting turn-taking right
Sounding like a human is hard!
A response that would land naturally at 800ms feels like an interruption at 400ms and dead air at 1500ms. The pipeline runs on Pipecat, an open-source voice AI framework created by Daily, fully streaming end-to-end: speech recognition runs continuously, turn detection fires as soon as silence is detected, and TTS starts synthesizing on the LLM's first sentence before the rest has finished generating. For turn detection, we use smart-turn-v3 — an ONNX model from the Pipecat team, running on Modal — at P50 ~150ms. Add LLM first-token (~350ms) and TTS first audio (~200ms), and the median time from candidate silence to Monty's first word is about 700ms. Anything past a second starts to feel broken.
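The budget above adds up as a quick sanity check. The stage figures are the P50 numbers quoted in the text:

```python
# Median latency budget from candidate silence to Monty's first word (ms).
BUDGET_MS = {
    "turn_detection": 150,   # smart-turn-v3 on Modal, P50
    "llm_first_token": 350,
    "tts_first_audio": 200,
}
total_ms = sum(BUDGET_MS.values())
```

That lands at 700ms — inside the 800ms sweet spot, with some headroom before the one-second mark where responses start to feel broken.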
If you cut in too early, you end up interrupting a candidate mid-thought; wait too long and it starts to feel like lag. After tuning against candidate self-reported experience and session completion rates, we settled on 900ms as the production threshold — enough pause that candidates feel heard without letting silence linger. This breaks down into three parameters: a 120ms floor that prevents triggering on mid-sentence pauses, a 1.6s ceiling that acts as a hard fallback, and a segment-dependent threshold between them, plus some special handling around VAD. These numbers came out of rounds of A/B testing against session completion rates; none was set once and left alone.
Threshold tuning only gets you so far. Short acknowledgments — “yes,” “uh-huh,” “got it” — often lack trailing silence: the candidate is done, but voice activity detection (VAD) never fires because the signal is too brief. A 400ms aggregation timeout catches these; without it, Monty waits indefinitely for speech that has already ended.
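Put together, the turn-end decision looks roughly like the sketch below. The function and its name are illustrative; the constants are the ones quoted above, and `threshold_ms` stands in for the segment-dependent tuning:

```python
FLOOR_MS = 120        # ignore mid-sentence pauses shorter than this
CEILING_MS = 1600     # hard fallback: always yield the turn by here
DEFAULT_THRESHOLD_MS = 900
AGG_TIMEOUT_MS = 400  # catches "yes"/"uh-huh" with no trailing silence

def should_take_turn(silence_ms: float, *, vad_fired: bool,
                     threshold_ms: float = DEFAULT_THRESHOLD_MS) -> bool:
    """Decide whether Monty should start speaking, given silence so far."""
    if silence_ms >= CEILING_MS:
        return True                       # never let dead air run past the ceiling
    if not vad_fired:
        # short acknowledgment: VAD never registered speech ending,
        # so fall back to the aggregation timeout
        return silence_ms >= AGG_TIMEOUT_MS
    if silence_ms < FLOOR_MS:
        return False                      # mid-sentence breath, keep listening
    return silence_ms >= threshold_ms
```

The floor and ceiling bound the behavior; everything interesting happens in the band between them, which is where the per-segment tuning lives.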
The other gap is echo: Monty's TTS output can leak through the candidate's microphone and get transcribed as candidate speech. We run a simple LLM-based classifier that detects when the candidate is essentially repeating Monty back to itself, and discards those turns. Both problems are invisible at small scale.
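The production check is an LLM classifier; a token-overlap heuristic is a much cruder stand-in, but it illustrates what the classifier decides (the function and threshold here are invented for illustration):

```python
def looks_like_echo(monty_said: str, candidate_turn: str,
                    threshold: float = 0.8) -> bool:
    """Flag a candidate 'turn' that is mostly Monty's own words played back."""
    monty_words = set(monty_said.lower().split())
    candidate_words = candidate_turn.lower().split()
    if not candidate_words:
        return False
    overlap = sum(1 for w in candidate_words if w in monty_words) / len(candidate_words)
    return overlap >= threshold   # mostly Monty's own words: discard the turn

is_echo = looks_like_echo("Tell me about your last project",
                          "tell me about your last project")
```

A real echo can be partial, paraphrased by the ASR, or mixed with genuine speech, which is why the production check is a classifier rather than a string comparison.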
Giving every candidate the right interview
Mercor runs interviews across hundreds of job categories — engineers, bankers, lawyers, data scientists. The obvious approach is one assessment per job title; we built that first, and it doesn't scale. Maintaining hundreds of distinct assessments is expensive, and small variations between nearly-identical roles — “backend engineer” vs. “platform engineer” — don't justify separate configs. For candidates, retaking the same interview for every role they apply to is pointless.
We cluster job listings by the skills they actually test, weighted by candidate volume and hiring outcomes. The clusters are fewer than you'd expect — most job titles, stripped of their labels, test a fairly small number of underlying things. The dominant one is what we call Domain Expert: a single assessment that covers medicine, economics, history, law, software architecture, and almost any other knowledge domain. It accounts for over 70% of all sessions. Add code and language assessments and you've covered 90% of volume with three types.
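A toy version of the clustering idea: greedily group listings whose skill sets overlap enough, measured by Jaccard similarity. The listings, skills, and 0.5 cutoff below are invented for illustration; the production system also weights by candidate volume and hiring outcomes.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap between two skill sets, 0.0 (disjoint) to 1.0 (identical)."""
    return len(a & b) / len(a | b)

def cluster(listings: dict[str, set[str]], cutoff: float = 0.5) -> list[list[str]]:
    clusters: list[tuple[set[str], list[str]]] = []
    for title, skills in listings.items():
        for proto, members in clusters:
            if jaccard(skills, proto) >= cutoff:
                members.append(title)
                proto |= skills        # grow the cluster's skill profile
                break
        else:
            clusters.append((set(skills), [title]))
    return [members for _, members in clusters]

groups = cluster({
    "backend engineer":  {"python", "apis", "databases"},
    "platform engineer": {"python", "apis", "kubernetes"},
    "economist":         {"econometrics", "statistics"},
})
```

Strip the titles and the backend and platform listings collapse into one cluster — which is the whole point: the labels differ far more than the skills being tested.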
Completing Domain Expert once qualifies a candidate for every role in that cluster, with no retakes required. We deploy unified assessments as the default for all listings. Today, more than half the offers on Mercor go out proactively — the candidates didn't apply to those jobs, they just took an interview some time ago.
The interview is personalized before it starts. We process the candidate's resume before the session begins, and that context shapes what gets asked, how deeply to probe, and what to skip. For coding interviews, the problem is generated fresh from that profile and the live conversation: a Go engineer and a Python engineer get different starter code; a senior candidate gets a more open-ended problem than a new graduate.
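How resume context shapes a generated coding problem can be sketched as prompt assembly. The profile fields and prompt wording below are hypothetical stand-ins, not our actual prompts:

```python
def coding_problem_prompt(profile: dict) -> str:
    """Build a problem-generation prompt from a (hypothetical) resume profile."""
    scope = ("an open-ended design problem" if profile["seniority"] == "senior"
             else "a well-scoped implementation problem")
    return (
        f"Generate {scope} for a {profile['language']} engineer.\n"
        f"Provide starter code in {profile['language']}.\n"
        f"Probe deeper on: {', '.join(profile['strengths'])}."
    )

prompt = coding_problem_prompt({
    "language": "Go",
    "seniority": "senior",
    "strengths": ["concurrency", "APIs"],
})
```

The same profile also feeds the live conversation, so the problem can shift mid-interview as the candidate reveals more about what they know.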
All of this makes interviews efficient. While we think the AI interviewing experience is fun, we try to make the most of the time candidates choose to spend with us.
