When volume grew 10x in a month and we had one week to fix it

In late July 2025, Mercor engineering took on an impossible mission: rewrite a critical, bottlenecked service called Contracts – in one week, with zero regressions in production. Internally, we called it John Wick.
I've shipped hundreds of projects across my career. Not one has shipped with zero regressions in production, even after dedicated bug bashes and QA. Rewriting an entire service in one week is close to impossible on its own. Stacking both felt like a dare.
We took it.
The resulting system was more than 10,000x more capable. Reliability improved by over 75x. Today it's handling workloads that would have collapsed the old system entirely, and is more reliable than our vendors' own SLAs.
If impossible missions inspire you, explore our open roles on our engineering, product and design teams: mercor.com/careers
Here's how we did it
I joined Mercor for its unprecedented growth and speed, and the level of challenge in the interview process made it clear the work would suit me. Mercor is on the Forbes AI 50 for the second year in a row, an annual list of the leading private AI companies produced with Sequoia and Meritech.
I started on the Payments team, working on critical features and correctness problems like idempotency. In late July, the VP of Engineering pulled me aside: the Contracts service was breaking, and they needed help.
I joined the team. After a day of analyzing the rising volumes and P90 metrics, I emailed our CTO and VP of Engineering.
Before August, the system handled around 3,000 active contracts a month, a scale it had been tuned for: 20 to 50 concurrent requests, each completing on the order of 100 seconds.
In early August, volume started climbing. Fast. Ops team members would extend or update an offer, wait five minutes, and watch the frontend time out. They had no idea if the operation had succeeded. Some would refresh and retry while the backend was still mid-operation. Partial successes stacked up. Nobody knew what state the system was in. Partial updates turned into KTLO (keep-the-lights-on) tickets.
Underneath all of it: most operations were making calls to external systems running at two nines of reliability, with aggressive rate limits and no proper retry or fallback logic in place. Every failure mode that had been hiding at low volume was now fully visible.
Gift of scale
Contracts is the system every contractor on Mercor flows through. Every hire, extension, update, transfer, offer acceptance, and onboarding goes through it. So does everything that defines the relationship between a company and a contractor – payment terms, permissions and access, eligibility, etc. When the Contracts service works, nobody notices. When it doesn't, a contractor doesn't know if they've been hired and a company can't tell if an update went through.
The original system had handled the first 10x — around 3,000 active contracts a month, 20-50 concurrent operations, each on the order of tens to hundreds of seconds. The next wave of growth was a different problem.
Between July and October, contract volume grew roughly another 10x. Total system operations grew more than 25x. The failure modes that had been latent at low volume became continuous at high volume: timed-out frontends, retries on operations that may or may not have already succeeded, partial updates that the database and the UI disagreed about.
The challenge
Scale reveals everything. Every assumption baked into the original design becomes a liability the moment volume climbs past what the system was built for. Contracts had several.
1. Complexity you can't see until it breaks
A single hire request was synchronously touching: company data, IAM permissions, referral checks, payment readiness, onboarding steps, background check status, and multiple external vendors.
At 20 concurrent operations, this was fine. At 100+, every downstream system was getting hammered simultaneously. Vendors would hit rate limits, time out, and surface rare bugs in their own systems. MongoDB connection pools were exhausted. Everything failed at different moments, which made analysis nearly impossible: the system failed in a different way every time, and that made maintenance very difficult.

2. Optimistic responses are a hidden time bomb
A dangerous pattern in the codebase was code that assumed success. A bulk update function would touch 30 records sequentially, assume all succeeded, and return a 200. If 3 of 30 failed, nobody knew. The frontend showed success. The database disagreed. Downstream systems acted on a state that didn't exist.
At low volume, you catch this in support tickets and fix it by hand. At scale, optimistic responses become a distributed consistency problem. There’s no way to understand what actually happened.
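A minimal sketch of the alternative, in Python: a bulk update that records the outcome of every record and derives its status from what actually happened. The names `BulkResult`, `bulk_update`, and `update_one` are hypothetical, and returning 207 for a partial result is one possible convention, not necessarily what Contracts does.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class BulkResult:
    """Per-record outcomes of a bulk operation (hypothetical shape)."""
    succeeded: list[str] = field(default_factory=list)
    failed: dict[str, str] = field(default_factory=dict)  # record id -> error message

    @property
    def status_code(self) -> int:
        # 200 only when every record succeeded; 207 signals a partial result.
        if not self.failed:
            return 200
        return 207 if self.succeeded else 500


def bulk_update(record_ids: Iterable[str], update_one: Callable[[str], None]) -> BulkResult:
    """Apply update_one to each record and report what really happened."""
    result = BulkResult()
    for record_id in record_ids:
        try:
            update_one(record_id)
        except Exception as exc:
            # Keep going, but never lose track of the failure.
            result.failed[record_id] = str(exc)
        else:
            result.succeeded.append(record_id)
    return result
```

With 30 records of which 3 fail, the caller gets a 207 and the three failing IDs, so the frontend, the database, and downstream systems can agree on the actual state.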
3. External dependencies need reliability wrappers before you need them
Every Contracts operation – hire, transfer, dismissal, update – made calls to external services: time-tracking providers, communication platforms, payment processors. These calls had no retry logic, no circuit breakers, no graceful degradation. Rate limits were aggressive, and the HTTP status codes that came back were not always correct.
At 50 operations a day, this is tolerable. At 500+, you see every failure mode every vendor has, plus the rare bugs in your own system that low volume had been hiding. Scale exposes the long tail — the more operations you run, the more often you hit each rare failure (failure rate × volume).
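The arithmetic behind that can be sketched in a few lines. The per-call success rate of 0.99 reflects the "two nines" mentioned earlier; the figure of 5 external calls per operation and the assumption that failures are independent are illustrative only.

```python
def p_operation_hits_failure(per_call_success: float, calls_per_operation: int) -> float:
    """Probability that at least one external call in an operation fails,
    assuming calls fail independently."""
    return 1.0 - per_call_success ** calls_per_operation


def expected_failed_operations(
    operations: int, per_call_success: float, calls_per_operation: int
) -> float:
    """Expected number of operations touched by a failure: failure rate x volume."""
    return operations * p_operation_hits_failure(per_call_success, calls_per_operation)


# Hypothetical: 5 external calls per operation, each at two nines (0.99).
for daily_ops in (50, 500):
    failures = expected_failed_operations(daily_ops, 0.99, 5)
    print(f"{daily_ops} ops/day -> about {failures:.1f} operations hit a failure")
```

Under these assumptions just under 5% of operations see at least one failed call. That is a couple of incidents a day at 50 operations, and a steady stream at 500.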
4. Engineering toil becomes the product
When failures cascade and correctness can't be guaranteed by the system, engineers become the reliability layer. Manually fixing database state. Correcting vendor API calls. Triaging tickets from Ops who don't know whether their operation succeeded.
At some point, more than 60% of engineering time was going to manual triage and maintenance, not engineering. Meanwhile, the business couldn't stop – new features were still being demanded, timelines hadn't moved. We needed to rebuild a critical system for our new scale while keeping the existing one running.
5. You can't fix what you can't see
The growth had outpaced the technology, and the technology had outpaced any documentation. One engineer on the team carried deep context from the original build; while Contracts was being rewritten, he kept critical business features shipping in parallel so the rest of the company didn't slow down.
The rest was reverse engineering. We had to identify the critical paths, address observability, and reconstruct the tribal knowledge behind the code. There was no structured visibility into which operations were in-flight, which had partially succeeded, or which had failed silently — meaning every triage started from scratch. Before we could fix integrity, we had to be able to measure it.
Other constraints (an in-flight Kafka migration, no Temporal, the frontend API contract frozen) ruled out the obvious options and forced a leaner design.
Paving the way
The new system
We built this in three layers: API → validation → database. Core reads & writes are O(1) on average. Every side effect is async—the hire endpoint returns fast, and downstream work runs in the background through a uniform execution framework with built-in retries, idempotency, and observability.

EventExecution: every operation has one declared intent
Every async handler in the new system inherits from a single base class with exactly two methods.
should_execute() is where idempotency lives. Before anything runs, the handler checks whether it already happened. A retry after a timeout, a duplicate event from a queue, a webhook redelivered — all of them hit should_execute() first and skip cleanly. No double-fired hire emails. No duplicate payment events.
execute() is the business logic. It returns a typed result — ok, fail, or skip — with optional retry scheduling.
The discipline this enforces is the point. Every operation has a single, declared intent. No side effects hidden in the call stack. You can look at any handler and know exactly what it does, when it runs, what it returns, and how it retries.
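As a rough illustration of that shape, here is a minimal sketch. Only the names `EventExecution`, `should_execute()`, `execute()`, and the ok/fail/skip outcomes come from the description above; the `Result` type, the `run()` driver, and the `SendHireEmail` handler are assumptions made for the example.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    OK = "ok"
    FAIL = "fail"
    SKIP = "skip"


@dataclass(frozen=True)
class Result:
    """Typed outcome of a handler (hypothetical shape)."""
    status: Status
    retry_after_seconds: Optional[float] = None  # optional retry scheduling
    detail: str = ""


class EventExecution(ABC):
    @abstractmethod
    def should_execute(self, event: dict) -> bool:
        """Idempotency check: return False if this event's work is already done."""

    @abstractmethod
    def execute(self, event: dict) -> Result:
        """Business logic for exactly one declared intent."""

    def run(self, event: dict) -> Result:
        # Hypothetical driver: the framework, not the handler author, calls this.
        if not self.should_execute(event):
            return Result(Status.SKIP, detail="already processed")
        return self.execute(event)


class SendHireEmail(EventExecution):
    """Example handler; the in-memory set stands in for a durable store."""

    def __init__(self) -> None:
        self.sent: set[str] = set()

    def should_execute(self, event: dict) -> bool:
        return event["contract_id"] not in self.sent

    def execute(self, event: dict) -> Result:
        self.sent.add(event["contract_id"])
        return Result(Status.OK)
```

Running the same event twice through `SendHireEmail().run(...)` yields `OK` the first time and `SKIP` the second, which is the behavior a redelivered webhook or a retried request needs.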
Observability, by default
The typical pattern with observability: write the business logic, bolt on logging later, add metrics after that, and six months later you still can't tell why a specific operation failed last Tuesday.
The framework emits observability by default. Because every handler inherits from the same base class, instrumentation is structural. Every operation reports its own outcome, timing, and retry history automatically, and the framework pulls the identifying context (which job, which contractor, which company) into every log line on its own.
Every record tells you which operation failed, on which attempt, with what payload. The handler author writes none of it.
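One way such structural instrumentation could look, sketched with Python's standard `logging` module. The function `run_instrumented`, the logger name, the record fields, and the `CONTEXT_KEYS` tuple are all hypothetical; the point is only that the wrapper, not the handler, produces the record.

```python
import json
import logging
import time
from typing import Callable

logger = logging.getLogger("event_execution")

# Assumed identifying fields lifted from the event into every log record.
CONTEXT_KEYS = ("job_id", "contractor_id", "company_id")


def run_instrumented(
    name: str, handler: Callable[[dict], str], event: dict, attempt: int = 1
) -> str:
    """Run a handler and emit one structured record describing the attempt."""
    started = time.monotonic()
    outcome, error = "fail", None
    try:
        outcome = handler(event)  # handler returns "ok", "fail", or "skip"
        return outcome
    except Exception as exc:
        error = repr(exc)
        raise
    finally:
        # Runs on success and on failure, so no attempt goes unrecorded.
        record = {
            "operation": name,
            "outcome": outcome,
            "attempt": attempt,
            "duration_ms": round((time.monotonic() - started) * 1000, 3),
            "error": error,
            "payload": event,
            **{key: event.get(key) for key in CONTEXT_KEYS},
        }
        logger.info(json.dumps(record, default=str))
```

Because the record is built in `finally`, a handler that raises still leaves behind its operation name, attempt number, and payload.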
From the first day the new system handled live traffic, we had full visibility. Engineers and Ops could see the state of every contract operation in real time. We built an internal dashboard before the first endpoint went live.
Boring by design
The architecture was deliberately boring. We kept the database schema frozen. We kept the frontend API contract frozen. Changing either would have required coordinated migrations we didn't have time for, and constraining scope to the application layer eliminated entire categories of risk.
Single-responsibility functions. Flat call stacks. Files aimed at 200 lines. Low cyclomatic complexity. Nothing clever. Readability => Maintainability.
How we got there
Analyze first, code second. The first two days we wrote zero product code. We analyzed every primary endpoint to identify common patterns and shared logic, catalogued every critical and non-critical issue, and mapped every external dependency — reasoning through exactly what happened when each one went down. That work defined the architecture: what needed to be rewritten, what could be preserved, and where the real risk lived.
Ship behind a fallback. We didn't delete the old code. We built a route_to_api decorator on each original endpoint that, when a feature flag was enabled per company, per user, or per project, transparently routed to the new implementation. If the new system threw, it fell back to the old one and logged the failure. One flag flip to revert; no deployment required. The fallback is what gave us the confidence to ship at all.
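A minimal sketch of what a decorator like that might look like. Only the name `route_to_api` comes from the text; the flag store, the request shape, and the `hire`/`hire_v2` functions are assumptions, and a real flag check would also cover users and projects.

```python
import functools
import logging
from typing import Callable

logger = logging.getLogger("contracts.routing")

# Hypothetical in-memory flag store, keyed by company only for brevity.
ENABLED_COMPANIES: set[str] = set()


def flag_enabled(request: dict) -> bool:
    return request.get("company_id") in ENABLED_COMPANIES


def route_to_api(new_impl: Callable[[dict], dict]):
    """Route to new_impl when the flag is on; fall back to the legacy handler on error."""
    def decorator(legacy_impl: Callable[[dict], dict]):
        @functools.wraps(legacy_impl)
        def wrapper(request: dict) -> dict:
            if flag_enabled(request):
                try:
                    return new_impl(request)
                except Exception:
                    # Log the failure, then let the legacy path serve the request.
                    logger.exception(
                        "new path failed for %s; falling back", legacy_impl.__name__
                    )
            return legacy_impl(request)
        return wrapper
    return decorator


def hire_v2(request: dict) -> dict:
    return {"status": "accepted", "path": "new"}


@route_to_api(hire_v2)
def hire(request: dict) -> dict:
    return {"status": "accepted", "path": "legacy"}
```

Removing a company from the flag store reverts it to the legacy path with no deploy. One caveat with this pattern in general: falling back after the new path has already produced side effects is only safe when those side effects are idempotent.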
Parity before confidence. Before cutting traffic, we built parity tooling that replayed requests through both paths and compared outputs at three levels: API responses, database state, and vendor call behavior. Does what the system persists match what the API returns? Does what the API returns match what the vendor received? Parity at all three is the only honest definition of correctness for a system this interconnected. No-op dry run capability is underrated.
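The three-level comparison can be sketched as follows. The `Observation` shape and the function names are hypothetical, and the sketch assumes vendor calls were captured in a dry-run mode rather than actually sent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """What one code path produced for a replayed request (hypothetical shape)."""
    api_response: dict
    db_state: dict
    vendor_calls: tuple  # ordered (vendor, action, payload) records from a dry run


def parity_diff(old: Observation, new: Observation) -> dict[str, bool]:
    """Report, level by level, whether the old and new paths agree."""
    return {
        "api_response": old.api_response == new.api_response,
        "db_state": old.db_state == new.db_state,
        "vendor_calls": old.vendor_calls == new.vendor_calls,
    }


def at_parity(old: Observation, new: Observation) -> bool:
    """True only when all three levels match."""
    return all(parity_diff(old, new).values())
```

Keeping the per-level result, rather than a single boolean, shows at a glance whether a mismatch is in what was returned, what was persisted, or what a vendor would have received.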
The result
Over the rewrite window, we built the foundation and ported the most critical, highest-volume endpoints onto it. Together they carry the bulk of Contracts' traffic – the paths that were causing failures at scale.
EventExecution was as much a template as a framework. Once the pattern was in place, the rest of the system had a clear path forward: every async operation declares its intent, runs idempotently, retries on failure, and emits its own observability – same shape every time.
The boundary was the point. Inside the scope of the rewrite – the highest-volume, highest-risk paths – we held the bar at bug-free, and hit it. Outside that operability boundary, the known issues stayed known issues, on a list we'd come back to.
Scale: more than 10,000x
The old system could handle 20 to 50 concurrent operations, each completing on the order of 100 seconds. The new system processes async operations that return in milliseconds — the heavy work happens in the background, reliably, with automatic instrumentation and observability. Measured in requests handled per second, the capability improvement exceeds 10,000x.
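For a sense of the baseline, the old system's numbers can be turned into throughput with Little's law (steady-state throughput equals concurrency divided by latency). This is a back-of-the-envelope reading of the figures above, not a measurement, and the implied new-system range is simply the baseline multiplied by 10,000.

```python
def throughput_ops_per_second(concurrency: int, seconds_per_op: float) -> float:
    """Little's law at steady state: throughput = concurrency / latency."""
    return concurrency / seconds_per_op


# Old system: 20 to 50 concurrent operations at roughly 100 seconds each.
old_low = throughput_ops_per_second(20, 100)
old_high = throughput_ops_per_second(50, 100)

# What a 10,000x improvement over that baseline would correspond to.
implied_low = old_low * 10_000
implied_high = old_high * 10_000

print(f"old baseline: {old_low:.1f} to {old_high:.1f} ops/s")
print(f"10,000x that: {implied_low:,.0f} to {implied_high:,.0f} ops/s")
```

The old design tops out at well under one operation per second, which is why moving the heavy work off the request path changes the picture so dramatically.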
Volume grew dramatically. The system absorbed it without incident.
Reliability: more than 75x in the first month and more so since then
In the first month of full event tracking, the system processed millions of async operations with a failure rate under 0.4%, with most days recording zero failures. That's better than the uptime guarantees most of our external vendors provide. Measured as the old failure rate divided by the new one, the improvement is more than 75x.
Ops stopped refreshing and retrying. The partial success problem went away. The unknown-state problem, which caused most of the KTLO burden, was gone.
Engineering time reclaimed
At peak KTLO, more than 60% of engineering time was going to manual reliability work for a broken system. That number is now close to zero.
How the system is doing today
| Metric | Before | After |
|---|---|---|
| System capability | 20–50 concurrent ops, ~100 s each | >10,000x improvement (extrapolated from real load and test load) |
| Contract volume growth | Baseline | >9x in under a month (Today: ~20x) |
| Operations volume growth | Baseline | >25x within 2 months |
| Reliability | Cascading failures, unknown state | >75x improvement (<0.4% failure rate) |
| Engineering KTLO | >60% of eng time | ~0% |
| Response time (new ops) | Minutes (synchronous) | Order of 100 ms (async) |
| Peak operation load | Baseline | >200x peak-to-peak operational load |
What’s next
We’re building a world-class technology and engineering org that can scale with Mercor's growth.
Reliability and integrity
Reliability at 75x improvement is a milestone, not a destination. The next phase is pushing toward the kind of system integrity that is provable. Invariant definitions and assertions at service boundaries. Auditability built into the system from the start. Every state transition is traceable, every outcome explainable.
Product as platform
The EventExecution model already makes it straightforward to add new consumers for new operations: a new side effect is its own independent class, cleanly separated from everything else. The next step is taking that further: turning Contracts from a service into a platform that product teams can build on, without pulling in an engineer for every change. Service-level boundaries defined clearly enough that new product surfaces don't create new reliability risks. Extensibility that grows with the business instead of creating new complexity.
Engineering that enables, not constrains
The goal is zero KTLO: a system so stable and well-instrumented that engineers spend their time building, not maintaining. The kind of codebase where adding a feature feels like writing a paragraph. Contracts is upstream of everything: it is integral to Payments and is the source of truth for who to pay, how much, and when. Today, Mercor processes more than $3M a day in contractor payments, and none of it moves correctly without Contracts working. We're building the foundation to handle whatever scale comes next.
Come build with us
Mercor is growing faster than most companies ever will, and the systems that support that growth have to be built by people who care about getting it right. If you want real scope, real problems, and the chance to redefine what's possible – we'd like to talk: mercor.com/careers
