Mercor is building the definitive network for human knowledge and capabilities.


Organizing human intelligence to power the AI economy

Mercor engineering rebuilt a critical production system under 10x volume growth, in one week, with zero regressions. Join our engineering team.

When volume grew 10x in a month and we had one week to fix it

A framework for measuring, improving, and safely deploying enterprise agents using verifier-backed judgments over agent trajectories and outputs.

Agent Eval Systems

How 2,000 expert tasks improved tool use and professional reasoning

Generalization results from training on the APEX-Agents dev set

Electric. That was the feeling I had when I walked into the Mercor office. The vibrant energy was reminiscent of my first day at Tesla, where I spent the majority of my career. The office was buzzing: a group huddled on the couches by the elevator executing their near-term roadmap, a team whiteboarding product features behind the glass panes of the conference room, people moving with urgency between rooms for the next customer call. Mercor possesses the same early-day Tesla energy, and I knew immediately this place was special. I’ve spent my career chasing that feeling: the pull to help build a generational company just as it defines a new industry.
I entered the workforce during the Great Recession: interning at Merrill Lynch on Wall Street the summer before it was fire-sold to Bank of America, and joining Citigroup in investment banking full-time after graduation. The job market was grim, so I felt fortunate to land a job in 2009 — especially in finance. The Great Recession shaped my worldview and defined my career trajectory.
During the recession, the Obama administration passed the American Recovery and Reinvestment Act: a stimulus package that poured investment into the emerging cleantech industry. On the energy team at Citi, I advised dozens of fast-growing companies in the space. Wanting to be closer to the technology, I took my first operating role at First Solar, which is today the largest American solar module manufacturer. It was my first taste of the exhilaration of being hands-on in the wild-west days of a new industry.
A couple of years later, I left for an MBA at Wharton. A new era had emerged: smartphones became dominant, social media was omnipresent, and cloud computing transformed enterprises. The Big Tech companies of today were still relatively young and rapidly evolving. I interned at Amazon during my MBA, but the company was already established, and the energy, pace, and ambition felt off.
I entered the second year with my eyes set on Tesla. Elon had radical ambition, and I was an early believer that Tesla would be a transformative, generational company. I spent an entire year hustling to be hired, and finally received an offer months after graduation to join as the first finance hire to support a new idea — a bet that would become Tesla Energy.
Although a public company at the time, Tesla was hardly mature — and Elon’s unconventional leadership style greatly influenced me. I worked in sales to assist in deliveries, and on the manufacturing floor to facilitate material handling. We were encouraged to be hands-on — and that firsthand knowledge enabled fast strategic execution to scale the business while defining an industry. We built Tesla Energy — and the commercial battery storage industry — from scratch; today it’s a $13B revenue business. With the itch to build again, I joined an early stage AI robotics company, and helped close their Series C and land a strategic partnership with FedEx.
The throughline in my career has been a desire to partner with generational founders who have the ambition to build transformative companies that define new industries. I felt the spark of inspiration immediately upon meeting Brendan and Adarsh.
A few things made Mercor an obvious choice for me. The first is the two of them. I’m inspired by founders seeking to achieve the impossible. I love the challenge of identifying and executing the path from where we are today to where we aspire to be. Brendan and Adarsh see the future clearly the way Elon does, and they propel Mercor forward to shape the world they see. Their conviction and optimism — and how Mercor will drive extraordinary change in society — is infectious: it sparks ambition in everyone around them. I’ll always remember the glee Brendan showed when he showed me Mercor’s product for the first time — proud of what the company had built, grateful for the talent it had drawn, and electric about what lies ahead.
The second is the people. Mercor reminds me of Tesla: a collective culture — each individual operates with exceptional autonomy, yet shares a deep sense of humility and interdependence in pursuit of a common goal. I’m continuously inspired by my colleagues, each driven by insatiable curiosity, optimism for what we can achieve, and agency to effect change. We’ve hired extraordinary people.
The third is the mission. Mercor operates at the unique intersection of humans and AI, and has built the human intelligence infrastructure that powers the most transformative technology in our lifetime — and, I believe, for the better. I’ve spent an inordinate amount of time in my nearly two-decade-long career on manual, tedious, and repetitive work. We are creating the largest new job category, and advancing the frontier to unburden each profession of repetitive tasks and free people to do work that matters.
Mercor is creating a new industry that will reshape the economy, and is growing faster than any company in history. Finance is at the center of building the foundation for scale: data infrastructure, systems implementation, operational and financial visibility — and we are tasked to lay that groundwork on a compressed timeline. What excites me most is that what we’re building will change finance itself — a function that has historically been intensely manual. Legacy companies run armies of people managing general-ledger entries, expense approvals, financial analysis, and business-intelligence dashboards. At Mercor, we are designing the future of finance: a finance function built in an agentic world. Finance here will build the evaluations and train the agents that do that work, freeing us to spend our time drawing business insights and driving execution.
We will be defining what finance will be. That’s the work, and it’s why I’m hiring. If that’s the thing you want to be doing, come build it with us: <a href="http://www.mercor.com/careers" target="_blank" rel="noopener noreferrer" class="text-indigo-600">www.mercor.com/careers</a>.

Mercor's CFO spent eight years at Tesla. He joined for the same early-day energy, the founders, and the chance to redefine finance.

Why I Joined Mercor – Kevin Shiau


AI startup Mercor now valued at $10 billion with new $350 million funding round

AI's next job? Recruiting people to train more AI

AI is learning to do the jobs of doctors, lawyers, and consultants

24 AI startups to watch in 2026

Mercor in the news

In late March 2026, Mercor was affected by a supply chain cyber attack involving LiteLLM, a commonly used open-source tool. The attack impacted a significant number of companies globally. We quickly discovered the activity, secured our systems, and conducted a robust and comprehensive investigation assisted by leading third-party forensics experts. That investigation is now complete. We want to share what happened, what we found, and what we’ve done to protect our customers, experts, and employees.
<h2>What happened</h2>
In March, a malicious actor published compromised versions of LiteLLM designed to exfiltrate credentials from any system that installed them. Our security team determined that Mercor was affected during the relevant timeframe and immediately took action to contain unauthorized activity.
We worked with leading industry experts including Google’s Mandiant, Latacora, industry peers, and law enforcement to investigate and take appropriate action. We worked to get answers as quickly as possible while also prioritizing accurate information.
<h2>What the investigation found</h2>
<ul><li>Experts: Of our nearly five million experts, only a very limited subset had sensitive information affected. There is no evidence that any of this data has been used fraudulently. We are in the process of notifying these individuals directly on June 25 and June 26 from mercor@notifications.cyberscout.com. The notifications will include details about the type of information affected along with an offer of TransUnion identity protection services.</li><li>Customers: Because of how our work is structured, many of our customers operate on their own platforms rather than ours, meaning the impact to customer information was very limited. We were in direct and regular contact with our customers throughout the investigation and have shared findings specific to each of them. We are grateful for their cooperation as we worked through this situation and are pleased to report that all frontier labs have increased their work with us over the last few months.</li><li>Employees: No employee data was affected.</li></ul>
<h2>What we’ve done</h2>
<ul><li>We prioritized timely and direct communication with our customers, experts, and employees as we had information to share.</li><li>We’ve taken steps to further invest in and strengthen our security posture, including:<ul><li>Auditing all third-party dependencies</li><li>Regularly rotating all credentials and access keys across our cloud platforms, GitHub, and SaaS systems</li><li>Deploying restrictive cloud security policies and tightened network controls</li><li>Ongoing open-box penetration testing by independent security researchers</li><li>Implementing 24/7 managed detection and response</li></ul></li></ul>
<h2>Looking forward</h2>
Mercor has taken many steps to further strengthen our systems, including implementing more safeguards, expanding security protections, and enhancing our monitoring processes. We will continue to invest in our team and systems to ensure we are a trusted partner to experts and customers. We appreciate the patience, support, and trust of our community. 
We believe as AI gets more powerful, human expertise gets more valuable. We&#39;re building the company that makes that true and creates real economic value for people. We look forward to continuing that focus with them and building a generational company focused on truly consequential work.

In March 2026, a supply chain cyberattack affected Mercor. Here's what happened, what our investigation found, and our response.

Mercor Security Incident: Our Findings and Response

Update on Mercor security incident

Popular Medical AI benchmarks, such as MedXpertQA-MM, are built from clinical vignettes: self-contained patient cases that mirror what a physician encounters when someone walks into the ER. They rarely include a patient&#39;s full longitudinal record. Instead, they present a snapshot of the patient’s health including the presenting complaint, vital signs, lab values, and imaging.
As Dr. Eric Topol, cardiologist and founder of the Scripps Research Translational Institute, noted when discussing the use of AI in medical diagnosis: “It’s certainly possible to have a cardiogram with no real background on the patient… people could be brought in from the street. All sorts of reasons why you wouldn’t have any baseline data.”
Medicine constantly demands that doctors make clinical decisions with incomplete information. Our core question was whether models actually reason from the evidence in front of them, or simply follow where the story leads, without genuinely engaging with the images provided.
We tested frontier models on Mercor&#39;s MedXpertQA-MM-Pro, a dataset of board-level medical questions spanning 17 specialties such as cardiology, radiology, and pathology, each paired with a clinical image. We ran each of six models five times per case. The models were often led astray by the written case description, rather than reading what the image actually showed.
<h2>When the story overrides the evidence</h2>
Consider this case:
A 56-year-old man presents to a local free-standing emergency department in Miami complaining of epigastric discomfort and “acid.” The patient reports the symptoms started today after eating some spicy ceviche for lunch… He also has some mild nausea… His presenting vital signs were normal, he is well-appearing… Labs were obtained and were normal; a troponin-I was within normal limits.
What, if any, findings are there on this ECG of note to the treating physician? What are the next best steps in management?
<img src="https://cdn.sanity.io/images/h6s14f4z/production/499d6b960aa991ede43b0f3ebecc726d96d523c2-1530x754.png" alt="Patient ECG " style="width: 100%; border-radius: 10px;" />
The ECG was normal, with no acute changes. A physician reading it would recognize a low-risk presentation, recommend symptomatic care for acid reflux, and discharge the patient. Yet, every frontier model we tested failed in the same way. All of them hallucinated cardiac pathology that wasn’t present in the ECG.
<ul><li>Claude Opus 4.8 reported ST elevation on every run and hyperacute T-waves in 4 of 5 runs, framed as an overt anterior STEMI in some runs and as a &quot;STEMI-equivalent&quot; invoking the named &quot;de Winter&quot; T-wave pattern in 2 of 5 runs to argue for acute proximal LAD occlusion.</li><li>Claude Opus 4.7 reported ST elevation, reciprocal ST depression, and hyperacute T-waves on every run, diagnosing an acute STEMI and recommending aspirin, anticoagulation, cath-lab activation, and emergent transfer.</li><li>Gemini 3.1 Pro Preview invented ST elevation in the precordial leads on every run and added reciprocal ST depression in 4 of 5 runs, diagnosing an acute STEMI and urging immediate cath-lab activation, aspirin, anticoagulation, and PCI-capable transfer.</li><li>Gemini 3.5 Flash invented ST elevation with reciprocal ST depression on every run, diagnosing an acute STEMI and recommending aspirin, anticoagulation, cath-lab activation, and emergent transfer.</li><li>GPT-5.5 invoked the named &quot;de Winter T-wave pattern&quot; on 4 of 5 runs, interpreting the normal ECG as a STEMI-equivalent suggesting acute proximal LAD occlusion, and on every run urged immediate cath-lab activation, aspirin, anticoagulation, and emergent PCI-capable transfer.</li><li>Qwen 3.6 35B-A3B reported ST elevation across the precordial leads on every run, diagnosing an acute STEMI (in 2 of 5 runs an anterolateral STEMI, in 1 run invoking the &quot;tombstone&quot; morphology) with reciprocal ST depression in 3 of 5 runs, and recommending aspirin, anticoagulation, and cath-lab activation.</li></ul>
Each model converted a benign acid reflux presentation into a cardiac emergency. None of the recommended interventions were warranted.
We ran each model five times on the same vignette. Not once did any of them arrive at the correct answer. The hallucinations weren’t consistent, but they all pointed in the same direction: whatever the story had primed them to find.
Epigastric pain is a known “cardiac trap,” a presentation that can mimic heart disease. The models appear to have been anchored by that framing from the opening sentence, then read the ECG to confirm what the story had already told them to expect. The correct answer was in the image, and none of the models read it.
<h2>Different specialty, same blind spot</h2>
The same dynamic appears across specialties. In dermatology:
A 12-year-old boy is brought in by his mother, complaining of a rash. The patient is on a wrestling team and they recently took a trip to a rural area for a wrestling competition and hiking. She notes he has been tired and his muscles have been sore since getting back… Several wrestlers have gotten ringworm, so she is concerned.
What is the best treatment option for this patient?
<img src="https://cdn.sanity.io/images/h6s14f4z/production/e14dc25a9b01b61c60aac874b369290c98c7c463-1498x958.png" alt="Patient rash" style="width: 100%; border-radius: 10px;" />
The image showed an angry, red rash without the bull&#39;s-eye pattern characteristic of Lyme disease, consistent with cellulitis from a bug bite. The correct treatment is cephalexin.
However, every model diagnosed erythema migrans (the hallmark rash of Lyme disease) and recommended doxycycline. The vignette had done its work: rural area, hiking, fatigue, a wrestling trip. None of the models read the image on its own terms. The text had already decided the diagnosis.
<ul><li>Claude Opus 4.7 picked doxycycline on every run, raising cellulitis as a differential in 2 of 5 runs but never selecting it.</li><li>Claude Opus 4.8 picked doxycycline on every run, raised cellulitis as a differential in 4 of 5 runs, and mentioned cephalexin as an option in every run, yet still picked doxycycline. One of its runs was the only run across all 30 to explicitly note the absent bull&#39;s-eye pattern, and even that run still picked doxycycline.</li><li>Gemini 3.1 Pro Preview picked doxycycline on every run, raised both cellulitis and cephalexin as differentials in every run, and considered topical antifungal therapy for tinea in 2 of 5, but still landed on Lyme.</li><li>Gemini 3.5 Flash picked doxycycline on every run, hedged toward a dermatophyte (topical terbinafine) in 3 of 5 runs, and never landed on cellulitis or cephalexin.</li><li>GPT-5.5 picked doxycycline on every run and was the only model that never raised cellulitis or cephalexin even as a differential, having locked entirely onto Lyme from the wrestling and rural-area framing.</li><li>Qwen 3.6 35B-A3B picked doxycycline on every run, raised cellulitis as a differential in 4 of 5 runs, and mentioned cephalexin in every run, yet still picked doxycycline.</li></ul>
When we re-ran the same question with the image alone and removed the vignette, the same models selected the correct answer in 21 of 30 runs. Four of six models reversed completely, identifying the central punctum of a bug bite and the absent bull&#39;s-eye pattern, and recommending cephalexin for cellulitis. The image was readable all along, but the story is what decided the model’s diagnosis.
<h2>The anatomy of the failure</h2>
In both cases, the failure isn’t primarily a perceptual one. The models can process images. The behavior is more specific: the model commits to a diagnosis based on narrative context and then reads the image to confirm it. The text drives the conclusion and the image gets pulled in afterward to support it. Dr. Topol identified this directly: “These frontier models are not really geared up to do [medical image interpretation]… they look at text and not the images because they’re not so good at it.” A physician looking at that ECG would read the tracing first, then reconcile it with the clinical story. These models appear to be doing the reverse, and when the image contradicts the story, the story wins.
<h2>Beyond benchmarks</h2>
Clinical AI benchmarks have long been optimized for what&#39;s easy to measure: accuracy on held-out test sets, area under the curve, and sensitivity and specificity under controlled conditions. However, they&#39;ve underweighted the cases that reflect the actual conditions of clinical judgment, such as incomplete histories, visual traps, and conflicting signals.
The vignettes in MedXpertQA-MM-Pro aren&#39;t contrived edge cases. A patient with epigastric pain and a normal ECG is a routine presentation. Good clinicians get it right because they read the image, not just the story. That makes them a harder and more meaningful test than typical benchmark questions. As Dr. Topol put it: “If there’s a model that picks up something like this with a de minimis vignette, that’s great. That’s a sign of superiority. And if you see that consistently across many different images, that would be great.”
Working in this space? The cases above are drawn from Mercor&#39;s medical multimodal datasets and are harder versions of benchmarks like HealthBench, MedAgentBench, MultiMedBench, MedXpertQA, and MedXpertQA-MM, designed to find where your model breaks. Or we&#39;ll build a custom dataset around your exact failure modes. <a href="https://www.mercor.com/partner/" target="_blank" rel="noopener noreferrer" class="text-indigo-600">Reach out to us today</a>.

Saumya Chauhan

Member of Technical Staff

Beepul Bharti

Siva Vallabhani

We tested six frontier models on clinical cases. When the patient story contradicts the image, the story wins. Why benchmark accuracy misses the failure.

Frontier AI Models Misread Medical Images | Mercor

The Image Was Normal. AI Saw a Heart Attack.

Mercor is on the <a href="https://www.forbes.com/lists/ai50/" target="_blank" rel="noopener noreferrer" class="text-indigo-600">Forbes AI 50</a> for the second year in a row. It is an annual list of the leading private AI companies, produced with Sequoia and Meritech. We went from $0 to $500M run rate in 17 months, crossed $1B in annualized revenue run rate earlier this year, and we pay out over $2 million every day to more than 30,000 weekly active contractors across the world.
Growth like this is a privilege and a stress test. Moving from $200K to over $14M a week on the platform in a short timeframe exposed every place where our early-stage processes, data models, and controls hadn&#39;t kept up with our scale. Some of those gaps we closed on our own terms. Some of them we closed after they bit us. We’ll talk about both kinds of lessons.
The job from here is to build through the next 50x. This is more than just an engineering challenge; it is fundamentally a human one: there are not many engineers in the world who have built systems that move billions safely at this kind of growth rate, and we need more of them. What follows is the record of what we&#39;ve built and where we&#39;re heading. 

If what follows sounds like your kind of work, we&#39;d like to talk.
View the role at <a href="https://www.mercor.com/careers" target="_blank" rel="noopener noreferrer" class="text-indigo-600">mercor.com/careers</a> 
<h1>What Mercor&#39;s Payments System actually does</h1>
Mercor is a two-sided marketplace at the forefront of AI; we bill leading AI labs and enterprise clients for access to the best global talent, and we pay independent contractors everywhere for their expertise and work.
Our payments infrastructure has to do two things, simultaneously: invoice clients across multiple, complex billing structures and contractual terms while paying out thousands of experts across dozens of countries and currencies. We have to get this right, all the time, every time. Each of those transactions touches a chain of systems that all have to agree: contracts, work tracking, billing logic, payout orchestration, ledgering, and reconciliation. When any layer in that chain disagrees, someone doesn&#39;t get paid. At $2M a day, it&#39;s a very real existential and reputational risk to the entire company.
It&#39;s worth pausing on what &quot;someone doesn&#39;t get paid&quot; actually means. These aren&#39;t anonymous transactions. They&#39;re a software engineer in Atlanta who depends on Wednesday&#39;s payout to cover rent. A contractor in São Paulo who has no visibility into why their earnings haven&#39;t cleared. If we fail a payment, we&#39;re deeply affecting someone&#39;s livelihood. That weight is something the payments team at Mercor carries deliberately, and it&#39;s a big part of why we treat reliability, auditability, and transparency as first-class obligations rather than as things we&#39;ll get to later.
The team building this sits at an unusual intersection in the company: the breakneck pace, intensity, and ownership culture of a hyper-growth AI startup, with the transaction volume, &quot;always correct&quot; standard, and compliance responsibilities of a scaled fintech. Those two cultures don&#39;t naturally coexist, and building a team that holds both is part of the work. Our payments and financial infrastructure group brings experience from Scale AI, Google, Airbnb, Robinhood, Two Sigma, DoorDash, and Amazon: people who have built and scaled financial systems through exponential growth before. They are people who joined Mercor precisely because the problems here are hard in interesting ways, and because getting them right matters.
<h1>Five things we learned building payments at speed</h1>
<h2>1. A ledger is a first-class primitive, not a side effect</h2>
When you&#39;re moving fast, it&#39;s tempting to treat your ledger as a byproduct of your service: a table that simply records what happened. That framing creates a ceiling you don&#39;t notice until you&#39;re already past it.
A proper ledger must be a resilient, independently searchable source of truth, natively integrated across your end-to-end operations. When it isn’t, and billing logic, payout state, and revenue recognition are all entangled in a single mutable data structure, the system works fine at low volume and starts to actively resist you at scale. Changes become high-risk. Debugging requires holding the whole system in your head. Finance can&#39;t close cleanly at period end.
<img src="https://cdn.sanity.io/images/h6s14f4z/production/f1c0af18eb0ea016ec673505d29ccd17c111feb5-1600x900.png" alt="payments-1" style="width: 100%; border-radius: 10px;" />
The architectural shift we&#39;re making at Mercor is a move from mutable, monolithic payment records to an immutable, append-only ledger: every financial event is written once and never updated, and a canonical state machine (accrued → approved → payable → paid → reversed) makes the current state of any payment readable and reconcilable at any point in time. This is the foundation for accurate revenue recognition, clean ERP integration, and the kind of auditability that a Series C company with our transaction volume requires.
<img src="https://cdn.sanity.io/images/h6s14f4z/production/7d1dd284844deef6792c2731b04cb4efc50b7162-1600x1198.png" alt="payments-2" style="width: 100%; border-radius: 10px;" />
The ledger also connects downstream to every system that needs to read financial state: revenue reporting, data analytics, treasury forecasting, expert earnings visibility, and the audit trail. Building it right means all of those systems get reliable data without each team maintaining their own bespoke view of financial truth.
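To make the append-only idea concrete, here is a minimal Python sketch, not Mercor's implementation. The state names come from the text above; everything else (`Ledger`, `LedgerEvent`, `ALLOWED_NEXT`, the strictly linear transitions, and reversal being reachable only from `paid`) is an assumption made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class State(Enum):
    ACCRUED = "accrued"
    APPROVED = "approved"
    PAYABLE = "payable"
    PAID = "paid"
    REVERSED = "reversed"


# Assumed transition table: a strictly linear chain, with reversal only after PAID.
ALLOWED_NEXT = {
    None: {State.ACCRUED},
    State.ACCRUED: {State.APPROVED},
    State.APPROVED: {State.PAYABLE},
    State.PAYABLE: {State.PAID},
    State.PAID: {State.REVERSED},
    State.REVERSED: set(),
}


@dataclass(frozen=True)  # frozen: an event cannot be mutated after it is written
class LedgerEvent:
    payment_id: str
    state: State
    amount_cents: int
    recorded_at: datetime


class Ledger:
    """Hypothetical in-memory ledger: events are only ever appended."""

    def __init__(self) -> None:
        self._events: list[LedgerEvent] = []

    def state_of(self, payment_id: str, as_of: Optional[datetime] = None) -> Optional[State]:
        # Replay the events for this payment; the last one at or before `as_of` wins.
        state = None
        for event in self._events:
            if event.payment_id != payment_id:
                continue
            if as_of is not None and event.recorded_at > as_of:
                continue
            state = event.state
        return state

    def append(self, payment_id: str, state: State, amount_cents: int,
               recorded_at: Optional[datetime] = None) -> LedgerEvent:
        current = self.state_of(payment_id)
        if state not in ALLOWED_NEXT[current]:
            raise ValueError(f"illegal transition {current} -> {state} for {payment_id}")
        event = LedgerEvent(payment_id, state, amount_cents,
                            recorded_at or datetime.now(timezone.utc))
        self._events.append(event)
        return event
```

Because state is derived by replaying events rather than stored in a mutable row, asking "what was the state of this payment last Tuesday?" is the same query as asking for the current state, just with an `as_of` timestamp.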
<h2>2. Manual processes are risk surface, not just tech debt</h2>
Early-stage teams make a rational bet when they operate something manually: it&#39;s faster to ship, and automation can come later. That bet has a time limit, and the clock moves way faster than you expect.
What makes it insidious is what happens in the intervening period. Engineers build on top of the manual process, designing around it rather than replacing it. Operations start treating it as normal. Workarounds accumulate. Runbooks get longer. New engineers onboard and inherit the process as if it were intentional.
One of the most instructive examples is payout dispatch. Weekly payout dispatch started as a reasonable script that one engineer could run in a few minutes. Over time it grew into a multi-hour orchestration involving manual SQL validation, cross-referencing forecasting numbers across multiple payment providers, treasury balance confirmation, and a group sign-off ritual before anything moved. At millions of dollars a week, the process was absorbing 10+ engineering hours per week, time that could be spent scaling and automating systems. Correctness depended entirely on the specific person running it and the conditions under which they ran it. Nobody intentionally designed it that way. It just happened.
The solution isn&#39;t just to automate for efficiency&#39;s sake, but to make the system&#39;s behavior deterministic. This is the goal of the fully automated payout dispatch system we&#39;re shipping: zero human operational oversight required for the weekly cycle, knowing exactly who, when, and how much to pay, with intelligent monitoring and alerting that surfaces anomalies proactively rather than relying on human spot-checks.
Every manual step is a potential failure mode, and the longer you wait, the more it costs. Every month a manual process runs in production, another layer of complexity grows on top of it. The time to replace it is always earlier than it feels. The sooner you treat manual processes as temporary scaffolding rather than permanent infrastructure, the sooner you can build the safety systems that let you move faster, not slower.
<h2>3. Automation requires canonical data</h2>
A critical, yet often underestimated, principle in payments engineering is this: automation doesn&#39;t lower risk if the underlying data is unreliable; it merely amplifies existing risks. A manual process running on unreliable data might generate a few errors weekly, but an automated one will produce thousands.
At Mercor, task-based payments proceed through a complex series of systems: contracts, work tracking, ingestion, pricing, ledgering, and invoicing. This process only functions smoothly if the output from each stage is understood by the next. Our core challenge stemmed from the fact that upstream data sources were never designed to be financially authoritative. Task statuses were vague, transitions were non-deterministic, and some crucial data resided completely off-platform.
This data deficiency can manifest in two ways:
<ul><li>Inaccurate Bonus Payouts: When a contractor completes a task, the project lead records completion in a manual, off-platform spreadsheet and must manually issue the corresponding bonus. This can lead to forgotten bonuses or mistyped amounts. Crucially, the source of truth for a contractor&#39;s earnings becomes a hand-edited cell in a spreadsheet.</li><li>Inefficient Invoicing and Reporting: Before every billing cycle, our accounting team spends significant time manually verifying payment amounts by cross-referencing disparate upstream tables. This data might be scattered across different tools and data sources. The necessary solution is to establish a single, canonical definition of &quot;task complete&quot; that every tool adheres to and every downstream system trusts.</li></ul>
We treat upstream data architecture as a first-class citizen of our payments infrastructure, rather than agnostic input. This requires establishing and enforcing strict, canonical work completion signals across the platform: deterministic triggers that every tool must adhere to and every downstream system can trust without manual validation. From there, the end-to-end chain of work tracking, ingestion, pricing, ledgering, and invoicing runs seamlessly and automatically.
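One way to picture a canonical completion signal is as a single validated event type that every tool must emit and that downstream systems can total without manual checks. The sketch below is illustrative only: `TaskCompleted`, `CompletionLog`, and all field names are hypothetical and do not come from Mercor's platform.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TaskCompleted:
    """Hypothetical canonical 'task complete' event; field names are assumptions."""
    task_id: str
    contractor_id: str
    rate_cents: int
    completed_at: datetime

    def __post_init__(self) -> None:
        # Reject anything that could not be paid on without a human checking it.
        if not self.task_id or not self.contractor_id:
            raise ValueError("task_id and contractor_id are required")
        if self.rate_cents <= 0:
            raise ValueError("rate_cents must be positive")
        if self.completed_at.tzinfo is None:
            raise ValueError("completed_at must be timezone-aware")


class CompletionLog:
    """Holds at most one completion per task, so replays cannot double-pay."""

    def __init__(self) -> None:
        self._by_task: dict[str, TaskCompleted] = {}

    def record(self, event: TaskCompleted) -> bool:
        existing = self._by_task.get(event.task_id)
        if existing is None:
            self._by_task[event.task_id] = event
            return True
        if existing != event:
            # Two tools disagree about the same task: surface it, never guess.
            raise ValueError(f"conflicting completion for task {event.task_id}")
        return False  # exact duplicate delivery, safely ignored

    def payable_cents(self, contractor_id: str) -> int:
        return sum(e.rate_cents for e in self._by_task.values()
                   if e.contractor_id == contractor_id)
```

The design choice worth noting is that a conflicting record raises instead of overwriting: under this assumption, disagreement between upstream tools becomes a visible error rather than a silently wrong payout.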
<h2>4. Observability investment compounds early</h2>
One of the best decisions the team made early was to invest in continuous monitoring, alerting, and payments tracking.
Real-time payout status, unified earnings views, and forward payment forecasting aren&#39;t the most glamorous infrastructure work (and most of the company doesn&#39;t know it&#39;s there unless it breaks), but they have a measurable impact on the thing that matters most in a payments system: trust. Experts depend on Mercor to provide reliable cashflow and want to make sure they get paid for their hard work. We invested early in automated forecasting and reliable payment status data, and surfaced this to our experts via a redesigned Earnings dashboard. Today, Earnings is the third most visited page on the Mercor Work Platform. Since launching these features, payout-related support tickets have dropped significantly.
On the operational side, we built centralized monitoring across our payment providers, tracking payout success rates, provider-specific failure modes, treasury balances, and reconciliation gaps, with structured alerting that catches anomalies before they become incidents. When something goes wrong, the team can trace it in minutes rather than hours.
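As a rough illustration of the kind of check such monitoring might run, the following sketch computes per-provider payout success rates and flags providers that fall below a floor. The function names, the 99% default, and the minimum-attempts cutoff are all assumptions chosen for the example, not Mercor's actual thresholds.

```python
from collections import Counter
from typing import Iterable


def success_rates(payouts: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Map each provider to its fraction of successful payouts.

    `payouts` is an iterable of (provider_name, succeeded) pairs.
    """
    attempts: Counter = Counter()
    successes: Counter = Counter()
    for provider, succeeded in payouts:
        attempts[provider] += 1
        if succeeded:
            successes[provider] += 1
    return {p: successes[p] / attempts[p] for p in attempts}


def providers_to_alert(payouts: Iterable[tuple[str, bool]],
                       min_rate: float = 0.99,
                       min_attempts: int = 20) -> list[str]:
    """Return providers whose success rate is below `min_rate`.

    Providers with fewer than `min_attempts` payouts are skipped so that a
    single failure on a rarely used provider does not page anyone.
    """
    payouts = list(payouts)  # materialize so we can iterate twice
    attempts = Counter(provider for provider, _ in payouts)
    rates = success_rates(payouts)
    return sorted(p for p, rate in rates.items()
                  if attempts[p] >= min_attempts and rate < min_rate)
```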
Building observability into the system early and treating it as core infrastructure rather than a nice-to-have is what gave the team the feedback loops needed to move faster as complexity grew.
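As a rough illustration of the structured alerting described above, here is a minimal sketch that computes a per-provider payout success rate and flags providers that fall below a floor. The function name, the event shape, and the thresholds (`min_success_rate`, `min_volume`) are assumptions made for this example, not a description of Mercor's monitoring stack.

```python
from collections import Counter


def payout_alerts(events, min_success_rate=0.99, min_volume=20):
    """Return (provider, success_rate) pairs that breach the floor.

    `events` is an iterable of (provider_name, succeeded) tuples.
    Both thresholds are illustrative assumptions.
    """
    totals, failures = Counter(), Counter()
    for provider, succeeded in events:
        totals[provider] += 1
        if not succeeded:
            failures[provider] += 1

    alerts = []
    for provider in sorted(totals):
        volume = totals[provider]
        success_rate = 1 - failures[provider] / volume
        # Skip low-volume providers so one failure does not page anyone.
        if volume >= min_volume and success_rate < min_success_rate:
            alerts.append((provider, success_rate))
    return alerts


# Hypothetical data: provider_b fails 5 of 100 payouts.
events = [("provider_a", True)] * 100
events += [("provider_b", True)] * 95 + [("provider_b", False)] * 5
for provider, rate in payout_alerts(events):
    print(f"ALERT {provider}: success rate {rate:.1%}")
```

Keeping the check per provider is what makes provider-specific failure modes visible; a single platform-wide rate would average them away.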
<h2>5. Controls aren&#39;t a compliance checkbox. They&#39;re load-bearing infrastructure.</h2>
At Mercor, we&#39;re building toward a world where many of our operations can run autonomously, where agents can operate projects, dispatch payouts, generate invoices, and close the books.
This comes at a tradeoff. Manual processes have one advantage: a human in the loop. The operator entering a discretionary bonus catches a typo before it ships. The engineer running the weekly payment dispatch can notice if the numbers look off. When you automate, these human controls go away, so the controls that replace them have to be well-designed, because there&#39;s no one left to catch what the system misses.
We&#39;ve invested in applying agentic controls across our billing orchestration. We implemented a two-phase commit pattern across payment-triggering actions: every action first creates a pending record with a complete audit event, including the agentic initiator, amount, billing account, reason, and timestamp. The audit trail is written before money moves, never as a side effect after.
On top of that, we built a threshold-based approval routing engine: configurable spend limits with automatic escalation requiring approval above defined thresholds. Large payments get flagged proactively and require explicit sign-off before they execute. We also invested in an aggregation service that monitors total spend and bonus activity across the platform in real time, so the system itself knows when something is trending toward an outlier before anyone has to ask.
The most effective controls are structural ones. By creating a system where it is architecturally impossible to issue a payment without a threshold check or an audit record, we enable automation to scale safely. Because proactive safeguards are built directly into the core platform, agentic functions can manage tasks like orchestration, dispatch, and bonus approvals while remaining strictly governed.
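The pattern described in this section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Mercor's implementation: `PaymentGate`, `AuditRecord`, the status names, and the single flat threshold are all hypothetical. The point it shows is structural: the only method that moves money refuses to run unless an audit record already exists and the threshold check has cleared.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from decimal import Decimal
from enum import Enum
from typing import Optional


class Status(Enum):
    PENDING = "pending"
    NEEDS_APPROVAL = "needs_approval"
    EXECUTED = "executed"


@dataclass
class AuditRecord:
    initiator: str
    amount: Decimal
    billing_account: str
    reason: str
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    status: Status = Status.PENDING
    approved_by: Optional[str] = None


class PaymentGate:
    """Hypothetical gate: the single code path allowed to move money."""

    def __init__(self, approval_threshold: Decimal):
        self.approval_threshold = approval_threshold
        self.audit_log = []  # audit records, appended in phase 1
        self.sent = []       # stand-in for calls to a payment provider

    def propose(self, initiator, amount, billing_account, reason):
        # Phase 1: write the audit record before anything else happens.
        record = AuditRecord(initiator, Decimal(amount), billing_account, reason)
        if record.amount > self.approval_threshold:
            record.status = Status.NEEDS_APPROVAL
        self.audit_log.append(record)
        return record

    def approve(self, record, approver):
        if record.status is not Status.NEEDS_APPROVAL:
            raise ValueError("record is not awaiting approval")
        record.approved_by = approver
        record.status = Status.PENDING

    def execute(self, record):
        # Phase 2: money moves only for a logged record that cleared the check.
        if not any(r is record for r in self.audit_log):
            raise PermissionError("no audit record for this payment")
        if record.status is not Status.PENDING:
            raise PermissionError(f"cannot execute a {record.status.value} payment")
        self.sent.append(record)
        record.status = Status.EXECUTED


gate = PaymentGate(approval_threshold=Decimal("500"))
bonus = gate.propose("agent-7", "5000", "acct-1", "milestone bonus")
print(bonus.status)  # Status.NEEDS_APPROVAL
```

Calling `gate.execute(bonus)` at this point raises `PermissionError`; it succeeds only after `gate.approve(bonus, "finance-lead")`.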
<h1>What we&#39;ve shipped</h1>
Since October, the team has shipped:
<ul><li>Automated treasury forecasting: Automated forecasting of our float eliminated overdraft risk and shifted treasury operations away from engineering, establishing a clean separation between financial operations and product development.</li><li>Expert earnings visibility: Real-time payout status, unified earnings views, and forward payment projections, now the third most visited page on the Work Platform, with a measurable reduction in &quot;where is my money&quot; support tickets.</li><li>Billing canonicalization: Billing Accounts introduced as the primary financial primitive, linking projects and contracts to financials and establishing Finance as the owner of revenue recognition.</li><li>Centralized observability: Real-time KPI dashboards, structured alerting across payout providers, and incident runbooks that have significantly reduced production incidents and mean time to resolution.</li><li>100% bonus auditability: Immutable audit records on all on-platform bonus activity. Every payment now carries initiator attribution, billing account, reason code, and a complete approval history. Two-phase commit architecture ensures the audit trail is written before money moves, not after.</li><li>Time agnostic billing: Migrated billable data to UTC, eliminating a class of timezone-related billing errors that had real cost implications at scale.</li></ul>
These represent a shift from a payments system that was outrunning its infrastructure to one with the controls, visibility, and data quality needed to scale further.
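To make the timezone point concrete, here is a small sketch of why bucketing billable work by local date can disagree with bucketing by UTC date. The function name and the example timestamp are hypothetical; the only claim is the general one that a single instant can fall on different calendar days depending on the zone used.

```python
from datetime import datetime, timedelta, timezone


def billing_date_utc(moment: datetime) -> str:
    """Return the UTC calendar date a timestamp belongs to."""
    if moment.tzinfo is None:
        # A naive timestamp cannot be placed on the UTC timeline safely.
        raise ValueError("naive datetime: attach a timezone before billing")
    return moment.astimezone(timezone.utc).date().isoformat()


# Hypothetical work session ending at 8:30 pm in a UTC-7 zone on March 31.
utc_minus_7 = timezone(timedelta(hours=-7))
work_end = datetime(2025, 3, 31, 20, 30, tzinfo=utc_minus_7)

print(work_end.date().isoformat())  # 2025-03-31 (local calendar day)
print(billing_date_utc(work_end))   # 2025-04-01 (UTC calendar day)
```

If one system bills by the local day and another by the UTC day, this session lands in different months. Normalizing to one zone at ingestion removes that disagreement.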
<h1>What we&#39;re building</h1>
The foundation is almost in place. The next phase is where it gets interesting.
Correctness as a primitive: Shipping ledger-backed payouts, where all money movement is written and executed from a universal, immutable, append-only ledger with absolute correctness. That means zero double payouts and a payout failure rate below 0.01%.
Canonical payment state machines: Decomposing the current tightly coupled orchestration into a state-based billing model that cleanly separates what is owed from how money moves. This is the unlock for faster development, safer launches, and future capabilities including multi-instrument wallets, accruals, and dynamic pricing.
Completely autonomous payout dispatch: Fully automated weekly payout execution with zero human operational oversight required. The target is 95%+ of payouts executed at or ahead of SLA, with zero orchestration regressions.
Human data tasks as a native billing construct: Native task-based payouts and invoicing based on canonical task completion signals, targeting automation of 90% of task-based payment activity.
Budgets and approvals: Threshold-based approval routing for all non-billable spend, with configurable per-expert and per-account thresholds, automated escalation tiers, pattern-based fraud detection, and real-time spend dashboards for Finance and account owners.
Plug-and-play and self-service: Shifting payment operations away from a reactive model by enabling Finance, Ops teams and AI agents to proactively manage payouts, fixups, and disputes through self-service tooling with end-to-end audit trails and role-based permissions.
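One common way to get "zero double payouts" out of an append-only ledger is to give every entry an idempotency key and refuse to write the same key twice. The sketch below illustrates that idea only; `Ledger`, `LedgerEntry`, and the key format are hypothetical, and a production ledger would enforce the uniqueness constraint in durable storage rather than in memory.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)  # entries cannot be mutated after creation
class LedgerEntry:
    idempotency_key: str
    expert_id: str
    amount: Decimal


class Ledger:
    """In-memory, append-only ledger sketch."""

    def __init__(self):
        self._entries = []
        self._seen_keys = set()

    def append(self, entry: LedgerEntry) -> bool:
        # A retried request carries the same key and is ignored.
        if entry.idempotency_key in self._seen_keys:
            return False
        self._seen_keys.add(entry.idempotency_key)
        self._entries.append(entry)
        return True

    def balance(self, expert_id: str) -> Decimal:
        # Balances are derived from entries, never stored and edited.
        amounts = (e.amount for e in self._entries if e.expert_id == expert_id)
        return sum(amounts, Decimal("0"))


ledger = Ledger()
payout = LedgerEntry("payout:2025-W14:expert-1", "expert-1", Decimal("100.00"))
print(ledger.append(payout))       # True: first write succeeds
print(ledger.append(payout))       # False: the retry is a no-op
print(ledger.balance("expert-1"))  # 100.00
```

Corrections follow the same rule: instead of editing an entry, append a new one with a negative amount and its own key, so the history stays intact.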
The scale is already exceptional. The engineering to match it is what comes next.
<h1>Come build with us</h1>
Mercor is growing faster than most companies ever will, and the payments system is where a lot of that growth has to be absorbed. Some of what we&#39;ve built, we built because we saw it coming. Some of it we built because we had to. All of it is load-bearing, and all of it needs to get better.
We&#39;re building a team of exceptional engineers who know what scale looks like and will build the financial backbone that supports the next phase of growth. This includes automating payout orchestration, building an immutable operational ledger, designing settlement and reconciliation pipelines, improving end-to-end observability across payout to processor to bank, hardening controls and fraud detection across the marketplace, and decoupling processor dependencies from core infrastructure.
This is a place for people who care deeply and build boldly, who want the pace and ownership of a hyper-growth startup with the rigor of a fintech, and who understand that at our volume, those two things are the same job. You&#39;ll work in-person with a team of ambitious, diverse people on problems that matter, with direct exposure to the research, labs, and startups at the AI frontier. The problems are genuinely hard, the stakes are genuinely real, and the people working on them are genuinely exceptional.
If that sounds like your kind of work, we&#39;d like to talk.
<a href="https://www.mercor.com/careers/?ashby_jid=3c02ed8c-3807-4aef-91cc-ccca5ab578bd" target="_blank" rel="noopener noreferrer" class="text-indigo-600">View the role at mercor.com/careers →</a>
Mercor is a profitable Series C company valued at $10 billion, trusted by 6 of 7 Mag 7 companies. We work in-person five days a week in San Francisco.

Derek Shimozawa

When you go from $2 million a month to $2 million a day

<h1>AI failure modes when we pushed frontier models on real finance tasks</h1>
Last fall, the <a href="https://www.wsj.com/lifestyle/careers/harvard-mba-employment-rate-job-hunt-difficulty-addfc3ec?gaa_at=eafs&amp;gaa_n=AWEtsqf1PmQ1Yq1aTs3z9LTYnu1_AegFki8vZ0MyZhTdQrixv2rpDt7kSIUhpxNTc_E%3D&amp;gaa_ts=69bda6fa&amp;gaa_sig=-zLLJuGtICc0Nmmr0lpTyUJ2zREy7JXPE8uGf7bqtHqB1eJog-ocoGVqACQhc1clZRV1t4w9B_nvsR_FOoLnow%3D%3D" target="_blank" rel="noopener noreferrer" class="text-indigo-600">Wall Street Journal</a> reported that 23% of job-seeking Harvard MBAs were still looking for work three months after graduation. Then, last month, Anthropic published <a href="https://www.anthropic.com/research/labor-market-impacts" target="_blank" rel="noopener noreferrer" class="text-indigo-600">research on labor market impacts of AI</a> that placed financial analysts among the ten occupations most exposed to AI displacement.
The anxiety is understandable. But we think it&#39;s premature.
We&#39;ve stress-tested frontier AI models on financial reasoning tasks that are representative of real work in earnings analysis, deal evaluation, and investor decks. What we found suggests a meaningful gap between how these models perform on standard benchmarks and how they perform when you hand them the complex, multimodal inputs that real investors work with every day.
When you give a model real-world finance inputs that combine charts, graphs, and images, instead of typed-out numbers, accuracy diminishes substantially. GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6 consistently fail in two ways: misreading values from dense visual documents, and applying the wrong financial operation even when the inputs are correct.
<h2>The setup</h2>
We constructed 25 tasks based on real financial documents: earnings reports, investor presentations, roadmap slides, and regulatory fee schedules. Each task requires identifying specific numbers from a document and performing a financial calculation: a margin, a growth rate, a dilution percentage, or a ratio. Each task has a single correct numerical answer, so scoring is unambiguous: pass or fail.
We started with the original image of the document page (image-only) and then constructed a text-only version by writing out the information from the image in free text. This lets us separate two failure modes that standard benchmarks conflate: can the model do the math? versus can the model read the document?
We tested three frontier models (GPT-5.4, Gemini 3.1 Pro, and Claude Opus 4.6) across both variants, for 50 evaluations per model (25 tasks × 2 variants). Each model receives the same prompt with the same evidence. If the model’s final numerical answer falls within a defined tolerance of the ground truth, it passes.
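The pass/fail rule above can be written down in a few lines. The post does not state the tolerance used, so the 1% relative tolerance below is purely an assumption for illustration, as are the function names and the example values.

```python
def passes(answer: float, truth: float, rel_tol: float = 0.01) -> bool:
    """True if `answer` is within `rel_tol` of the ground truth.

    The tolerance is measured relative to the ground truth, and the
    default of 1% is an assumed value, not the one used in the eval.
    """
    return abs(answer - truth) <= rel_tol * abs(truth)


def accuracy(results) -> float:
    """Fraction of (answer, truth) pairs that pass."""
    results = list(results)
    return sum(passes(a, t) for a, t in results) / len(results)


# Hypothetical answers against a ground truth of 12.5.
print(passes(12.6, 12.5))  # True: off by 0.1, allowed 0.125
print(passes(13.0, 12.5))  # False: off by 0.5
print(accuracy([(12.6, 12.5), (13.0, 12.5)]))  # 0.5
```

A single numeric target with a fixed tolerance is what keeps the scoring binary: there is no partial credit for a correct method applied to a misread value.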
<h2>What we found</h2>
<h3>Models need to read the document to reason correctly</h3>
Before running the full eval, we ran a sanity check: we gave each model only the question with no other sources, and asked it to answer from parametric knowledge alone.
The results are decisive. Across the 25 tasks, Claude Opus 4.6 answered 1/25 correctly (4%), GPT-5.4 answered 1/25 (4%), and Gemini 3.1 Pro answered 0/25 (0%). Claude Opus 4.6 and GPT-5.4 each passed only one task, task_136 (shelter’s CPI contribution ratio = 3.0×), which has a small integer answer that the two models guessed correctly by chance.
This demonstrates that the benchmark is genuinely testing document reasoning, not recall of memorized financial figures.
<h3>Models perform better on text than images</h3>
When provided with clean extracted text, model performance is credible: text-only accuracy ranged from 72% (GPT-5.4) to 80% (Gemini 3.1 Pro), with Claude Opus 4.6 at 76%. When provided with only the document image, accuracy dropped to between 56% and 64%, a decline of 16 percentage points for Gemini 3.1 Pro and GPT-5.4, and 20 percentage points for Claude Opus 4.6.
The text-to-image degradation is strikingly consistent: -20pp for Claude Opus 4.6, -16pp for Gemini 3.1 Pro, and -16pp for GPT-5.4. That consistency points to a general weakness: visual extraction from real financial documents is a bottleneck for every frontier model we tested, not a quirk of any single one.
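For readers who want to check the arithmetic, the sketch below derives each model's image-only accuracy and the equivalent number of tasks from the figures reported above. The per-model image-only values and task counts are derived here from the stated text-only accuracies and drops; they are not separately reported numbers.

```python
TASKS = 25  # tasks per variant, from the setup

# Reported text-only accuracy (%) and text-to-image drop (percentage points).
text_only = {"GPT-5.4": 72, "Gemini 3.1 Pro": 80, "Claude Opus 4.6": 76}
drop_pp = {"GPT-5.4": 16, "Gemini 3.1 Pro": 16, "Claude Opus 4.6": 20}


def image_only(model: str) -> int:
    """Image-only accuracy (%), derived as text-only minus the drop."""
    return text_only[model] - drop_pp[model]


def tasks_for(percent: int) -> int:
    """Number of the 25 tasks that a percentage corresponds to."""
    return percent * TASKS // 100


for model in text_only:
    print(
        f"{model}: text {text_only[model]}% ({tasks_for(text_only[model])}/25), "
        f"image {image_only(model)}% ({tasks_for(image_only(model))}/25), "
        f"lost {tasks_for(drop_pp[model])} tasks"
    )
```

With 25 tasks, each task is worth 4 percentage points, so a 16pp drop corresponds to 4 tasks and a 20pp drop to 5.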

<img src="https://cdn.sanity.io/images/h6s14f4z/production/345d94c88b3c288ae55a2018652eaa3f39638c00-2400x1086.png" alt="Model Accuracy Summary - Text Only vs Image Only" style="width: 100%; border-radius: 10px;" />
<img src="https://cdn.sanity.io/images/h6s14f4z/production/abf65159313a04b11c0dd601f1bfb9b8af5ead41-2400x1293.png" alt="Text → Image Degradation by Model" style="width: 100%; border-radius: 10px;" />
<h3>The same task, different results</h3>
The clearest illustration of the text-vs-image gap comes from task_138, a Fidelity Rising Wedge pattern task. The question asks for the dollar difference between the upper and lower trend lines at the entry point.
<img src="https://cdn.sanity.io/images/h6s14f4z/production/14b3950bb1e65047e09c53224ecab54aed388a8c-2400x1503.png" alt="Text vs. Image: Same Task, Different Results - task_138 Fidelity Rising Wedge Entry" style="width: 100%; border-radius: 10px;" />
In the text-only condition, all three models answer correctly ($4.00). In the image-only condition, only Gemini 3.1 Pro gets it right. Claude Opus 4.6 reads the wrong anchor point and returns $2.00. GPT-5.4 lands just outside tolerance at $4.30. Both failing models knew exactly how to compute the final value, but they couldn’t reliably read the inputs off the chart.
<h2>Two failure modes drive the collapse</h2>
When we dug into why image-only accuracy drops so consistently, two patterns emerged:
<img src="https://cdn.sanity.io/images/h6s14f4z/production/3e9ae2e5a8b6a816ee0d006e61ea11f25fcead49-1264x484.png" alt="" style="width: 100%; border-radius: 10px;" />

Visual extraction is the main reason models fail on the image-only tasks. They often anchor to the wrong element in dense charts, especially in documents with multiple graph types on a single page, and pull a plausible but incorrect value when the question does not explicitly specify where to look. This mirrors real-world use: models usually have to identify the relevant region of an image themselves rather than being guided to a specific value.
The reasoning failure is less visible but more informative. Even when models have the correct values in front of them (in the text-only condition, where extraction is not a factor), they sometimes apply the wrong financial operation, for example returning an absolute difference instead of a percentage change, or inverting a ratio. These are standard calculations, suggesting that the issue is not complexity but how models execute multi-step financial reasoning.
Sometimes, both failure modes appear together. In these dual-failure cases, models first extract the wrong values from the image, and then compound the error by reasoning about those values incorrectly.
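The reasoning failures named above are easy to show with numbers. In this sketch the inputs are invented for illustration and are not drawn from any task in the benchmark; the point is only that the wrong operation on correct inputs yields an answer that looks plausible and is still wrong.

```python
def absolute_change(old: float, new: float) -> float:
    return new - old


def percent_change(old: float, new: float) -> float:
    return (new - old) / old * 100


def ratio(numerator: float, denominator: float) -> float:
    return numerator / denominator


# Hypothetical question: by what percentage did revenue grow from 80 to 100?
old_revenue, new_revenue = 80.0, 100.0
print(percent_change(old_revenue, new_revenue))   # 25.0 (what was asked)
print(absolute_change(old_revenue, new_revenue))  # 20.0 (wrong operation)

# Hypothetical question: what is A as a multiple of B, with A=3.0 and B=1.5?
print(ratio(3.0, 1.5))  # 2.0 (what was asked)
print(ratio(1.5, 3.0))  # 0.5 (inverted ratio)
```

Because 20 and 25 are both reasonable-looking growth figures, an error like this is hard to spot without a ground-truth answer to score against.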
<h2>Why this matters</h2>
Standard AI benchmarks don’t represent real financial work. In practice, investors have to review messy data, like 40-page PDFs with nested tables, multi-panel charts, margin bridges, and footnotes. Existing chart and document benchmarks like ChartQA and DocVQA often use cleaner layouts or isolate a single visual element. In contrast, our tasks are drawn from dense, real financial documents and require identifying the correct values before reasoning over them.
Our results suggest that frontier models currently handle the visual extraction step far less reliably than topline benchmark scores imply. The industry’s trajectory toward improved visual reasoning is clear. But before the conversation about AI displacing financial analysts goes further, it’s worth asking: what exactly are models impressive at, and under what conditions?
If you’d like to see our full methodology, task specifications, samples, and per-task failure mode analysis, <a href="https://www.mercor.com/apex/contact/" target="_blank" rel="noopener noreferrer" class="text-indigo-600">please reach out here</a>.

Ayushi Sinha

Multimodal AI Lead

Chirag Mahapatra

Member of Technical Staff, Mercor

Abhi Kottamasu

AI failure modes when we pushed frontier models on real finance tasks

AI can't read an investor deck

<h4>Ayushi founded a healthcare AI startup before joining Mercor as a Product Manager. She came looking for a team that understood founder life and problems worth solving at scale.</h4>
Tax season is coming up, and my accountant just reminded me that the last time I had a paycheck was 2022.
Most recently, I was a solo founder building Turmerik, a healthcare AI company. I lived out of a suitcase. I grew a massive LinkedIn presence to have the credibility to cold-DM biotech founders and pray they&#39;d reply. My Notes app became a graveyard of pitch variations, pricing models, and 2 a.m. pep talks I wrote to myself. If you&#39;ve ever been a solo founder, you know the specific loneliness of celebrating a small win and having nobody in the room to high-five.
I ultimately sunset Turmerik. In a lot of ways, it felt like failure. The kind you can rationalize on a LinkedIn post but still cry about at 11 p.m. on a Tuesday.
So when I started thinking about what came next, I was incredibly deliberate. While my friends in NYC were making spreadsheets to rank their Raya dates, I was making spreadsheets to rank companies. I tracked every conversation, every vibe check, every red flag. Same energy, but arguably even higher stakes. You spend more waking hours with your coworkers than your romantic partner. You&#39;ve really got to know what you want.
The problem was, I didn&#39;t, or at least not yet. So I went gorilla trekking in Uganda with my friend Sahir. After three days of staring down silverbacks and thinking about my life in the middle of the jungle, I came back to the States with my list of criteria.
Here&#39;s what I knew I wanted.
<ol><li>I wanted to work with people who got it. Founder life is hard to explain to someone who hasn&#39;t done it. The sleepless nights, the identity crisis when your company is you and then it isn&#39;t anymore. I needed colleagues who understood that language. My manager at Mercor went through YC and ~30% of my colleagues are former founders. They get it.</li><li>I wanted to be somewhere that had already found product-market fit. After years of searching for PMF myself, I wanted to experience what it feels like when the thing is actually working. At Mercor, we pay out $2M a day to experts around the world and work with today’s most important frontier AI companies. There are problems you only get to solve at that kind of scale, and I wanted to be in the room for them.</li><li>I wanted to be AI-native. This was the hardest trade-off, because healthcare isn&#39;t just something I worked in; it&#39;s something I grew up talking about at the family dinner table, since my parents are doctors. I loved healthcare and was genuinely inspired by the mission, the complexity, and the patients whose lives you might actually improve. But I realized that in 2026, I&#39;d have to choose: work at the frontier on technically interesting problems with massive impact at scale, or have a massive impact on individual lives. I had to be honest with myself. The technical problems I was spending most of my time on in healthcare weren&#39;t frontier AI. They were data pipeline plumbing. I wanted to work at the edge of what&#39;s possible, not massage CSVs into submission.</li><li>When I looked at my resume, I noticed a pattern: I boomerang between prestige-maxxing and leaps of faith. Microsoft and Princeton, then Nines (unproven but an incredible learning experience). Bain Capital and Harvard Business School, then Turmerik (a crash course in building AI for biotech and pharma, where &#39;move fast and break things&#39; is literally against FDA regulations). 
The safe move would have been to boomerang back to a big name, collect the logo, and let the brand do the talking. But I&#39;m at the stage of my career where you either get escape velocity toward the exec track or you don&#39;t. And getting there means becoming a real expert in something, or being able to say you drove $XX M in revenue, or that you built the core tech that changed the trajectory of a company. You can&#39;t say any of that if you&#39;re coasting at a place where the biggest risk is choosing the wrong OKR. I wanted to go somewhere where I could say I helped create a step-function change.</li></ol>
So I joined Mercor. And now my life looks extremely different.
I have a desk with a curved monitor. I make small talk in the kitchen, which is a skill I am actively re-learning after years of talking mostly to myself and my laptop. I get to dress up for the office, which turns out to be an underrated perk when your previous work uniform was &quot;whatever was clean in the suitcase.&quot; I have my first 401k and I still need to figure out where to invest it, so if you have suggestions, I&#39;m all ears.
But beneath the jokes, the real thing is this: I&#39;m not lonely anymore. I work with people who are sharp, creative, and moving fast on problems that matter. Mercor is building the infrastructure for how the world will work in the age of AI, and I get to work with a killer team to help shape that. After years of trying to build something from nothing, there is a specific energy in joining a team that&#39;s already sprinting and finding out you can keep pace.
If you&#39;re a former founder wondering whether you can thrive outside of founder mode, I’m here to reassure you that yes, you can survive as a W2 employee. And if you&#39;re thinking about Mercor specifically, come find out what it feels like when the rocket is already off the launchpad and someone hands you the controls.

AI founder turned Mercor team member: why Ayushi left healthcare AI to build at the frontier with a team that knows founder life.

Why I Joined Mercor – Ayushi Sinha

At Mercor, we’ve spent years on the frontier working with AI labs to accelerate model capabilities through human expertise. We see firsthand how raw intelligence compounds daily. However, enterprises still haven’t seen that same compounding value in their own AI deployments. What’s the bottleneck?
<h2>Stop guessing about your agents</h2>
Most enterprise agent deployments follow the same pattern: a team guesses at what the agent should do, hand-writes prompts, manually configures tool calls, and hopes it performs as intended. When it doesn&#39;t, they guess again. And again. This is why enterprise AI stalls.
Not because the models aren&#39;t smart enough, but because your agent development approach is backwards.
Here&#39;s the guesswork loop most enterprises are stuck in:
They guess where to start. Companies don’t know which agents to deploy because no one has the full picture. The teams closest to the work know which tasks are tedious, but only see their slice of a cross-functional, multi-step web. The people who own the function see the end-to-end process, but are a step removed from where real friction lives. So someone picks a use case that sounds promising, and the team runs with it, without evidence of where agents will actually create value.
They speculate on how agents should behave. The agents they do build rely on improvisation: hand-tuned prompts, ad hoc tool selection, and configurations based on what one person thinks the workflow should look like — not how the highest performers actually do the work.
They assume agents work as intended. Errors compound because there&#39;s no clear way to tell whether an agent&#39;s output is acceptable, good, or quietly degrading. The bar becomes &quot;it seems to work,&quot; but teams don’t know what’s failing or what could be better. They discover failures weeks later through customer complaints or downstream breakdowns, and spend more time repairing damage than the agent saved.
They improve agents on vibes, not verification. There&#39;s no systematic way to improve agents over time. Updating workflows means painstaking manual reviews, rewriting prompts from scratch, and repeating the whole cycle.
If you&#39;re guessing at any of these steps, you&#39;re guessing about your agents. No amount of model intelligence closes that gap on its own.
<h2>Introducing the Mercor Enterprise AI Platform: Groundwork for your agents</h2>
Breaking the guesswork loop requires the opposite approach. You start by understanding how your work actually gets done. Then, you programmatically convert that understanding into agent behavior, and continuously measure output against quality. The gap between the output and the desired quality becomes the feedback signal that improves the agent over time. Together, these are the groundwork that separates agents that compound in value from agents that stall.
The <a href="https://www.mercor.com/enterprise/" target="_blank" rel="noopener noreferrer" class="text-indigo-600">Mercor Enterprise AI platform</a> delivers these three core capabilities:
<h3>1. Know where and what to build: understand organizational context</h3>
Solid groundwork starts with evidence. We map your workflows by understanding how your people actually work — screen-level workflow capture, extraction from internal wikis and application system logs, and AI-led employee interviews that surface the institutional knowledge living in people&#39;s heads. The judgment calls, the edge-case handling, the &quot;we do it this way because of X&quot; reasoning that never makes it into documentation.
We capture all of it as structured, machine-readable data, so you know exactly where agents will create value and what they need to do it well.
<h3>2. Teach agents how to behave: programmatically translate context into agent behavior</h3>
Understanding context is only valuable if you can systematically convert it into agent behavior. We programmatically translate organizational context into agent behavior specs and quality guardrails. Instead of hand-tuning prompts, we generate a robust set of evaluations and success criteria from how your top performers work, and then automatically optimize agent behavior against them. Agents are deployed in hours rather than days, and at a higher baseline quality than manual specification can achieve.
<h3>3. Learn continuously: measure agent output against defined standards</h3>
Agents without a quality signal fail silently, and fixing them is manual. We catch errors in real time using quality verifiers that compare every output against your organization&#39;s definition of &quot;good.&quot; When an agent&#39;s confidence is low, we flag it for human review before it causes damage. When corrections come in — from automated detection or human feedback — they feed directly back into the agent&#39;s behavior specs and organizational context.
Every failure becomes a permanent improvement.
This closes the remaining gap: you stop guessing whether agents are working and how to fix them — because quality is measured continuously and corrections are systematic.
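The gating step in this loop can be sketched as a function that scores an output with a set of verifiers and routes anything below a threshold to human review. Everything named here (`gate`, `Verdict`, the two toy verifiers, and the 0.8 threshold) is a hypothetical illustration; real verifiers would encode an organization&#39;s own quality criteria.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Verdict:
    score: float
    action: str  # "release" or "human_review"


def gate(output: str,
         verifiers: Sequence[Callable[[str], float]],
         threshold: float = 0.8) -> Verdict:
    """Score an output and decide whether it can ship unreviewed."""
    # The weakest verifier decides: one failed check is enough to hold.
    score = min(verify(output) for verify in verifiers)
    action = "release" if score >= threshold else "human_review"
    return Verdict(score, action)


# Toy verifiers for an invoice-drafting agent (assumptions for the example).
def has_total_line(output: str) -> float:
    return 1.0 if "Total:" in output else 0.0


def within_length(output: str) -> float:
    return 1.0 if len(output) <= 500 else 0.0


checks = [has_total_line, within_length]
print(gate("Invoice 42\nTotal: $100", checks).action)  # release
print(gate("Invoice 42", checks).action)               # human_review
```

Taking the minimum rather than the average is a deliberate choice in this sketch: an output that fails one check outright should not be rescued by passing the others.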
<h2>Tools built on frontier AI infrastructure</h2>
This platform evolved from the same technologies we&#39;ve pioneered to help the world’s leading frontier AI labs organize and structure human expertise to teach their models. The tools we use to capture expert behavior, build evaluation criteria, and systematically close performance gaps have been tested at scale across the most advanced AI development pipelines.
Now we apply that same technology to your organization.
We deliver the groundwork through a set of discrete, composable tools designed to scaffold your agents on any orchestration platform:
Organizational Context Graph: Structured map of how your organization actually works, built from data integrations with your organization’s systems and tools (e.g. ticketing platforms, CRMs, communication tools, codebases, knowledge bases, internal docs), application traces, team workflow capture, and AI-moderated interviews that surface institutional knowledge in people’s heads.
Agent Specification Engine: Human-readable, machine-executable agent specs translated from your context graph. Instructions are derived from how your best people actually work.
Quality Guardrails: Task-output scoring functions that compare agent output against expert-defined quality thresholds and automatically route outputs that fall below your standards to human review.
Continual Learning Harness: Machine- and human-led feedback loops that automatically update agent behavior based on evals — task objectives, agent behavior, and output quality scoring functions — that pinpoint where and why your agents fail. Every correction makes the system better.
Agentic Workflow Data: Proprietary datasets of completed tasks and reasoning traces across economically valuable domains — law, finance, HR, accounting, software engineering, market research, competitive analysis, and more — so your agents don&#39;t start from scratch.
The modularity means we can deploy the right combination of components for your specific challenge — capturing context for a single workflow, deploying quality guardrails on agents you&#39;ve already built, or running the full pipeline end to end.
<h2>Why Mercor</h2>
Three things set us apart.
We&#39;ve proven this with frontier AI labs. We are a leading partner to frontier labs for organizing and deploying human expertise to improve model performance. The infrastructure we use to capture expert behavior, build evaluation criteria, and systematically close performance gaps is battle-tested at billion-dollar scale. We understand what actually matters for strong agent performance at a deeper level than anyone else — because we&#39;ve been defining and measuring it for years.
We&#39;ve deployed this for ourselves. Mercor runs highly sophisticated, self-improving agents at scale — including AI-driven hiring processes that evaluate millions of candidates for our expert network. This isn&#39;t theoretical. We&#39;ve iterated these systems to a level of quality that our competitors cannot demonstrate.
We understand self-learning. We are the best in the world at teaching models to improve using benchmarks. We’ve spent years working with labs and developing benchmarks like <a href="https://www.mercor.com/blog/introducing-apex-agents/" target="_blank" rel="noopener noreferrer" class="text-indigo-600">APEX</a> to measure whether AI models can perform economically valuable work in law, finance, and software engineering. Through that work, we’ve built technology that pinpoints exactly where and why an agent fails. We deeply understand the common patterns behind agent failures: confusing implications with grounded facts, insufficient output verification, incomplete tool access, and more. That same expertise and technology is the foundation of everything we’ve built. Without measurable, specific corrections, no amount of context helps. Your agent will keep making the same mistakes.
Your best people become coaches to the agents they work with every day. The agents remember and operate by your gold standards, continuously learning from real evidence of what great work looks like.
Build on groundwork, not guesswork.

Introducing the Mercor Enterprise AI Platform. Build AI agents designed for your company. 

Introducing Mercor Enterprise AI

Introducing <a href="https://arxiv.org/pdf/2601.08806" target="_blank" rel="noopener noreferrer" class="text-indigo-600">APEX-SWE</a>, a new benchmark created in collaboration with <a href="https://cognition.ai/" target="_blank" rel="noopener noreferrer" class="text-indigo-600">Cognition</a>. It measures whether frontier AI models can handle real software engineering work – shipping systems, diagnosing failures, and implementing fixes.
Our results show that even top performing AI models hit a wall when tasked with the complexities of real-world software engineering. At the time of release, the top score on the APEX-SWE leaderboard is GPT-5.3 Codex at 41.5% Pass@1, leaving plenty of room for hillclimbing.
Cognition was critical to making this benchmark reflect the expectations of software engineers. They reviewed a subset of integration and observability tasks, created by SWEs hired through Mercor’s platform, pressure-testing how real production systems actually break and get fixed. The resulting tasks reflect the expectations agents must meet to be useful in real software engineering work.
APEX-SWE builds on Mercor’s family of benchmarks for evaluating AI models at economically valuable work, including <a href="https://www.mercor.com/apex/apex-agents-leaderboard/" target="_blank" rel="noopener noreferrer" class="text-indigo-600">APEX-Agents</a>, the <a href="https://www.mercor.com/apex/apex-v1-leaderboard/" target="_blank" rel="noopener noreferrer" class="text-indigo-600">AI Productivity Index (APEX)</a>, and the <a href="https://www.mercor.com/apex/ace-leaderboard/" target="_blank" rel="noopener noreferrer" class="text-indigo-600">AI Consumer Index (ACE)</a>.
<iframe src="https://player.vimeo.com/video/1176599561" frameborder="0" allow="autoplay; fullscreen; picture-in-picture; clipboard-write; encrypted-media; web-share" referrerpolicy="strict-origin-when-cross-origin" title="Introducing APEX-SWE | Mercor x Cognition" class="mb-8 mt-4 aspect-video w-full max-w-[1080px] rounded-lg bg-gray-100 object-cover sm:mt-6"></iframe>
AI coding models and assistants are now a core part of software engineering, with recent industry reports indicating that over 90% of developers use AI coding assistants. Nearly half of all code at major technology companies is AI-generated.
Traditional coding benchmarks have become saturated. GPT-4 has improved from 67% to 90% on HumanEval in just two years, and the most recent Opus models consistently score over 75% on SWE-bench Verified. OpenAI has declared some <a href="https://openai.com/index/why-we-no-longer-evaluate-swe-bench-verified/" target="_blank" rel="noopener noreferrer" class="text-indigo-600">SWE benchmarks contaminated</a>, with models able to reproduce original patches verbatim from task IDs alone.
However, even before saturation, these benchmarks presented a misleading picture of AI models&#39; real-world coding ability. According to <a href="https://my.idc.com/getdoc.jsp?containerId=US53204725" target="_blank" rel="noopener noreferrer" class="text-indigo-600">IDC</a>, developers spend only 16% of their time writing code and building new features. The remaining 84% involves CI/CD, infrastructure monitoring, deployment, and debugging.
Professional software engineering extends far beyond writing short functions or patching a single file. Real production environments involve cross-platform integration, infrastructure provisioning, and debugging production failures with incomplete information.
<h2>Leaderboard results</h2>
All of the models we evaluated fail to reliably solve the real-world production software engineering tasks in APEX-SWE.
GPT-5.3 Codex (High) tops the leaderboard at 41.5% Pass@1, followed by Opus 4.6 (High) at 40.5% and Opus 4.6 (High) at 38.7%.
<img src="https://cdn.sanity.io/images/h6s14f4z/production/9d25202271de2ab93eb4fa34f7684504589733d7-4470x2825.png" alt="Performance of Models on APEX-SWE Leaderboard" style="width: 100%; border-radius: 10px;" />
APEX-SWE is split evenly between two task types that reflect real-world software engineering work:
<ul><li>Integration tasks, which require constructing end-to-end systems across heterogeneous cloud primitives, business applications, and infrastructure-as-code services.</li><li>Observability tasks, which require debugging production failures using telemetry signals such as logs and dashboards, as well as unstructured context.</li></ul>
Models perform better on Integration tasks than Observability tasks. For Integration, Claude Opus 4.5 (High) and GPT-5.4 (High) lead at 50.7% Pass@1. Observability scores are lower overall, with GPT-5.3 Codex leading at 33.3% Pass@1. For both task types, top-performing models demonstrate strong capabilities but still do not meet the production bar.
For both task types, when models succeed they demonstrate epistemic reasoning—treating their generated code as a hypothesis that must be tested against the actual system state before being finalized. This iterative and careful approach results in more successful code executions as problems are overcome through debugging.
<img src="https://cdn.sanity.io/images/h6s14f4z/production/1f232db8cee704cb0b4f6a2598e2a26abd8d76d0-3560x2247.png" alt="Model Performance Comparison - Integration vs Observability" style="width: 100%; border-radius: 10px;" />
<h2>Our process</h2>
Each integration task contains an environment for the agent to operate in. Every environment includes an ephemeral PostgreSQL database and Plane, plus a task-dependent subset of six other services (share of tasks in parentheses): LocalStack (56%), which emulates AWS primitives such as S3, Lambda, DynamoDB, and Kinesis; EspoCRM (35%); MailHog (33%); Mattermost (32%); Medusa (31%); and Zammad (26%). Tasks were created by software engineers with 3+ years of experience who ran validation tests and created gold standard outputs.
<img src="https://cdn.sanity.io/images/h6s14f4z/production/41f64b98bc44cbeb18e17a4f2ad2514034d0fb01-4426x1788.png" alt="Integration task process" style="width: 100%; border-radius: 10px;" />
Each observability task deploys a containerized environment orchestrating five services: a client workspace, Loki and Promtail for log aggregation, Grafana for visualization, and Plane/Mattermost for ticket and chat context. Engineers scripted synthetic logs (500 to 1,000 lines of normal operation mixed with 10-20 lines of bug symptoms) and chat history to replicate a production failure, as well as Dockerfiles, task metadata, and patches. Tasks were derived from real-world GitHub Issue-PR pairs, sourced from repositories with at least 350 stars that we filtered for complexity and stability. Observability tasks are distributed across five widely used languages: Go (30%), Python (25%), TypeScript (25%), Java (10%), and C++ (10%).
<img src="https://cdn.sanity.io/images/h6s14f4z/production/52b314ca7c2bd9a13063f4c83aa654abef28bd82-4606x1788.png" alt="Observability task process" style="width: 100%; border-radius: 10px;" />
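To make the synthetic-log step concrete, here is a minimal sketch of how a small number of symptom lines could be buried in routine traffic. The log format, the `synth_logs` helper, and the specific messages are hypothetical illustrations, not the scripts used to build APEX-SWE.

```python
import random


def synth_logs(n_normal, n_symptom, seed=0):
    """Return log lines: mostly routine traffic plus a few symptom lines."""
    rng = random.Random(seed)
    # Routine lines (hypothetical format).
    lines = [
        f'level=info msg="request ok" route=/api/items status=200 '
        f"latency_ms={rng.randint(5, 80)}"
        for _ in range(n_normal)
    ]
    # Symptom lines (hypothetical failure: upstream timeouts).
    symptoms = [
        f'level=error msg="upstream timeout" route=/api/items status=504 '
        f"attempt={i + 1}"
        for i in range(n_symptom)
    ]
    # Insert each symptom at a random position so it is buried in normal traffic.
    for symptom in symptoms:
        lines.insert(rng.randint(0, len(lines)), symptom)
    return lines


# Sizes chosen inside the ranges quoted above (500 to 1,000 normal, 10-20 symptom).
logs = synth_logs(n_normal=800, n_symptom=15)
error_lines = [line for line in logs if "level=error" in line]
```

With a ratio like this, fewer than 2% of lines carry the signal, which is why an agent has to query and filter the logs rather than read them top to bottom.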
<h2>Open source</h2>
We have released an <a href="https://huggingface.co/datasets/mercor/APEX-SWE" target="_blank" rel="noopener noreferrer" class="text-indigo-600">open-source dev set via Hugging Face</a> (n=50) with a CC-BY license, and our eval harness is available as an <a href="https://github.com/Mercor-Intelligence/apex-swe" target="_blank" rel="noopener noreferrer" class="text-indigo-600">open-source repo on GitHub</a>. The full leaderboard comprises n=200 tasks that are held out and hidden, with a similar distribution to the open-source set. Read <a href="https://arxiv.org/pdf/2601.08806" target="_blank" rel="noopener noreferrer" class="text-indigo-600">the APEX-SWE technical report</a>.
We thank all the software engineers on the Mercor marketplace who contributed their time to creating APEX-SWE.

Ben Pan

Member of Technical Staff, Cognition

Sam Lee

Silas Alberti

Founding Team Member, Cognition

Introducing APEX-SWE, a new benchmark created in collaboration with Cognition, measuring whether frontier AI models can handle real software engineering work.

Introducing APEX-SWE | Mercor x Cognition 

Introducing the AI Productivity Index for Software Engineering

<h4>Amresh spent nearly a decade advising on AI strategy at McKinsey. He joined Mercor as Director of Operations to build the data infrastructure that actually moves model performance.</h4>
I spent nearly a decade at McKinsey, most recently as an Associate Partner in the Tech practice in San Francisco. I worked with some of the best minds in the industry, sharpened my thinking on AI strategy, and built relationships I&#39;ll carry for the rest of my career. But somewhere along the way, I started craving something different. I wanted to build, not advise. To own outcomes, not just recommend them.
When Mercor came across my radar, the vision was immediately compelling: become the defining talent platform for the AI economy. But what really made the decision easy was a conversation with Brendan. His passion was obvious, but what struck me more was his maturity and clarity of thinking. We were aligned on what we were trying to do - build something genuinely legendary. That kind of founder conviction is rare, and it changes what&#39;s possible.
The timing couldn&#39;t have been better. Over the past year, I&#39;ve watched the AI market shift in a profound way. As frontier models push toward higher-order reasoning and agentic capabilities, the demand for specialized human expertise has accelerated dramatically. Better models need better training data, and that means better experts. Mercor sits right at the center of that dynamic.
Our role is becoming increasingly strategic. We&#39;re helping customers think through their data collection approaches, shaping what &quot;expert-in-the-loop&quot; actually looks like in practice, and growing fast. In my first year, I&#39;ve had the chance to grow accounts to nine-digit run rates and build a team of more than 20 people. I&#39;ve also learned more about AI research than I ever expected; specifically, what kinds of data actually move the needle on model performance. It&#39;s been the steepest and most rewarding learning curve of my career.
What grounds all of that, for me, is the impact on our network of experts. I’ve heard many stories about their motivations for joining Mercor: people who want flexibility, who&#39;ve hit a rough patch, or who simply want to put their expertise to work on their own terms. Now, we’re paying out $2 million to experts every day. It’s clear to me that this new category of work is accelerating as quickly as the AI economy itself is.
The thing that stood out to me early on, and still does, is the talent density. The team skews young, we work hard (yes, including Saturdays), but the culture is genuinely collaborative and down-to-earth. Decisions get made fast and in the open. Our weekly all-hands is a good example: it is substantive and informative, but it also adds levity. We&#39;ve had rap performances, a gumbo band, investors, and experts all show up to share their perspectives. We also highlight experts who share their experience with Mercor and provide the team real-time feedback. It’s a moment to be proud of the impact, but also a reminder there’s always more we can do. It feels like being in on something.
I&#39;ve also been surprised by how much room there is to shape things beyond your immediate role. From organizing weekly happy hours to having direct conversations with the founders about career progression and culture, the scope for impact is real. At McKinsey, influence moved slowly and through layers. Here, if you have a good idea and the drive to push it, things actually change.
Looking ahead, the opportunity feels almost hard to overstate. We&#39;re already one of the fastest-growing companies of all time, and the market headroom is enormous. That growth brings real challenges — some exciting, some humbling — but that&#39;s exactly what makes this moment worth being part of.
If you&#39;re considering Mercor and wondering whether to make the leap: do it. It won&#39;t always be smooth, but I genuinely believe many of us will look back on this period and feel proud that we helped steer something important at a critical moment in how humanity works, learns, and builds. Check out our open roles: <a href="http://www.mercor.com/careers" target="_blank" rel="noopener noreferrer" class="text-indigo-600">www.mercor.com/careers</a>

Amresh spent nearly a decade advising on AI strategy at McKinsey. He joined Mercor to build the data infrastructure that actually moves model performance.

Why I Joined Mercor – Amresh Subramaniam

<h4>Anish Bathwal spent years as a consultant at McKinsey before joining Mercor to be closer to the question every company was asking: &quot;how do we use AI?&quot;</h4>
What does a consultant actually do?
The honest answer is we get dropped into rooms where the problem isn’t even defined yet and told to figure it out. A software company wants to embed AI into a product that’s been built the same way for fifteen years. A university is watching enrollment decline and can’t tell if it’s the cost, the experience, or the competition. Nobody agrees on what’s wrong, let alone what to do. The job is to synthesize ambiguity into something actionable—fast.
I spent my time in consulting doing this across software and education. The industries were different, but by the time I left, every client was asking the same question: how do we use AI? The urgency was striking—organizations that had nothing in common were all hitting the same wall. But the bottleneck was never buy-in. It was never implementation. It was performance. The models weren’t good enough on the problems that actually mattered—the domain-specific work where expertise couldn’t be faked.
No amount of consulting could fix that. What could fix it was better data. The kind produced by real professionals with years of work experience or advanced degrees. I started paying attention to how models actually improve: through high-quality, knowledge-intensive training data generated by people who deeply understand their field. The question that stuck with me was deceptively simple. What would it look like to solve the performance problem at the source?
Then I heard from some friends in the Bay Area about a startup called Mercor. They were hiring thousands of lawyers, doctors, financial analysts and even consultants and deploying them on projects for the world’s top AI labs. The thesis resonated immediately: to build next-generation models, you need highly specialized data, and to create that data, you need real experts generating complex tasks that require deep reasoning. You could have domain experts create evals that teach a model to replicate their reasoning at scale—not expertise delivered to one client at a time, but expertise encoded permanently. Two days later I was on a flight from NYC to SF.
Now I work on web browsing—training models to persist through complex, multi-step queries rather than stopping at the first result. Think about what that unlocks: a model that can dig through dozens of sources to answer a genuinely hard question, the way a researcher or analyst would. Today’s models quit early. The data we create is designed to teach them not to. PhD-level annotators write research-grade queries at volume, paired with quality frameworks rigorous enough to ensure every task actually moves the needle against benchmarks. Researchers at the lab fold what we produce directly into their models. You can watch the benchmarks move with each model release.
The best AI models a year from now will be built on data that doesn’t exist yet. Someone has to create it, and that requires systems that can channel real expertise into training signal at scale. That’s what we’re building.
For anyone at a similar crossroads—where you’ve gotten good at navigating ambiguity but want your work to compound beyond a single engagement—I’d encourage you to <a href="http://www.mercor.com/careers" target="_blank" rel="noopener noreferrer" class="text-indigo-600">check out our open roles</a>.

Former McKinsey consultant Anish Bathwal on why he left consulting to join Mercor, where expert knowledge is training the next generation of AI models.

Why I Joined Mercor – Anish Bathwal

Every nine seconds, someone starts an interview with Monty, our AI interviewer. That someone might be a software engineer, a banker, a farmer — and none of them are talking to a person. Around 10,000 of these conversations happen every day, each one fifteen minutes long, across hundreds of job categories. Three engineering problems make this hard: keeping every session alive, making turn-taking feel natural, and making sure every candidate gets an interview that&#39;s right for them.
<h2>Keeping every session alive</h2>
Every interview has to work. A candidate who gets a broken session doesn&#39;t get a second chance at that role. Each session runs in its own container on <a href="https://modal.com/" target="_blank" rel="noopener noreferrer" class="text-indigo-600">Modal</a>, spun up on demand and torn down when the call ends. A crash in one container affects one interview and nothing else. In 2025, we scaled from a few hundred interviews each week to over ten thousand a day. We made no changes to the hosting. What we did change was how we deal with cold starts.
When a candidate clicks &quot;Start Interview,&quot; the session should begin immediately. A fresh container takes several seconds to boot and spin up a video room — long enough that a candidate would notice they&#39;re waiting. We handle this with a warm pool: Modal keeps about 30 containers pre-booted at the compute level, and a background job running every five minutes keeps about 10 fully initialized — room URL registered in Redis, ready to go. When a session starts, it grabs one in well under 200ms. The harder problem is calibrating pool size: too few and cold starts leak through at peak; too many is waste. We track demand by hour and size the pool ahead of it.
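As a rough illustration of sizing the pool ahead of demand, the sketch below derives a target pool size from historical session starts per hour. The function name, the headroom factor, the minimum size, and the sample numbers are all assumptions made for this example; only the five-minute refill cadence comes from the text above.

```python
import math


def target_pool_size(starts_by_hour, hour, refill_interval_s=300,
                     headroom=1.5, min_size=2):
    """Size the warm pool for the coming hour from historical session starts."""
    next_hour = (hour + 1) % 24
    # Look one hour ahead so the pool grows before demand does.
    expected_starts = max(starts_by_hour.get(hour, 0),
                          starts_by_hour.get(next_hour, 0))
    # Sessions expected to arrive between two runs of the refill job.
    per_interval = expected_starts * refill_interval_s / 3600
    # Headroom absorbs bursts; min_size keeps a few containers warm overnight.
    return max(min_size, math.ceil(per_interval * headroom))


# Illustrative starts per hour, not real traffic data.
history = {3: 6, 11: 60, 12: 72}
midday_target = target_pool_size(history, hour=11)
overnight_target = target_pool_size(history, hour=3)
```

The headroom factor is the knob that trades cold starts at peak against idle containers: raising it costs more compute, lowering it lets more cold starts leak through.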
The same logic applies everywhere else in the stack. Audio runs on <a href="https://daily.co/" target="_blank" rel="noopener noreferrer" class="text-indigo-600">Daily</a> — Pipecat&#39;s WebRTC layer — handling peer connections, media routing, and cloud recording so recordings land in S3. Speech recognition, the LLM, and text-to-speech each run across a mix of commercial APIs and open-source models, with automatic failover at each stage — any individual outage is invisible to the candidate. Whenever we make a config change, we blue-green it over a week, meaning a bad prompt update gets caught well before it ruins a thousand interviews.
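The failover idea can be sketched as an ordered list of providers that is tried until one succeeds. This is a simplified illustration under assumed names (`call_with_failover`, `flaky_tts`, `backup_tts`); it does not reflect the actual provider mix or error handling in production.

```python
class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has failed."""


def call_with_failover(providers, payload):
    """Try each (name, fn) provider in order; return the first success."""
    errors = []
    for name, fn in providers:
        try:
            return name, fn(payload)
        except Exception as exc:  # a real system would catch narrower errors
            errors.append((name, repr(exc)))
    raise AllProvidersFailed(errors)


# Hypothetical stand-ins for a primary and a backup text-to-speech provider.
def flaky_tts(text):
    raise TimeoutError("primary TTS timed out")


def backup_tts(text):
    return f"<audio for {text!r}>"


used, audio = call_with_failover(
    [("primary", flaky_tts), ("backup", backup_tts)], "Hello"
)
```

Running one such chain per stage (speech recognition, LLM, text-to-speech) keeps an outage in any single provider from ending the session.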
<h3>Container count — typical day (Pacific)</h3>
Figure 1: Container count by hour (Pacific Time). Each interview runs in its own isolated container; a warm pool keeps cold-start latency under 200ms. Peak at noon: ~200 containers. Floor overnight: ~80.
<h2>Getting turn-taking right</h2>
Sounding like a human is hard!
A response that would land naturally at 800ms feels like an interruption at 400ms and dead air at 1500ms. The pipeline runs on <a href="https://pipecat.ai/" target="_blank" rel="noopener noreferrer" class="text-indigo-600">Pipecat</a>, an open-source voice AI framework created by Daily, fully streaming end-to-end: speech recognition runs continuously, turn detection fires as soon as silence is detected, and TTS starts synthesizing on the LLM&#39;s first sentence before the rest has finished generating. For turn detection, we use smart-turn-v3 — an ONNX model from the Pipecat team, running on Modal — at P50 ~150ms. Add LLM first-token (~350ms) and TTS first audio (~200ms), and the median time from candidate silence to Monty&#39;s first word is about 700ms. Anything past a second starts to feel broken.
<h3>End-to-end latency — candidate silence to Monty&#39;s first audio</h3>
Figure 2 — Smart-Turn classifies the end of the candidate&#39;s turn (150ms p50) → LLM streams its first token (350ms p50) → TTS produces first audio (200ms p50). STT runs as a continuous stream — no added latency. TTS begins on the LLM&#39;s first sentence while the rest is still generating.
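The latency budget is simple addition, shown here as a small check. The stage values are the p50 figures quoted above; the dictionary keys and the one-second budget constant are naming choices for this sketch. Note that summing per-stage medians only approximates the median of the end-to-end latency.

```python
# p50 latency per stage, in milliseconds, as quoted in the text.
STAGE_P50_MS = {
    "turn_detection": 150,
    "llm_first_token": 350,
    "tts_first_audio": 200,
}
# "Anything past a second starts to feel broken."
BUDGET_MS = 1000

total_ms = sum(STAGE_P50_MS.values())
headroom_ms = BUDGET_MS - total_ms
```

Speech recognition does not appear in the sum because it runs as a continuous stream alongside the candidate's speech.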
If you cut in too early, you end up interrupting a candidate mid-thought; wait too long and it starts to feel like lag. After tuning against candidate self-reported experience and session completion rates, we settled on 900ms as the production threshold: enough pause that candidates feel heard without letting silence linger. Three parameters sit around that threshold: a 120ms floor prevents triggering on mid-sentence pauses, a 1.6s ceiling is the hard fallback, and the setting in between shifts by interview segment. Voice activity detection (VAD) gets some special handling on top. These numbers came out of rounds of A/B testing against session completion rates; none were set once and left alone.
<h3>End-of-turn pause distribution</h3>
Figure 3: Each circle is a candidate&#39;s end-of-turn pause duration. Red dots are candidates Monty interrupts; green dots feel natural; yellow dots are candidates left waiting in silence.
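One way to read the floor, ceiling, and per-segment threshold is as a clamp around a base value. The sketch below follows that reading; the segment names and offsets are invented for illustration, and the real pipeline also uses the smart-turn classifier, which is left out here.

```python
FLOOR_MS = 120             # never respond on pauses shorter than this
CEILING_MS = 1600          # always respond once silence reaches this
BASE_THRESHOLD_MS = 900    # production threshold quoted in the text

# Hypothetical per-segment adjustments, in milliseconds.
SEGMENT_OFFSET_MS = {"intro": -100, "technical": 300, "wrap_up": 0}


def effective_threshold_ms(segment):
    """Base threshold shifted by segment, clamped to [floor, ceiling]."""
    raw = BASE_THRESHOLD_MS + SEGMENT_OFFSET_MS.get(segment, 0)
    return max(FLOOR_MS, min(CEILING_MS, raw))


def should_respond(silence_ms, segment):
    """Decide whether the interviewer should start speaking."""
    if silence_ms < FLOOR_MS:
        return False  # likely a mid-sentence pause
    if silence_ms >= CEILING_MS:
        return True   # hard fallback
    return silence_ms >= effective_threshold_ms(segment)
```

Under these assumed offsets, an 850ms pause triggers a response during the intro but not during a technical question, where candidates tend to need longer to think.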
Threshold tuning only gets you so far. Short acknowledgments — &quot;yes,&quot; &quot;uh-huh,&quot; &quot;got it&quot; — often lack trailing silence: the candidate is done, but voice activity detection (VAD) never fires because the signal is too brief. A 400ms aggregation timeout catches these; without it, Monty waits indefinitely for speech that has already ended.
The other gap is echo: Monty&#39;s TTS output can leak through the candidate&#39;s microphone and get transcribed as candidate speech. We run a simple LLM-based classifier that detects when the candidate is essentially repeating Monty back to itself, and discards those turns. Both are invisible at small scale.
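The production echo filter is an LLM-based classifier. As a much simpler stand-in that shows the shape of the check, the sketch below flags a candidate turn when most of its tokens also appear in what Monty just said. The function names and both thresholds are assumptions, and a token-overlap heuristic will miss cases a classifier would catch.

```python
import re


def _tokens(text):
    """Lowercase word tokens, ignoring punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def looks_like_echo(candidate_text, recent_bot_text, min_tokens=4, overlap=0.8):
    """Return True when the candidate turn mostly repeats the bot's last turn."""
    cand = _tokens(candidate_text)
    if len(cand) < min_tokens:
        # Too short to judge; short acknowledgments are legitimate turns.
        return False
    bot = set(_tokens(recent_bot_text))
    shared = sum(1 for tok in cand if tok in bot)
    return shared / len(cand) >= overlap


bot_turn = "Can you walk me through how you would design a rate limiter?"
echoed = looks_like_echo("walk me through how you would design a rate limiter", bot_turn)
genuine = looks_like_echo("I would start with a token bucket per user", bot_turn)
```

Turns flagged this way would be discarded before they reach the LLM, so Monty does not end up answering its own question.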
<h2>Giving every candidate the right interview</h2>
Mercor runs interviews across hundreds of job categories — engineers, bankers, lawyers, data scientists. The obvious approach is one assessment per job title; we built that first, and it doesn&#39;t scale. Maintaining hundreds of distinct assessments is expensive, and small variations between nearly-identical roles — &quot;backend engineer&quot; vs. &quot;platform engineer&quot; — don&#39;t justify separate configs. For candidates, retaking the same interview for every role they apply to is pointless.
We cluster job listings by the skills they actually test, weighted by candidate volume and hiring outcomes. The clusters are fewer than you&#39;d expect — most job titles, stripped of their labels, test a fairly small number of underlying things. The dominant one is what we call Domain Expert: a single assessment that covers medicine, economics, history, law, software architecture, and almost any other knowledge domain. It accounts for over 70% of all sessions. Add code and language assessments and you&#39;ve covered 90% of volume with three types.
<h3>Sessions per day</h3>
Figure 4 — Session volume treemap (areas ∝ cumulative historical session totals by category; labels show daily counts). Domain Expert Interview is the most common assessment type by a wide margin.
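A toy version of clustering listings by tested skills is sketched below, using greedy grouping on Jaccard similarity of skill sets. The data, the similarity cutoff, and the use of volume only to order the listings are assumptions for illustration; the actual clustering also weights hiring outcomes, which this sketch omits.

```python
def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def cluster_listings(listings, min_similarity=0.5):
    """Greedy single-pass clustering of (title, skills, volume) tuples."""
    clusters = []
    # Visit high-volume listings first so they anchor the clusters.
    for title, skills, volume in sorted(listings, key=lambda item: -item[2]):
        for cluster in clusters:
            if jaccard(skills, cluster["skills"]) >= min_similarity:
                cluster["titles"].append(title)
                cluster["volume"] += volume
                break
        else:
            clusters.append(
                {"skills": set(skills), "titles": [title], "volume": volume}
            )
    return clusters


# Hypothetical listings: (title, skills tested, candidate volume).
listings = [
    ("backend engineer", {"coding", "system design", "apis"}, 500),
    ("platform engineer", {"coding", "system design", "infrastructure"}, 300),
    ("corporate lawyer", {"domain knowledge", "written reasoning"}, 400),
    ("economist", {"domain knowledge", "written reasoning", "statistics"}, 200),
]
clusters = cluster_listings(listings)
```

In this toy data the four titles collapse into two clusters, which mirrors the observation that nearly identical roles do not justify separate assessments.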
Completing Domain Expert once qualifies a candidate for every role in that cluster, with no retakes required. We deploy unified assessments as the default for all listings. Today, more than half the offers on Mercor go out proactively — the candidates didn&#39;t apply to those jobs, they just took an interview some time ago.
The interview is personalized before it starts. We process the candidate&#39;s resume before the session begins, and that context shapes what gets asked, how deeply to probe, and what to skip. For coding interviews, the problem is generated fresh from that profile and the live conversation: a Go engineer and a Python engineer get different starter code; a senior candidate gets a more open-ended problem than a new graduate.
All of this means that interviews are efficient. While we think the AI interviewing experience is fun, we try to make the most of the time our users choose to spend with us.

Lucas Rothman

Richter Brzeski

How Mercor keeps AI interview sessions alive at scale, tunes turn-taking latency, and routes candidates to the right assessments.

Engineering Monty: Scaling an AI Interviewer


<h1>Applied Compute&#39;s model, trained on Mercor’s agentic data, is now top of the APEX-Agents leaderboard in corporate law</h1>
In January, we showed that <a href="https://www.mercor.com/blog/expert-data-drives-model-performance/" target="_blank" rel="noopener noreferrer" class="text-indigo-600">fewer than 1,000 expert-labeled tasks</a> could double an open-source model&#39;s performance on APEX-Agents, our frontier benchmark for cross-application, long-horizon tasks in professional services. The training trendline was near-linear, signaling that more data would keep yielding gains, so we scaled the dataset for <a href="https://appliedcompute.com/" target="_blank" rel="noopener noreferrer" class="text-indigo-600">Applied Compute</a> to almost 2,000 high-quality cases. They post-trained <a href="https://huggingface.co/zai-org/GLM-4.7" target="_blank" rel="noopener noreferrer" class="text-indigo-600">GLM-4.7</a>, and their new model, Applied Compute: Small, is now top of the APEX-Agents leaderboard in corporate law, with a Pass@1 score of 26.6% and a mean score of 54.8%.
The new model places 4th on the APEX-Agents leaderboard overall by mean score, a dramatic improvement on GLM-4.7, which is 17th. This is significant because GLM-4.7 is a strong open-source model, but at 355B parameters (MoE) and a total context length of 200k, it is much smaller than the commercial-grade models that it is now competing with.
<img src="https://cdn.sanity.io/images/h6s14f4z/production/c3765ea997225a0aa54cc431a56f1cf44fcc49c1-2979x1622.png" alt="Scaling up data leads to significantly higher performance gains on APEX Agents." style="width: 100%; border-radius: 10px;" />
<h2>How has the model’s behavior changed?</h2>
To understand what changed, we compared the trajectories of GLM-4.7 and Applied Compute: Small head-to-head. On average, Applied Compute: Small consumes roughly 2 million tokens per trajectory, 4x the token usage of GLM-4.7, and much closer to frontier models like Gemini 3.1 Pro and Claude Opus 4.6. But GLM-4.7 can be served for as little as $0.50 per million input tokens, while Opus 4.6 costs 20× that for long tasks and GPT 5.2 Pro costs 40×. This represents a substantial difference in the economic utility of each dollar spent on tokens.
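To put the price gap in per-trajectory terms, here is a back-of-the-envelope calculation. It assumes every token is billed at the input-token rate and that a frontier model would use the same roughly 2 million tokens per trajectory; both are simplifications for illustration, and the variable names are ours.

```python
GLM_PRICE_PER_M_USD = 0.50   # quoted serving price per million input tokens
SMALL_TOKENS = 2_000_000     # approximate tokens per trajectory quoted above
BASE_TOKENS = SMALL_TOKENS // 4  # GLM-4.7 uses about a quarter as many

# Price multipliers relative to GLM-4.7, as quoted in the text.
PRICE_MULTIPLIER = {"Opus 4.6": 20, "GPT 5.2 Pro": 40}


def trajectory_cost_usd(tokens, price_per_million_usd):
    """Cost of one trajectory if all tokens are billed at a single rate."""
    return tokens / 1_000_000 * price_per_million_usd


small_cost = trajectory_cost_usd(SMALL_TOKENS, GLM_PRICE_PER_M_USD)
base_cost = trajectory_cost_usd(BASE_TOKENS, GLM_PRICE_PER_M_USD)
# Assumption: the frontier model also consumes about 2M tokens per trajectory.
frontier_costs = {
    name: trajectory_cost_usd(SMALL_TOKENS, GLM_PRICE_PER_M_USD * mult)
    for name, mult in PRICE_MULTIPLIER.items()
}
```

Under these assumptions the post-trained model costs about 4x its base model per trajectory, while remaining well below the per-trajectory cost of the frontier models at the same token count.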
<img src="https://cdn.sanity.io/images/h6s14f4z/production/829a6e2bb052f7db9e1d1883d44cfc69f950ad87-2062x1048.png" alt="The probability density function of the number of steps used by Applied Compute: Small and GLM-4.7." style="width: 100%; border-radius: 10px;" />
Applied Compute: Small completes tasks in far fewer steps (43 vs. 72) than its base model. As Figure 1 shows, its step distribution is tightly concentrated, while GLM-4.7 exhibits a long tail extending past 150 steps. This is driven in part by doom-looping behavior, where GLM-4.7 repeatedly attempts broken tool calls or falls into cycles. Applied Compute: Small, by contrast, is more efficient at locating the files it needs—but because those files tend to be large, it incurs substantial token costs when reading them in.
Applied Compute: Small learnt to rely on code execution, using it at least once in 98% of trajectories. It also searches the file system 96% of the time, and typically either reads at least one PDF (84% of the time) or xlsx file (63% of the time). In contrast, GLM-4.7 only uses code execution in 19% of trajectories and often fails to read in the right documents. This is a critical limitation given that APEX-Agents requires navigating the file system to complete each task.
<h2>Has the model lost any capabilities?</h2>
We evaluated GLM-4.7 and Applied Compute: Small on two industry standard benchmarks, HLE and GPQA, to measure whether Applied Compute: Small has regressed, potentially losing other valuable capabilities while improving at APEX-Agents. We see a small improvement on GPQA of 0.4pp and a small decrease on HLE of 1.3pp. Neither of these differences is statistically significant and, acknowledging that we only tested on two benchmarks, we did not find evidence of Applied Compute: Small losing general capabilities.
<img src="https://cdn.sanity.io/images/h6s14f4z/production/5a31a16a5d8ed234927c71ab8989ac7b25d4b070-1466x372.png" alt="Performance of GLM-4.7 and Applied Compute: Small on industry benchmarks. Note: Reference benchmark scores were taken directly from https://z.ai/blog/ [1]" style="width: 100%; border-radius: 10px;" />
We also looked at the length of the final outputs. On APEX-Agents, the average number of tokens in Applied Compute: Small’s final outputs is ~3x that of GLM-4.7 (1,956 vs 631). This is only a moderate increase, which suggests that Applied Compute: Small is not just scattergunning information and hoping for the best. The difference is also inflated by the large number of cases where GLM-4.7 responds with a short statement that it cannot find the right files.
<h1>Conclusion</h1>
In January, we showed that fewer than 1,000 expert-labeled tasks could nearly double an open-source model&#39;s performance on APEX-Agents. By scaling to 2,000 cases, Applied Compute: Small now leads the corporate law leaderboard outright, outperforming models that are orders of magnitude more expensive to serve. 
This result reinforces a broader thesis at Mercor: quality expert data, paired with the right post-training infrastructure, can close gaps in model performance. The behavioral changes we observe (more effective tool use, fewer wasted steps, richer final outputs) suggest that the model is not just memorizing patterns but learning the requirements of professional work.
For Mercor, post-training is a vital part of our stack. It validates the quality of our data, helps us allocate experts to produce measurable and generalizable performance gains, and drives our failure analyses, letting us understand not just how often models fail, but why, and how to fix it. To learn more about <a href="https://arxiv.org/abs/2601.14242" target="_blank" rel="noopener noreferrer" class="text-indigo-600">APEX-Agents</a>, data quality, and post-training at Mercor, contact apex@mercor.com.
Applied Compute helps enterprises build and own AI agents trained on proprietary data, aligned to specific workflows, and designed to continuously learn. To learn more, reach out to team@appliedcompute.com.
<h1>Footnotes</h1>
[1] HLE and GPQA are reported as avg@1 and avg@3 respectively. 95% confidence intervals are calculated naively using the normal approximation to the binomial.
[2] Base models were evaluated using the ReAct agent harness, as introduced by Yao et al. (2022) (<a href="https://arxiv.org/abs/2210.03629" target="_blank" rel="noopener noreferrer" class="text-indigo-600">arXiv:2210.03629</a>). While this richer scaffolding often improves raw task performance, it is not conducive to training or efficient inference as the system prompt changes throughout the trajectory. As such, post-trained models were trained and evaluated using a simple loop agent architecture. For implementation, see here: https://github.com/Mercor-Intelligence/archipelago
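For readers who want to reproduce the interval in footnote [1], the sketch below computes a 95% confidence interval with the normal approximation to the binomial, plus a simple two-proportion check of whether a difference is significant. The function names and the example counts are hypothetical; the actual sample sizes are not stated here.

```python
import math

Z_95 = 1.96  # two-sided 95% critical value of the standard normal


def binomial_ci95(successes, n):
    """95% CI for a proportion using the normal approximation."""
    p = successes / n
    half_width = Z_95 * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


def diff_is_significant(s1, n1, s2, n2):
    """True when two proportions differ by more than 1.96 standard errors."""
    p1, p2 = s1 / n1, s2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return abs(p1 - p2) > Z_95 * se


# Hypothetical counts, for illustration only.
low, high = binomial_ci95(50, 100)
small_gap = diff_is_significant(100, 200, 102, 200)
large_gap = diff_is_significant(100, 200, 140, 200)
```

With a few hundred questions per benchmark, differences of a point or so fall well inside the interval, which is consistent with the small GPQA and HLE changes not being statistically significant.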

Bertie Vidgen

Researcher

Isaac Robinson

Michael Haines

Product Manager

Applied Compute's model, trained on Mercor’s agentic data, is now top of the APEX-Agents leaderboard in corporate law

Scaling Data leads to SOTA Legal Performance on APEX-Agents

<h3>Luna Aizarani has spent her career within expert marketplaces. She joined Mercor as General Manager of Growth to build the systems that connect human expertise to opportunities at global scale.</h3>
I’ve spent over a decade operating on the frontlines of how organizations find and deploy talent, across labor marketplaces and talent intelligence platforms. As AI reshapes entire industries and accelerates economic change, I consistently encountered the same tension: the type of talent in demand was evolving faster than the systems designed to identify and deploy it. Highly capable people were underutilized, while organizations struggled to translate ambition into execution.
Across tech, private capital, and workforce development organizations, the same pattern repeated itself. Companies raced to deploy AI systems faster than their organizations could absorb them, while skilled experts sat on the sidelines or were misallocated. From where I sat, the issue wasn’t intelligence or motivation, it was infrastructure.
Throughout my career, I’ve worked as a commercial leader in high-growth environments. I spent years scaling revenue engines inside professional services organizations, managing global teams, owning P&amp;Ls, and turning underperforming units into durable businesses. More recently, I worked in workforce development and partnerships, helping large institutions respond to rapid shifts in skills, automation, and AI literacy. I found the work meaningful, but I increasingly felt the impact was incremental relative to the scale of the change underway.
When I started talking to Mercor, it felt like someone had finally named the problem clearly and, more importantly, was already building practical solutions to address it.
The AI economy isn’t constrained by ideas; it’s constrained by execution. Models are only as good as the data that trains them, and that data increasingly depends on deep human expertise: engineers, researchers, and domain specialists who can reason, evaluate, and create at a high level. Mercor is the company that identifies real expertise, deploys it quickly, and operates globally. The shift from ad hoc sourcing to programmable talent is what makes the company so central to the future of the labor market. That framing immediately resonated with me.
In my role, I work directly inside that shift. I develop systems that expand access to expertise, connect the right talent to meaningful work, and make complex, human-heavy workflows repeatable. Over time, that responsibility has grown into owning the broader talent pipeline across the company, focusing on ensuring our talent network keeps pace with demand and that experts have a clear, supportive path to contribute. It’s part growth, part operations, part product thinking.
One of the most meaningful aspects of my job has been engaging with people whose expertise isn’t always captured by traditional credentials. I’ve spent time with athletes and coaches, for example, learning how their real-time decision-making, pattern recognition, and tactical judgment can inform how AI systems learn and improve. I’ve seen similar value from farmers, therapists, and sales managers: people whose knowledge compounds through lived experience in the field. We are creating pathways for expertise to shape the systems increasingly influencing how work gets done.
What surprised me most after joining Mercor wasn’t just the team’s ambition, but the rigor of execution. From my earliest conversations with the founders, it was clear that Mercor operates with unusually high standards, very little red tape and a strong culture of ownership. Within days of joining, I was responsible for addressing supply constraints across multiple workstreams, proposing structural fixes, and moving quickly from assessment to execution.
I’m excited about what’s next in my role at Mercor. This includes deepening strategic partnerships, making the expert experience more seamless, and continuing to adapt our systems alongside a rapidly changing labor market. As AI reshapes how work is created and distributed, the opportunity is to build infrastructure that works well for both partners and experts.
For those considering Mercor and feeling unsure, the work here tends to suit people who enjoy building, take real ownership, and are motivated by driving tangible results.
Explore careers at Mercor: <a href="http://www.mercor.com/careers" target="_blank" rel="noopener noreferrer" class="text-indigo-600">www.mercor.com/careers</a>

Luna Aizarani, GM of Growth at Mercor, shares how she’s building systems that connect real human expertise to meaningful work at global scale.

Why I Joined Mercor: Building Infrastructure to Scale Expertise 

Why I Joined Mercor – Luna Aizarani

<h3>Peter Zhang, Mercor Engineering Lead, shares why he transitioned from a career in quantitative finance to be at the forefront of AI.</h3>
Why do we pay quantitative traders <a href="https://work.mercor.com/jobs/list_AAABmpZDfV9yE40R8g1Hw4j3/quantitative-traders" target="_blank" rel="noopener noreferrer" class="text-indigo-600">$300/hr</a>?
We pay them to resolve challenging ambiguity—questions about the world that are difficult to scope and harder still to answer. Markets cannot afford to be wrong about the future of financial capital. Trillions of dollars swing on how the world will look in one, five, and twenty-five years, making markets early bells for world-altering shifts.
The bells have been ringing. AI is the question mark of the decade. Markets are scrambling to assess the scale of the buildout, the potential productivity gains, the ultimate winners. Machine intelligence is rippling through industry and everyday human life, raising new questions about our future. Last year, from the trading floor of Jane Street, I felt the gravity of these questions and frustration at the limits of a trading terminal. I wanted to get closer to the ambiguity.
I wanted to study what AI would mean for human capital. Quantitative finance understands talent better than any industry. Firms live and die by identifying, measuring, and utilizing talent. Well before Jane Street, I wondered whether we could encode our heuristics about how to do work into rules and models. These questions have always been considered too complex, too human, to systematize—so we&#39;ve built entire fields of organizational psychology and management theory on intuition rather than data, surrendering to subjective judgment what should be measured and understood.
In early 2025, I heard from a former colleague about a startup called Mercor. They were taking the best experts from around the world and pushing the frontier of model capabilities. They were hiring thousands of experts every month across hundreds of industries and verticals. They were operating at a scale where they couldn’t afford to be wrong about human capital. It was the perfect frontier.
Mercor moves fast. I left New York in April. By May, I was describing my work to a friend. &quot;How can you agree to work on something that might not even be possible?&quot; he asked. &quot;There’s no roadmap!&quot; I was working on assessments: how to move past resumes and directly measure skill. These problems felt more like research papers than product specs: What makes an interview “good”? How can AI <a href="https://talent.docs.mercor.com/support/ai-interview" target="_blank" rel="noopener noreferrer" class="text-indigo-600">run and score interviews</a> at scale? How do you quantify performance when you&#39;re hiring across industries as different as software engineering and law? &quot;The ambiguity,&quot; I told him, &quot;is exactly why I’m here.&quot;
Since then, Mercor has run over a million AI interviews and placed thousands of people into roles they wouldn&#39;t have found otherwise. Traditional screening would have filtered out many of our top performers—the self-taught engineer in Lagos, the Portuguese lawyer. Our models found them and they&#39;re excelling.
These days, I spend my time engineering solutions to new questions. How do we structure teams that multiply rather than average individual brilliance? How do we measure who&#39;s truly excelling when simple metrics fail? And how can we automate these processes without losing the nuance that makes them work?
AI is the new center of gravity for the brightest minds. If you’re excited to face challenging ambiguity at the intersection of AI and how we do work, I’d love to <a href="https://www.mercor.com/careers/" target="_blank" rel="noopener noreferrer" class="text-indigo-600">work with you</a>.

An Engineering Lead at Mercor shares why he left quantitative finance to build AI systems for measuring human talent at scale.

Why I Joined Mercor: From Quant Finance to Engineering

Why I Joined Mercor – Peter Zhang

States that Summit's shareholder distributions are disproportionate

States that Summit's disproportionate distributions violate US tax code

States that a nonresident alien receiving shares violates tax code

<h2>How Mercor’s data and Applied Compute’s long-horizon RL unlock real capability gains for AI models</h2>
Mercor partnered with <a href="https://appliedcompute.com/" target="_blank" rel="noopener noreferrer" class="text-indigo-600">Applied Compute</a> to post-train an open-source model using one of our expert-labeled dev sets, resulting in substantial performance gains on the <a href="https://www.mercor.com/apex/apex-agents-leaderboard/" target="_blank" rel="noopener noreferrer" class="text-indigo-600">APEX-Agents</a> benchmark.
With fewer than 1,000 high-quality data points from Mercor, the post-trained model&#39;s Pass@1 and mean score nearly doubled. On the corporate law evals, the Pass@1 score tripled. The training trendline is near-linear, indicating that additional data would likely keep yielding performance gains.
<img src="https://cdn.sanity.io/images/h6s14f4z/production/c15fd60509a1f5ad2dbd7209a4d47d68a0ea85da-1860x510.png" alt="APEX Score across training showing consistent improvement" style="width: 100%; border-radius: 10px;" />
APEX-Agents assesses whether agents can execute the real day-to-day work of investment banking analysts, management consultants, and corporate lawyers. There are 480 tasks, created by a team of Mercor experts: Vice Presidents, Managing Directors, and Managers with 10+ years of experience at top-tier firms. They simulated the demands of the profession to challenge agents to navigate instructions, use a range of applications, manage complicated file systems, and plan over long horizons. See the <a href="https://arxiv.org/abs/2601.14242" target="_blank" rel="noopener noreferrer" class="text-indigo-600">technical report</a> for more details. You can also download the dataset and all of the files from Mercor’s <a href="https://huggingface.co/datasets/mercor/apex-agents" target="_blank" rel="noopener noreferrer" class="text-indigo-600">Hugging Face</a> and our infra service, Archipelago, from our <a href="https://github.com/Mercor-Intelligence/archipelago" target="_blank" rel="noopener noreferrer" class="text-indigo-600">GitHub</a>. For this experiment we used the full benchmark (n=480), measuring Pass@1, Pass@3, and mean criteria passed.
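As a rough illustration of the three metrics named above, the sketch below computes Pass@k and a mean score from per-criterion pass/fail results. It rests on assumptions that are not taken from the technical report: an attempt counts as a pass only when every criterion passes, the mean score is the average fraction of criteria passed, and Pass@k uses the combinatorial estimator that is common in code-generation evals. The function names and the `runs` layout are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n attempts, of which c passed."""
    if k > n:
        raise ValueError("k cannot exceed the number of attempts")
    if n - c < k:
        # Every size-k subset of attempts contains at least one pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def summarize(runs: dict[str, list[list[bool]]], k: int) -> dict[str, float]:
    """runs maps a task id to its attempts; each attempt is a list of
    per-criterion pass/fail results (hypothetical layout)."""
    pass_k, mean_scores = [], []
    for attempts in runs.values():
        n = len(attempts)
        # Assumption: an attempt passes only if all of its criteria pass.
        c = sum(all(attempt) for attempt in attempts)
        pass_k.append(pass_at_k(n, c, k))
        # Assumption: mean score = average fraction of criteria passed.
        mean_scores.append(sum(sum(a) / len(a) for a in attempts) / n)
    return {
        f"pass@{k}": sum(pass_k) / len(pass_k),
        "mean_score": sum(mean_scores) / len(mean_scores),
    }


runs = {
    "task_a": [[True, True], [True, False], [True, True]],
    "task_b": [[False, False, True]] * 3,
}
print(summarize(runs, k=1))
print(summarize(runs, k=3))
```

The example shows why the two headline numbers can diverge: task_b never passes outright, yet it still contributes partial credit to the mean score.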

<h1>Training setup</h1>
Mercor provided a dev set of 874 tasks for post-training, split across 50 unique “worlds” of data that represent a scenario. Each world has data and applications that can be found in a real enterprise environment, such as Google Sheets, Docs, and Code Execution. None of the tasks or worlds appear in the APEX-Agents benchmark.
Applied Compute deployed its proprietary long-horizon RL stack to stress-test data quality under realistic training dynamics and measure whether gains transferred to the hardest APEX-Agents tasks. Training was performed single-epoch with no SFT warmup, no filtering, and no task or rubric modifications. Two held-out worlds per domain were reserved for validation to detect overfitting.
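Holding out whole worlds, rather than individual tasks, keeps validation tasks from sharing files and context with training tasks. A minimal sketch of that kind of split is below. The field names (`task_id`, `world_id`, `domain`) and the seeded random selection are assumptions for illustration, not a description of Applied Compute's stack, and the sketch assumes each world belongs to exactly one domain.

```python
import random
from collections import defaultdict


def split_by_world(tasks, held_out_per_domain=2, seed=0):
    """Reserve whole worlds per domain for validation.

    tasks: list of dicts with hypothetical keys 'task_id', 'world_id', 'domain'.
    Returns (train_tasks, validation_tasks).
    """
    worlds_by_domain = defaultdict(set)
    for task in tasks:
        worlds_by_domain[task["domain"]].add(task["world_id"])

    rng = random.Random(seed)  # seeded so the split is reproducible
    held_out = set()
    for domain in sorted(worlds_by_domain):
        worlds = sorted(worlds_by_domain[domain])
        count = min(held_out_per_domain, len(worlds))
        held_out.update(rng.sample(worlds, count))

    train = [t for t in tasks if t["world_id"] not in held_out]
    validation = [t for t in tasks if t["world_id"] in held_out]
    return train, validation


# Toy data: 3 domains x 5 worlds x 2 tasks per world.
tasks = [
    {"task_id": f"{d}-{w}-{i}", "world_id": f"{d}-w{w}", "domain": d}
    for d in ("law", "banking", "consulting")
    for w in range(5)
    for i in range(2)
]
train, validation = split_by_world(tasks)
print(len(train), len(validation))
```

Because the split is made at the world level, a validation score that lags the training score points to overfitting on world-specific details rather than to leakage between the two sets.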
Applied Compute evaluated frontier and open-weight models on APEX-Agents to establish baseline capability and identify where learning was possible. GLM 4.6, a mid-scale 355B-parameter MoE model with 32B active parameters, emerged as an appropriate model to start from, offering the right tradeoff between iteration speed and baseline competence. It scores 3.8% on Pass@1 and 12.1% on mean score, which is typical for an open-source model on APEX-Agents.
<h1>Results and continued training gains</h1>
The post-trained model outperforms the baseline across all metrics, with the largest gains in corporate law (Table 1). These improvements come from just 874 expert-labeled tasks. In a low-data regime, the dominant risk is not under-training the model, but misallocating scarce expert effort.
RL training and evaluation reduced this risk by turning each run into a high-signal measurement. By running end-to-end RL with detailed behavioral observability, Applied Compute enabled Mercor to see how much benefit was being delivered by the data, demonstrating that hundreds, not tens of thousands, of examples are sufficient to drive real gains.
Full trajectory-level observability showed how models attempted tasks, allowing Applied Compute to distinguish learning from undesirable behaviors like refusal or premature termination that can be masked in aggregate metrics. Targeted ablations over reasoning budget, tool access, and training configuration were evaluated against optimized prompt baselines to isolate training gains. When issues were identified, fixes were validated through retraining without adding new data. Training eval trendlines showed consistent improvement. Gains across law, banking, and consulting were close to linear, which is atypical in data-limited settings and signals strong alignment between the data and target capabilities. Rapid feedback from training runs allowed expert effort to be redirected toward concrete gaps instead of being expanded uniformly. We anticipate that similar amounts of high-quality data will continue to deliver benefits.
<img src="https://cdn.sanity.io/images/h6s14f4z/production/0b5b233542dc1f4af8354bc7dabd5d472f4efb24-3840x2160.png" alt="Training trendline based on APEX Score" style="width: 100%; border-radius: 10px;" />
<h1>Example trajectory</h1>
One of our worlds (see below) shows a corporate lawyer on a due diligence project, tasked with analyzing whether the company’s shareholder payouts comply with US tax law. The world contains a full suite of corporate records: shareholder agreements, the distribution schedule, the original S-Corp election filing form, and a reference copy of the tax code that governs S-Corporations.
The baseline model produces a professional-looking but factually flawed memo that incorrectly states that the corporation is in substantial compliance with the US tax code. It used 18 tool calls and worked linearly: it listed files, then read the distribution schedule and tax returns before submitting its final answer. The baseline model took the documents at face value and justified compliance despite evidence to the contrary. It correctly found the shareholder and payment records, but assumed that the distributions matched the ownership stakes rather than checking.
In contrast, the post-trained model correctly states that the company is non-compliant. It also finds that one shareholder was overpaid in 2022 and 2024, correctly citing 26 USC §1366(a)(1) regarding the failure of pro-rata allocations, and states that the transfer to one individual violates §1361(b)(1)(C). The post-trained model also improved in its reasoning and planning. It spent 9 steps just on reading specific US tax code sections, then executed several Python scripts to check the math on the distributions. It even performed a self-correction when it realized it hadn&#39;t completed its checklist.
<h1>Conclusion</h1>
This experiment demonstrates that quality datasets can push open-source models to unlock new capabilities, adopting the rigorous approach required for high-stakes professional work. With under 1,000 tasks and one epoch of training, we saw dramatic improvements and have high confidence in further gains from adding more data. The benefits of expert-created specialized data will only increase as training stacks adapt and models become more capable. Run evals yourself on <a href="https://huggingface.co/datasets/mercor/apex-agents" target="_blank" rel="noopener noreferrer" class="text-indigo-600">APEX-Agents</a> now.
<h1>Example prompt with responses from the baseline and post-trained models</h1>

How Mercor's data and Applied Compute's long-horizon RL unlock real capability gains for AI models

Expert data drives model performance

Today, we’re introducing <a href="https://www.mercor.com/apex/apex-agents-leaderboard/" target="_blank" rel="noopener noreferrer" class="text-indigo-600">APEX-Agents</a>, our new benchmark designed to test how well AI agents complete real, long-horizon tasks in investment banking, consulting, and corporate law.
It’s the latest addition to the AI Productivity Index (APEX), Mercor’s family of benchmarks that measure economically valuable capabilities, joining <a href="https://www.mercor.com/apex/apex-v1-leaderboard/" target="_blank" rel="noopener noreferrer" class="text-indigo-600">APEX</a> for tool-free evals and <a href="https://www.mercor.com/apex/ace-leaderboard/" target="_blank" rel="noopener noreferrer" class="text-indigo-600">ACE</a> for consumer applications.
With APEX-Agents, we set out to find out whether today’s AI agents can do economically valuable tasks. Are they ready to work with teams, in real software tools, and deliver client-ready work?
Workplace context is messy, incomplete, and spread across documents and chat threads. Tasks take hours, not seconds.
Most existing benchmarks don’t reflect that. They evaluate models on isolated prompts or narrow skills. They don’t measure whether an agent can navigate multiple workflows and produce something a manager or client would accept.
That is why we approached APEX-Agents differently. Every task is designed to mirror the complex work that professionals do and the tasks they wish an AI agent could help with. This let us identify cases where current models do not perform reliably.
<iframe src="https://player.vimeo.com/video/1156906982" frameborder="0" allow="autoplay; fullscreen; picture-in-picture; clipboard-write; encrypted-media; web-share" referrerpolicy="strict-origin-when-cross-origin" title="Mercor APEX-Agents" class="mb-8 mt-4 aspect-video w-full max-w-[1080px] rounded-lg bg-gray-100 object-cover sm:mt-6"></iframe>
<h2>Our approach</h2>
APEX-Agents simulates the demands of the profession to challenge agents to navigate instructions, manage complicated file systems, and produce outputs that justify a professional fee.
Our four-step approach:
Surveys: We started out by surveying hundreds of experts from professional services including Goldman Sachs, McKinsey, and Cravath to understand how they spend their time.
Scenarios: A team of Mercor experts—Vice Presidents, Managing Directors, and Managers with five-to-ten years’ experience at top-tier firms—worked in Google Workspace, simulating how coworkers would collaborate on a project.
This might look like a week-long consulting project for a fictitious European oil &amp; gas company focused on cost-cutting measures.
We worked with <a href="https://www.box.com/" target="_blank" rel="noopener noreferrer" class="text-indigo-600">Box</a> to define what a real-world file system looks like. We mapped rigorous, domain-specific challenges to the complex file structures found in a mix of industries, resulting in datarooms that truly simulate the daily workflow of a professional.
The result is a high-context workspace that mimics what professionals navigate every day.
Task creation: Using these custom scenarios, the experts defined specific tasks and the exact grading criteria that define what “client-ready” means. Each task includes 1–10 pass/fail criteria.
<a href="https://www.harvey.ai/" target="_blank" rel="noopener noreferrer" class="text-indigo-600">Harvey AI</a> provided early feedback on the design, scope, and realism of the corporate law worlds. They confirmed the complexity and value of the tasks in APEX-Agents, validating that they reflect the work of top lawyers at Fortune 500 enterprises and law firms.
Evals: We deployed AI agents inside these worlds using our open source evaluation infrastructure, Archipelago, to measure whether they finish the work correctly.
<img src="https://cdn.sanity.io/images/h6s14f4z/production/7bc182cda45d843f72fa787ae2fc5e49c2bcbd63-2756x1230.png" alt="APEX-Agents" style="width: 100%; border-radius: 10px;" />
<h2>The findings</h2>
Frontier models successfully complete less than 25% of tasks that would typically take professionals hours. With multiple attempts, performance improves but the gap remains large. Even with 8 tries, the best agents can only complete 40% of the tasks.
Many agents fail not due to lack of capability, but because they can’t manage ambiguity, find the right file, or hold context across the entire workflow.
No model is ready to replace a professional end-to-end.
<img src="https://cdn.sanity.io/images/h6s14f4z/production/f0ac6afea54245c059dc1cacb8fbe316c936e850-3840x2142.png" alt="" style="width: 100%; border-radius: 10px;" />
<h2>Open source</h2>
We have released the entire benchmark as open source via <a href="http://huggingface.co/datasets/mercor/apex-agents" target="_blank" rel="noopener noreferrer" class="text-indigo-600">Hugging Face</a> with a CC-BY license. Archipelago, our infra, eval service, and set of apps, is available as an open source repo on <a href="https://github.com/Mercor-Intelligence/archipelago" target="_blank" rel="noopener noreferrer" class="text-indigo-600">GitHub</a>.
Read the full research paper on <a href="https://arxiv.org/pdf/2601.14242" target="_blank" rel="noopener noreferrer" class="text-indigo-600">arXiv</a>.

Osvald Nitski

Mercor APEX-Agents is a benchmark designed to test how well AI agents complete real, long-horizon tasks in investment banking, consulting, and corporate law.

Introducing APEX-Agents

We created the first version of the AI Productivity Index (APEX) to assess whether frontier models can perform high-value work in Investment banking, Management consulting, Law, and Medicine. We found that even the very best models struggle on complex real-world tasks, failing to meet the production bar.
Mercor has now doubled the size of the held-out evaluation set in APEX from n=200 to n=400. This larger eval set allows us to more consistently evaluate frontier models’ ability to perform tasks that create economic value. The design of the cases remains the same (comprising prompts, source documents, and a grading rubric) but we have increased their complexity and variety. On average, tasks in APEX take seasoned professionals over two and a half hours to complete in the real world. The contributors to APEX typically had over 7 years of experience.
We have also refined the evaluation methodology in APEX, increasing the number of runs we execute to 8, simplifying the grading process, and adding confidence intervals to our results. Read the <a href="https://arxiv.org/abs/2509.25721" target="_blank" rel="noopener noreferrer" class="text-indigo-600">technical report</a> to find out more about our approach to evaluation and how the dataset was created.
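One common way to attach confidence intervals to a benchmark score of this kind is a percentile bootstrap over tasks, sketched below. This is an illustrative assumption about how such intervals can be computed, not the procedure from the technical report. The function name, the 0-to-1 score scale, and the choice to average each task's runs before resampling are all hypothetical.

```python
import random
from statistics import mean


def bootstrap_ci(task_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap over tasks.

    task_scores: one list of per-run scores (each in [0, 1]) per task,
    for example 8 runs per task. Returns (point_estimate, lower, upper).
    """
    # Average the repeated runs first so each task contributes one value.
    per_task = [mean(runs) for runs in task_scores]
    n = len(per_task)
    rng = random.Random(seed)  # seeded for reproducibility

    # Resample tasks with replacement and record the mean of each resample.
    resampled = sorted(
        mean(rng.choices(per_task, k=n)) for _ in range(n_resamples)
    )
    lower = resampled[round((alpha / 2) * n_resamples)]
    upper = resampled[round((1 - alpha / 2) * n_resamples) - 1]
    return mean(per_task), lower, upper


# Toy data: 11 tasks, 8 identical runs each, scores from 0.0 to 1.0.
toy_scores = [[i / 10] * 8 for i in range(11)]
point, lower, upper = bootstrap_ci(toy_scores)
print(f"mean={point:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

Resampling at the task level treats the tasks as the main source of uncertainty. The repeated runs reduce noise within each task but do not add independent tasks, so they are averaged rather than pooled.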
<h2>Results</h2>
The best performing model (GPT 5, Thinking = High) has the highest mean score at 67%, followed by Gemini 3 Pro (Thinking = High) at 64.3% and Grok 4 at 63.5%. Models&#39; scores are lowest on Investment banking, with the highest scoring model achieving 63.0%, followed by Management consulting (top score = 64%), Medicine (top score = 65.5%), and, with substantially higher scores, Law (top score = 77.9%). APEX shows that models struggle with real-world tasks that professionals undertake every day. The new leaderboard is available <a href="https://www.mercor.com/apex/apex-v1-leaderboard/" target="_blank" rel="noopener noreferrer" class="text-indigo-600">here</a>.
Gemini 3 Pro (Thinking = High), released in November 2025, is a substantial improvement on 2.5, beating it out by over 5 percentage points overall, and on Investment banking by 9 percentage points. Similarly, we see Opus 4.5 (Thinking = On), also released in November, beating Sonnet 4.5 and Opus 4.1 by 6 percentage points and 12 percentage points respectively. These substantial improvements speak to a meaningful step forward in models’ ability to perform high-value tasks. GPT 5.1 is an exception and does not improve on GPT 5 (which remains the leaderboard champion), but this is perhaps not surprising given that 5.1 is primarily meant to be more conversational and explanative, rather than better at complex reasoning. Finally, given it is the only model without an explicit reasoning setting, Grok 4 is remarkably strong and comes 4th overall.
If you want to add your model to the leaderboard or run a loss analysis, contact the Mercor Applied AI Research team at apex@mercor.com.
<h2>Open source eval set</h2>
We are releasing n=100 open-source cases (APEX-v1-devset) on <a href="https://huggingface.co/datasets/mercor/APEX-v1-extended" target="_blank" rel="noopener noreferrer" class="text-indigo-600">Hugging Face</a> with a CC-BY licence for anyone to train on, evaluate against, and research. The open-source cases were created by the same annotators through the same production pipeline. We are also open-sourcing our <a href="https://github.com/Mercor-Intelligence/apex-evals/tree/main/apex-oss-eval" target="_blank" rel="noopener noreferrer" class="text-indigo-600">evaluation harness</a> so you can exactly match our grading approach. We are looking forward to seeing what the community builds and would love to hear from any researchers using our data.

Doubling the size of the AI Productivity Index to better measure AI models’ economic value.


Expanding the Mercor AI Productivity Index

Expanding the AI Productivity Index (APEX)

Today, we&#39;re releasing our first version of the AI Consumer Index (ACE).
ACE tests what people actually want AI to do—from finding a gift for a friend to getting a custom recipe recommendation or fixing a hole in their drywall.
ACE contains realistic and challenging evals, split across shopping, food, gaming and DIY, created by experts from the Mercor platform. We are excited to share our new <a href="https://www.mercor.com/apex/ace-leaderboard/" target="_blank" rel="noopener noreferrer" class="text-indigo-600">leaderboard</a>, <a href="https://arxiv.org/abs/2512.04921" target="_blank" rel="noopener noreferrer" class="text-indigo-600">technical report</a>, <a href="https://huggingface.co/datasets/mercor/ACE" target="_blank" rel="noopener noreferrer" class="text-indigo-600">open source dataset</a>, and <a href="https://github.com/Mercor-Intelligence/apex-evals" target="_blank" rel="noopener noreferrer" class="text-indigo-600">eval harness</a>.
<h2>Results</h2>
We show that models routinely fail on consumer tasks—GPT 5.1 (Thinking = High) is the top model but scores only 56.1%. The next best models are GPT 5 (Thinking = High) and o3 Pro (Thinking = On). The best performing model from Google is Gemini 3 Pro (Thinking = High) and from Anthropic is Opus 4.5 (Thinking = On), showing how the latest model releases are steadily improving at consumer tasks.
We see substantial differences in model performance across the four domains. No models score over 50% on Shopping tasks, an opportunity worth $5+ trillion globally – but models perform much better in Food. GPT 5 (Thinking = High) hits an impressive 70%, beating out the next best model by 10 percentage points. In DIY and Gaming we see the biggest range, with model scores ranging from 28% to 61%. Beyond these high-level stats we see a lot of tasks where models fail, often scoring under 25%.
<img src="https://cdn.sanity.io/images/h6s14f4z/production/e93be49858bb82d553eac79045d8872cd61082ca-1974x2074.png" alt="Mercor AI Consumer Index (ACE)" style="width: 100%; border-radius: 10px;" />
<h2>Grading approach</h2>
We introduce a novel rubric-based evaluation methodology with ACE.
Each task has a rubric of prompt-specific criteria to scalably grade model responses. Every rubric contains at least one hurdle criterion that must be passed before further rewards can be unlocked. The hurdles require the model to meet the user’s core objective, such as returning the requested type of product in Shopping or providing a solution to the user&#39;s problem in DIY. The hurdles are important for minimizing the risk of reward hacking. We do not want to reward responses that are mostly irrelevant but still meet a specific requirement (e.g., the response returns any item under $50).
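The hurdle idea can be sketched as a gate in front of the rest of the rubric. The code below is a simplified illustration under stated assumptions, not the ACE grading harness: the `Criterion` type is hypothetical, a failed hurdle is assumed to zero out the whole response, and the remaining criteria are assumed to be equally weighted.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    description: str
    passed: bool
    is_hurdle: bool = False


def hurdle_gated_score(criteria: list[Criterion]) -> float:
    """Return 0.0 unless every hurdle passes, else the fraction of criteria passed."""
    hurdles = [c for c in criteria if c.is_hurdle]
    if not hurdles:
        raise ValueError("rubric needs at least one hurdle criterion")
    if not all(c.passed for c in hurdles):
        # Assumption: a failed hurdle unlocks no further reward.
        return 0.0
    return sum(c.passed for c in criteria) / len(criteria)


# A response that returns the wrong type of product but happens to be under $50.
off_target = [
    Criterion("Returns the requested type of product", passed=False, is_hurdle=True),
    Criterion("Item costs under $50", passed=True),
]
# A response that meets the core objective but misses one detail.
on_target = [
    Criterion("Returns the requested type of product", passed=True, is_hurdle=True),
    Criterion("Item costs under $50", passed=True),
    Criterion("Item ships within a week", passed=False),
]
print(hurdle_gated_score(off_target))
print(hurdle_gated_score(on_target))
```

Here the off-target response scores 0.0 even though it satisfies the price requirement, which is the reward-hacking case the hurdles are meant to block.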
We also add grounding criteria that penalize models for making claims that are not supported by the retrieved web sources (i.e., hallucinating) or providing non-working links. These account for 42% of Gaming criteria and 74% of Shopping criteria, and do not appear in DIY or Food tasks. GPT 5.1 (Thinking = High) is the most grounded model, failing just 29% of the grounding criteria, whereas Gemini 3 Pro (Thinking = High) is the least grounded, failing on 62%.
<img src="https://cdn.sanity.io/images/h6s14f4z/production/080cfaea1586645252936f4e259b6437e81e6098-1744x1108.png" alt="" style="width: 100%; border-radius: 10px;" />
<h2>Loss analysis</h2>
We built ACE to provide fine-grained insight into the performance of models – and every item in the rubrics is tagged with the criteria type, which can be used for loss analysis of model performance.
For instance, ACE shows that models perform well at meeting quantity requirements (nearly all models score 80% on this criteria type in Food) but they are much worse at meeting nuanced requests, like compatibility requirements in Gaming (most models score under 40%) or providing suitable safety warnings in DIY (most models score under 50%). Some criteria types, like providing working links or giving the price, are so difficult that the mean model score is actually negative.
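A loss analysis of this kind amounts to grouping graded criteria by their type tag and averaging. The sketch below assumes a simple scoring scale of +1 for a pass, 0 for a fail, and -1 for a penalized failure (such as a non-working link), which is how a mean can end up negative. The scale, the tag names, and the function name are illustrative assumptions rather than the ACE schema.

```python
from collections import defaultdict


def mean_score_by_type(graded):
    """graded: iterable of (criteria_type, score) pairs.

    Returns a dict of mean score per criteria type, weakest type first.
    """
    totals = defaultdict(lambda: [0.0, 0])  # type -> [score sum, count]
    for criteria_type, score in graded:
        totals[criteria_type][0] += score
        totals[criteria_type][1] += 1
    means = {ctype: total / count for ctype, (total, count) in totals.items()}
    return dict(sorted(means.items(), key=lambda item: item[1]))


# Toy grades using the assumed +1 / 0 / -1 scale.
graded = [
    ("quantity", 1), ("quantity", 1), ("quantity", 0),
    ("working_link", -1), ("working_link", 1), ("working_link", -1),
]
for criteria_type, score in mean_score_by_type(graded).items():
    print(f"{criteria_type}: {score:+.2f}")
```

Sorting from weakest to strongest puts the criteria types where a model loses the most points at the top of the report.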
<img src="https://cdn.sanity.io/images/h6s14f4z/production/e230c5984a08bbb2096d1df3930ca9274a9e61ce-1992x1500.png" alt="" style="width: 100%; border-radius: 10px;" />
<h2>Open sourcing</h2>
We are open sourcing 20 cases in each of the four domains (n = 80 total) on <a href="https://huggingface.co/datasets/mercor/ACE" target="_blank" rel="noopener noreferrer" class="text-indigo-600">Hugging Face</a>. The data has a similar composition and is similarly difficult to the hidden held-out set. You can also use our evaluation harness (now on <a href="https://github.com/Mercor-Intelligence/apex-evals" target="_blank" rel="noopener noreferrer" class="text-indigo-600">GitHub</a>) to reproduce our grading approach. If you build with ACE, please let us know. To submit your model for testing, email apex@mercor.com.
ACE is only possible because of the incredible work of our experts. Thank you to everyone who took part in the project!
<img src="https://cdn.sanity.io/images/h6s14f4z/production/6745667436ccc0ed66fe3708aaa5dd6c0065bd26-2240x1184.png" alt="" style="width: 100%; border-radius: 10px;" />

AI Consumer Index tests what people actually want AI to do across shopping, food, gaming and DIY, created by experts from the Mercor platform.

Introducing the AI Consumer Index | Mercor

Introducing the AI Consumer Index

Knowledge is passed from one person to another. Students learn from teachers. Employees learn from managers. Apprentices learn from masters. Progress is made through people.
Since we founded Mercor almost three years ago, AI has advanced at an astonishing pace. But it still struggles with the subtleties that drive economically valuable work—balancing trade-offs, understanding intent, developing taste, and deciding what should be done, not just what can be done.
That&#39;s where our work begins.
<h2>Unlocking human potential</h2>
Mercor sits at the intersection of labor markets and AI research. We connect human expertise with leading AI labs and enterprises that drive the AI economy.
At Mercor, our vast talent network trains frontier AI models in the same way teachers train students: by sharing knowledge, experience, and context that can&#39;t be captured in code alone. Each project expands what models understand.
Each advance in AI, in turn, unlocks human potential.
We see this through the progress our experts make every day:
<ul><li>A doctor shaping how AI recognizes early warning signs that could be missed by the human eye.</li><li>A banker training an agent to do financial analysis far more efficiently than redundantly analyzing data rooms.</li><li>A lawyer refining how models reason through precedent to improve legal judgment.</li></ul>
<h2>Defining a new category of work</h2>
AI is creating opportunity and accelerating human capability.
Millions of people will spend the next decade teaching machines the judgment, nuance, and taste that only humans possess. Instead of doing predictable work repeatedly, they&#39;ll teach agents how to do it once, so the agent can do it a million times.
Enterprise value chains are already shifting this way. Professionals can now codify workflows as evaluations that models can learn and improve from.
In turn, our experts are moving up the value chain. They&#39;re unlocking their potential by offloading rote tasks to agents and focusing on more economically valuable work that AI can&#39;t reliably do.
<img src="https://cdn.sanity.io/images/h6s14f4z/production/fa6d7579746d6250ae023920130bc6c75266d50a-3386x924.png" alt="series-c-metrics" style="width: 100%; border-radius: 10px;" />
<h2>Investing in our future</h2>
Today, we&#39;re announcing our $350 million Series C funding, led by Felicis with participation from Benchmark, General Catalyst, and Robinhood Ventures. This round values Mercor at $10 billion, 5x our Series B valuation.
The investment accelerates our mission across <a href="https://mercor.com/blog/big-things/" target="_blank" rel="noopener noreferrer" class="text-indigo-600">three focus areas</a>:
<ul><li>Vast talent network</li><li>Better matching</li><li>Faster delivery</li></ul>
We are uniquely positioned to create this new category of work and shape how human and artificial intelligence create more economic value.
We&#39;re hiring. Join us: <a href="https://mercor.com/careers/" target="_blank" rel="noopener noreferrer" class="text-indigo-600">https://mercor.com/careers/</a>

Announcing Mercor's $350 million Series C at a $10 billion valuation.

Unlocking Human Potential in the AI Economy

Unlocking human potential in the AI economy

Many believe gains made in AI must be zero-sum: that as AI improves, human opportunity will decline. But that&#39;s not what we&#39;re seeing.
Our experts collectively earn more than $1M a day. Our customers are today&#39;s most consequential companies. We are the fastest-growing company in history.
We are delivering on a mission to unlock human potential. We are uniquely positioned to create a new category of work that values and progresses human expertise in the AI economy.
We will focus on the following three Big Things:
<h2>Three Big Things</h2>
Vast talent network: Our competitive moat is the network effects from being obsessed with our talent experience. We are proud that more than half of our new experts come through referrals. We won&#39;t lose sight of that. That’s why we will continue to make long-term investments in sourcing, identifying, and retaining top performers who drive disproportionate value for our customers.
Better matching: The right person matched to the right opportunity can create 10x more value. We&#39;ll keep advancing our matching systems to understand each expert&#39;s capabilities and pair them with the opportunities where they&#39;ll have the greatest impact.
Faster delivery: Speed extends across the entire life cycle of data production. We&#39;re building new products and automating everything from fraud detection and onboarding to analytics, review, and testing.
<h2>The Future</h2>
Human data is the foundation of the new economy. Human insight will guide AI, not compete with it.
The total market for human data is defined by the things humans can do that models can&#39;t. Models might be superhuman at Olympiad math, but they still can&#39;t draft an email, negotiate a deal, or understand the tone of a legal argument. Humans give models that missing context. As AI continues to expand what is possible, new categories of work will be created.
This is why the <a href="https://mercor.com/blog/the-economy-will-become-an-rl-environment-machine/" target="_blank" rel="noopener noreferrer" class="text-indigo-600">economy will become an RL environment machine</a>, and every enterprise value chain will converge on evals.
Our job is to unlock the human potential that will drive the AI economy forward.
These are our &quot;Big Things&quot; and they will be for decades to come.

Our focus to deliver on a mission to unlock human potential.

Big things

The biggest obstacle to AI delivering on its economic potential is the gap between existing AI evaluations and what professionals do in the real world. We&#39;re launching the AI Productivity Index (APEX) to start bridging this gap.
APEX is a first-of-its-kind benchmark that evaluates AI models based on their ability to perform economically valuable knowledge work. With future releases, we will expand coverage to other industries, roles, and countries.
APEX is gauging how AI systems are increasing productivity, reshaping the workforce and creating economic value. Alongside other initiatives in the research community, such as OpenAI’s GDPval, we are proud to guide the development of the next generation of AI models to make them more useful. AI is already superhuman at Olympiad math, but these capabilities can be disconnected from what drives the economy. It&#39;s great to have 10,000 PhDs in your pocket—it&#39;s even better to have a model that can reliably do your taxes.
<h2>Constructing APEX</h2>
APEX v1.0 consists of 200 cases split evenly across investment banking, law, consulting, and medical practice. Each case consists of a prompt (task description), sources (information needed to complete the task), and a rubric (criteria for grading model responses). We constructed APEX v1.0 in five steps:
<ol><li>Sourcing experts: we assembled a team of ~100 experts with top-tier experience (e.g., investment bankers from Goldman Sachs) across four professions.</li><li>Prompt generation: experts generate task descriptions, or prompts, describing common workflows in each domain. These workflows are aligned with economic value creation in two ways:<ul><li>Tasks are based on deliverables: prompts ask models to generate deliverables, e.g., a patient diagnosis by a doctor or a competitive research memo by a consultant. Thus, each task in APEX, if completed, would generate genuine economic value.</li><li>Task distributions: the distribution of prompts exactly matches the experts&#39; estimate of the share of time spent on each workflow. For example, in investment banking, financial modeling and valuation (FMV) tasks were estimated to take up 30% of analysts&#39; time. Thus, FMV tasks comprise 30% of the APEX investment banking benchmark.</li></ul></li><li>Source generation: experts produce source documents that contain relevant evidence needed to respond to the prompts. For example, in medical practice, a source document might be a real or synthetic set of CDC recommendations. On average, each prompt is accompanied by 5.83 source documents, comprising (on average) 26,000 tokens.</li><li>Rubric generation: experts produce a rubric of prompt-specific criteria. Each criterion is an objective and self-contained statement about the response. For example, if the prompt asks the model to analyze growth opportunities for Delta Air Lines in 2025, one criterion could stipulate &quot;the response mentions Delta Air Lines&#39; 2025 revenue.&quot; Rubrics have an average of 29.09 separate criteria.</li><li>Quality control: after prompts, sources, and rubrics have been generated by an expert, a separate expert reviews them to ensure quality. In total, 300 prompts were started by contributors, of which 200 were approved by reviewers and added to APEX v1.0.</li></ol>
Experts estimated that tasks in APEX would take a professional between 1 and 8 hours (3.5 on average).
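The task-distribution step above can be illustrated with a short Python sketch. It assumes each domain has 50 cases (200 cases split evenly across four domains) and uses largest-remainder rounding, which is our choice for the illustration and not necessarily how APEX was built. Only the 30% FMV share comes from the text; the other workflow names and shares are hypothetical.

```python
from math import floor


def allocate_prompts(time_shares: dict[str, int], total_prompts: int) -> dict[str, int]:
    """Split total_prompts across workflows in proportion to time_shares.

    time_shares maps a workflow name to its estimated share of time, in
    percent. Largest-remainder rounding keeps the counts summing to
    total_prompts even when the shares do not divide evenly.
    """
    total_share = sum(time_shares.values())
    exact = {name: total_prompts * share / total_share for name, share in time_shares.items()}
    counts = {name: floor(value) for name, value in exact.items()}
    leftover = total_prompts - sum(counts.values())
    # Give any remaining prompts to the workflows with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


# Hypothetical shares: only the 30% FMV figure is taken from the text.
investment_banking_shares = {
    "financial_modeling_and_valuation": 30,
    "hypothetical_workflow_b": 50,
    "hypothetical_workflow_c": 20,
}
counts = allocate_prompts(investment_banking_shares, total_prompts=50)
print(counts)
```

Under these assumptions, a 30% time share corresponds to 15 of the 50 investment banking cases.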
<img src="https://cdn.sanity.io/images/h6s14f4z/production/1460f4cf0e83f35cd47cb8436633ff0922afb3c8-1296x346.png" alt="table-1-apex" style="width: 100%; border-radius: 10px;" />
<h2>Evaluation &amp; Results</h2>
<img src="https://cdn.sanity.io/images/h6s14f4z/production/623da020d622db1aa3f9a1766fb3c131e1e55090-2040x1200.png" alt="top-3-models" style="width: 100%; border-radius: 10px;" />
We evaluate 21 state-of-the-art models using APEX v1.0, including top-performing closed and open-source models. Each model is given the prompt and sources as an input, and then outputs its response as long-form text. The text responses are graded by a panel of LLM judges (majority vote) according to the expert-generated rubric. The overall score for each response is the percentage of rubric criteria it satisfies.
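A minimal Python sketch of this grading scheme, under one reading of the scoring rule: each judge returns a pass/fail verdict per criterion, the panel decides each criterion by majority vote, and the response score is the fraction of criteria passed. The function names and the verdicts are hypothetical; the actual judge prompts and panel size are described in the white paper, not here.

```python
def criterion_passes(judge_votes: list[bool]) -> bool:
    """Majority vote across the judge panel for a single rubric criterion."""
    return sum(judge_votes) > len(judge_votes) / 2


def response_score(votes_per_criterion: list[list[bool]]) -> float:
    """Fraction of rubric criteria that the panel marks as satisfied."""
    passed = [criterion_passes(votes) for votes in votes_per_criterion]
    return sum(passed) / len(passed)


# Hypothetical verdicts: three judges (columns) grading four criteria (rows).
votes = [
    [True, True, False],   # passes 2-1
    [False, False, True],  # fails 1-2
    [True, True, True],    # passes 3-0
    [False, True, True],   # passes 2-1
]
score = response_score(votes)
print(f"{score:.0%}")
```

An odd-sized panel avoids ties; with an even panel, this sketch treats a tie as a fail.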
This rubric-based evaluation system, also used in OpenAI’s excellent <a href="https://openai.com/index/healthbench/" target="_blank" rel="noopener noreferrer" class="text-indigo-600">HealthBench</a> work, allows us to automatically grade new models as they are released while maintaining consistency and objectivity. We report our findings on auto-grading consistency in our white paper.
GPT-5 achieved the top score of 64.2%, and the best-performing open-source model, Qwen3, ranked 7th overall with 59.8%. Top scores were highest in Law (70.5%) and lowest in Investment Banking (59.7%). None of the models meets the production bar for automating real-world tasks in the four professions we looked at. In a live setting, all would require substantial human oversight.
That said, the initial results provide a basis for optimism. APEX v1.0 is highly complex, and its tasks require advanced reasoning, synthesis, multi-hop knowledge handling, and expert-level critical thinking. This is reflected in the time that annotators estimated it would take them to complete the tasks (a 3.5-hour average). Models that could autonomously complete these tasks have the potential to unlock hundreds of billions of dollars of value across the U.S. economy.
<h2>Future work</h2>
APEX v1.0 represents a step toward economically meaningful evaluations of AI models. However, it is an imperfect benchmark, and we plan to improve APEX over time:
APEX world: We are working on introducing simulated environments to allow AI models to interact with clones of common applications like SharePoint, Google Workspace, and other external tools via MCP, API, and GUI. We are hiring experts to populate these worlds with data as if they are working in mock companies. Future iterations of APEX will then evaluate models&#39; ability to interact in these worlds, with many more tools and files available to the models. We aim to bridge the sim2real gap in world fidelity, data distribution, and task realism.
Expanding the benchmark: We will broaden the range of professions and task types included, particularly in creative and technical fields. In partnership with academic and industry leaders, we will increase the granularity of the first four domains to show results in specific groups of workflows and areas of practice.
What does 60% mean? No models are close to achieving a 100% score on APEX. That said, in some cases, a performance of 60% may already add substantial economic value–for example, perhaps a consultant can more easily complete a competitor analysis if given an initial draft from an AI. On the other hand, in some domains, inaccuracies may be actively harmful–an AI-generated diagnostic report might ultimately waste a doctor&#39;s time if it needs to be carefully fact-checked for inaccuracies before it can be used. More research is needed to understand the economic impact of imperfect models.
For questions, feedback, or to get involved, reach out to us at apex@mercor.com.

A New Benchmark Measuring the Economic Impact of AI

Introducing APEX: The AI Productivity Index

Every technological revolution has sparked fears of job loss. The industrial revolution displaced domestic producers with machines. The computer revolution displaced manual clerical work with spreadsheets and databases.
And yet, unemployment rates are lower today than both before and during these events, which each yielded entirely new categories of work. The vast majority of job categories recognized by the Bureau of Labor Statistics didn’t exist before the industrial revolution.
<h2>Humans Will Do Things Once</h2>
The history of technology is a story of democratizing access: the printing press spread ideas, industry scaled labor, and computers digitized knowledge. Each revolution forged entire industries around it. Today, AI makes human capability itself sharable.
<blockquote>“If you wish to achieve some kind of intellectual immortality, writing for the AIs is probably your best chance.” - Tyler Cowen</blockquote>
The value of human work will shift. Think about the difference between filing taxes once and teaching an AI model how to file taxes for you forever. The first is a variable cost, paid millions of times over by individuals and businesses. The second is a fixed cost; once we encode that knowledge, it can be applied an unlimited number of times.
<h2>Real-World Environments</h2>
Reinforcement learning (RL) is becoming so effective that it can saturate any eval, but academic benchmarks aren&#39;t reflective of the outcomes that consumers and enterprises care about. There is a sim-to-real gap in our benchmarks. Did the tax filing minimize liability? Did the medical advice improve patient outcomes? Did the lesson plan help students actually learn?
The real world has richer data rooms, more complex environments of applications and tools, and requests from both programmers and accountants. The frontier of model evaluation now lies in building richer environments: data rooms that mirror your Google Drive workspace, scaffolding that mimics the many applications you have on your laptop or phone, and reward functions that can assess the near-infinite number of actions you can take in the real world.
Models also need to be evaluated on longer-horizon tasks and collaborative environments: longitudinal patient cases assessed by boards of physicians, multi-party negotiations in M&amp;A deals, and risk-hedging as markets move through cycles.
<h2>An Expanding Frontier</h2>
The market for humans teaching models is defined by the number of tasks that humans can do but agents can’t. Many researchers who believe in the inevitability of ASI downplay the role of human data. Once AI exceeds humans in every task, they ask, why would human data matter? Will the pool of people able to contribute to model improvement shrink substantially?
We worked on a project where a team of 100 people worked to find mistakes made by a frontier agent while using a tool. They created rubrics to evaluate the model’s mistakes. At first, everyone easily stumped the model because it failed frequently. Six months later, only 20 people could still stump the model, reinforcing the case made by skeptics of human data.
We then added more tools that the agent could access and started pushing for longer-horizon tasks that would take humans over ten hours to complete. Suddenly, the model began failing across these challenges, and all 100 participants were able to once again contribute meaningfully to the project. As long as there are tasks in the economy that humans can perform but agents cannot, we will continue to need humans to create evaluations and train agents.
<h2>The Long-Term Outlook</h2>
Everyone is focused on the jobs AI might eliminate, such as copywriting, paralegal work, and medical billing. But not nearly enough attention is dedicated to the industry it will create, driven by people who will shape AI’s judgment, design its training environments, and ensure its outputs meet human standards.
We are entering the era of experience, with models learning to optimize for rewards in the real world. Just as humans learn through the guidance of others, AI will require robust feedback. Professors create tests and rubrics to help us improve, while managers provide us with performance reviews to track how we’re doing in our jobs. The same type of scaffolding will be needed by the next generation of AI models.
The industrial revolution created a new class of workers who designed machines and kept them running. Similarly, the AI revolution will create a new class of workers tasked with guiding machines and democratizing access to their abilities. This is the great paradox: the future of AI is human.

While everyone fears job loss, we’re creating a new category of knowledge work faster than at any other time in history. The future of work will converge on training agents.

The Economy will Become an RL Environment Machine

People often ask why I left a burgeoning career in law to work for Mercor. After all, I liked practice (which is, unfortunately, a rather rare sentiment). So why leave a stable career path that I’d spent half a decade building and move myself and my husband across the country to become a Strategic Projects Lead? The truth is that I became impatient: impatient for impact, impatient for growth, and impatient to be at the frontlines of a technology that was already changing the world, with far more to come.
Over the last few years, I saw intermittent glimpses of what AI could mean for humanity. It is a fundamental fact that most professions critical to the public good—healthcare, legal defense, teaching—are woefully understaffed. And, for the first time, it seemed like we were on the verge of a technological wave that could truly (and quickly) help fill those gaps. As a Second Circuit law clerk, I would sometimes compare pro se legal briefs (those written by individuals without an attorney, almost always because they cannot afford one) to the arguments AI models made if I fed them the (PII-scrubbed) relevant legal questions. While the models would occasionally hallucinate, their legal arguments were almost always better than those in the pro se briefs. They were, however, still far below the quality of legal analysis that the $1k+-an-hour lawyers at white shoe firms could produce. I got into the human data space—where people create the data that trains AI—because I wanted to help models deliver top-notch legal analysis, finally giving the Davids of the world the sling they need to take on the Goliaths.
And I joined Mercor specifically because of its incredible position leading a key transformation within that space: a shift from simple data labeling (e.g., deciding whether an image is of a stop sign or a child) to semantic intelligence work (where experts generate complex tasks, which can take 20+ hours to complete, based on their domain-specific knowledge). The company’s core thesis was that, to create next-level AI models, you need great data and, to create great data, you need top human talent. Thus, Mercor’s founders focused on (1) building products that could effectively and efficiently screen for true domain experts (products which have use cases that <a href="https://mercor.com/blog/secret-master-plan/" target="_blank" rel="noopener noreferrer" class="text-indigo-600">go far beyond the human data space</a>) and (2) creating a deep bench of those experts across a broad range of domains.
The company’s success in this new era of semantic intelligence work has made my job as a Strategic Projects Lead incredibly exciting for three reasons: (1) we work for the best clients; (2) we engage the best experts; and (3) we become the best operators. Mercor now works with the <a href="https://x.com/BrendanFoody/status/1939783540394402083" target="_blank" rel="noopener noreferrer" class="text-indigo-600">top five AI labs and six of the Magnificent Seven</a>; many of the projects we run for these companies involve the world’s best and brightest, including Nobel Prize recipients, Emmy winners, Marshall and Rhodes Scholars, FAANG software engineers, and IMO medalists. To make these projects successful, Mercor also invests in world-class internal talent. My colleagues are among the sharpest and most committed people I’ve had the privilege of working with. And we’ve all grown into even stronger operators (and engineers) by taking on far more challenging and meaningful problems at Mercor than we could find elsewhere—backed by exceptional mentorship from leaders like the Mercor founders, who built a company that has engaged 10,000+ experts from 45+ countries in under two years, and Sundeep Jain, Mercor’s President and former Chief Product Officer at Uber.
To put things in perspective: in my second week at Mercor, I was tasked with staffing and running a project of 200+ software developers for one of the world’s top AI companies. This project not only created thousands of high-quality data samples that will move the needle on AI progress—it also gave hundreds of people flexible, high-paid work for nearly five months, with a total of over $4,000,000 DPT (Dollars Paid to Talent). In my second quarter at Mercor, I interfaced almost daily with one of the top researchers at the same frontier lab to manage what the lab has referred to as one of the most ambitious human data projects it’s ever undertaken (notably led by an all-female team, on both the lab’s side and Mercor’s). In sum, I can’t think of a better place both to learn about the human data space, with a bird’s eye view of how the best AI companies utilize human data to improve their models, and to challenge myself as an operator, responsible for supporting hundreds of experts and executing projects worth millions of dollars.
Looking back, the decision to leave law wasn’t about walking away from anything: it was about running toward something urgent and somewhere that I could flourish. I wanted to be in a place where innovation was the norm, where I could take full ownership from Day One, and where the work could scale beyond a single courtroom or client to help shape systems at the global level. Mercor has been that place.
For anyone standing at a similar crossroads who is curious about how they can contribute to something bigger, faster, and more impactful, I hope my story offers a glimpse of what’s possible. If you’re drawn to the same things that I am—a desire to have global impact; to work on the cutting edge of developments in AI; and to push the boundaries of your capabilities—I’d encourage you to check out <a href="https://mercor.com/careers" target="_blank" rel="noopener noreferrer" class="text-indigo-600">our open internal roles</a>.

People often ask why I left a burgeoning career in law to work for Mercor...

Why I Joined Mercor

Why I Joined Mercor – Nancy Fairbank

At Mercor, we connect graduate-level experts with meaningful, well-paid work that leverages their domain knowledge, while providing AI labs with access to our highly specialized and trusted network of talent. The experts on our platform train frontier AI models using their domain expertise by developing discipline-specific prompts, evaluating responses, and delivering high-quality human feedback across fields ranging from law and linguistics to engineering and medicine.
For many graduate students and alumni, it’s both a valuable opportunity to apply their skills at the cutting edge of technology and a way to monetize their expertise without leaving academia behind. To invest in the future of academic research and human insight, we launched the Mercor Graduate Fellowship to support outstanding graduate students who are on track to do exceptional work in their field.
This year, we received over 2,000 applications from across the country and are proud to announce our 50 finalists and two fellowship winners:
<ul><li>Neil Band, a Rhodes Scholar and third-year Computer Science PhD student at Stanford</li><li>Mireya Gonzales-Rivera, an incoming Physics PhD student at UC Berkeley and a graduate of CSU San Marcos</li></ul>
The fellowship comes with a $50,000 grant and public recognition as part of our commitment to supporting graduate researchers and unlocking access to opportunity.
<h2>Meet the 2025 Winners</h2>
Neil Band
Neil is a third-year Computer Science PhD student at Stanford and a former Rhodes Scholar. His research focuses on making large language models more trustworthy and reliable, tackling problems such as hallucination, uncertainty calibration, and reasoning in LLMs. He currently works in one of Stanford’s top AI research labs, exploring how statistical thinking can be applied to modern generative systems.
With the support of this fellowship, Neil plans to take on more ambitious research, further contributing to the future of safe and reliable AI.
<iframe src="https://www.youtube.com/embed/Bl7ZWadMHT4?si=UnUogPXhCyEF9W49" frameborder="0" allow="autoplay; fullscreen; picture-in-picture; clipboard-write; encrypted-media; web-share" referrerpolicy="strict-origin-when-cross-origin" title="YouTube video player" class="mb-8 mt-4 aspect-video w-full max-w-[1080px] rounded-lg bg-gray-100 object-cover sm:mt-6"></iframe>
Mireya Gonzales-Rivera
Mireya earned her bachelor’s in applied physics from California State University, San Marcos, where she conducted undergraduate research in quantum nanoelectronics and silicon-based qubit architectures. This fall, she’s beginning her Physics PhD at UC Berkeley, with a focus on quantum information science.
Mireya aims to build a lab that expands access to research and mentorship for underrepresented students in STEM. Her passion for equity, mentorship, and scientific discovery made her stand out in this year’s applicant pool as a clear example of undiscovered talent.
<iframe src="https://www.youtube.com/embed/xFyESncCyls?si=ZC-ycwAVY3aQ7pgH" frameborder="0" allow="autoplay; fullscreen; picture-in-picture; clipboard-write; encrypted-media; web-share" referrerpolicy="strict-origin-when-cross-origin" title="YouTube video player" class="mb-8 mt-4 aspect-video w-full max-w-[1080px] rounded-lg bg-gray-100 object-cover sm:mt-6"></iframe>
<h2>A Platform Built for Academic Talent</h2>
Through our platform, thousands of PhDs and postdocs are already contributing to advancements in AI and getting paid well for it. For some, it’s a full-time career path. For others, it’s a part-time opportunity to apply their expertise beyond the lab.
Whether through work or funding, our mission remains the same: to support graduate researchers, amplify their impact, and help them shape the future.

At Mercor, we connect graduate-level experts with meaningful, well-paid work that leverages their domain knowledge, while providing AI labs with access to our highly specialized and trusted network of talent.

Mercor Graduate Fellowship Winners

Reinforcement Learning (RL) is driving the most exciting advancements in AI. RL is becoming so effective that models will be able to saturate any evaluation. This means that the primary barrier to applying agents to the entire economy is building evals for everything. However, AI labs are facing a dire shortage of relevant evaluations. The academic evaluations that labs set their goals against don’t reflect what consumers and enterprises demand in the economy.
Evals are the new PRD. Progress in accelerating knowledge work will converge on building environments and evaluations that map real workspaces and deliverables. This new RL-centric paradigm of human data is vastly more data efficient than pretraining, SFT, or RLHF. Most knowledge work includes recurring workflows as variable costs, but creating an environment or evaluation can transform that into a one-time fixed cost.
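The fixed-versus-variable cost argument can be made concrete with a break-even calculation. The Python sketch below is illustrative only: the function name and every dollar figure are hypothetical assumptions, not numbers from Mercor.

```python
import math


def break_even_runs(fixed_cost: float, manual_cost_per_run: float, agent_cost_per_run: float = 0.0) -> int:
    """Smallest number of runs at which building the environment once is
    strictly cheaper than doing the workflow manually every time.

    Solves fixed_cost + n * agent_cost_per_run < n * manual_cost_per_run for n.
    """
    saving_per_run = manual_cost_per_run - agent_cost_per_run
    if saving_per_run <= 0:
        raise ValueError("the agent must be cheaper per run than the manual workflow")
    return math.floor(fixed_cost / saving_per_run) + 1


# Hypothetical figures: a $10,000 environment, $50 per manual run, $5 per agent run.
runs = break_even_runs(fixed_cost=10_000, manual_cost_per_run=50, agent_cost_per_run=5)
print(runs)
```

Past the break-even point, every additional run widens the gap in favor of the one-time investment.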
<h2>Training on Verifiable Rewards</h2>
RL environments allow for rewarding outcomes and intermediate steps in an evaluation. Models take many attempts at a problem, using test-time compute to &quot;think&quot; before they answer. Human-created autograders reward the attempts that were &quot;good&quot;. Reinforcing on those &quot;good&quot; trajectories upweights the chains of thought that were used to get to the answer. This teaches models to think correctly about different types of problems as researchers iteratively hill climb evals.
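The sample-grade-reinforce loop described above can be sketched in a few lines of Python. This is a toy rejection-sampling filter, not a full RL algorithm: `toy_model` stands in for a language model, `autograder` stands in for a human-written verifier, and the final "reinforce" step (fine-tuning on the kept trajectories) is omitted. All names here are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Trajectory:
    reasoning: str  # the chain of thought the model produced
    answer: int     # the final answer it committed to


def toy_model(question: tuple[int, int], rng: random.Random) -> Trajectory:
    """Stand-in for a model: adds two numbers, sometimes with an off-by-one slip."""
    a, b = question
    slip = rng.choice([0, 0, 1, -1])
    return Trajectory(reasoning=f"{a} + {b}, slip={slip}", answer=a + b + slip)


def autograder(question: tuple[int, int], trajectory: Trajectory) -> float:
    """Verifier written by a human: reward 1.0 for a correct answer, else 0.0."""
    return 1.0 if trajectory.answer == sum(question) else 0.0


def collect_good_trajectories(question: tuple[int, int], attempts: int, seed: int = 0) -> list[Trajectory]:
    """Sample several attempts and keep only those the autograder rewards."""
    rng = random.Random(seed)
    samples = [toy_model(question, rng) for _ in range(attempts)]
    return [t for t in samples if autograder(question, t) == 1.0]


good = collect_good_trajectories((2, 3), attempts=16)
print(f"kept {len(good)} of 16 attempts for further training")
```

In a real pipeline the kept trajectories would be used to update the model, so that the reasoning patterns behind rewarded answers become more likely.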
These environments can be thought of as existing on a spectrum of rigidity between two categories:
<ul><li>Objective domains: Games like Pac-Man, chess, and Go have clear state spaces, action spaces, and desired outcomes. Math, code, and even some tasks in biology can often be formulated with near game-like verifiability. This is where RL has already achieved massive early success, notably with AlphaProof, AlphaFold, DeepSeek R1, and the many code generation models on the market today.</li><li>Subjective domains: It’s more difficult to measure accuracy in many real-world tasks, such as generating investment memos, drafting legal briefs, and providing therapy. This makes it difficult to verify that a model achieved desired outcomes. Additionally, experts often support multiple valid opinions about desired processes and outcomes. Rubric-based rewards serve as a way to learn from the messiness of expert human opinions. How to evaluate and train with rubrics as environments is an exciting area of research, with roots laid as early as the Constitutional AI and RLAIF work from Anthropic.</li></ul>
Computer-use agents sit somewhere in the middle. For most of the tasks humans do on computers, goals start to become ambiguous and multi-faceted. Once the goals are defined, however, the actions and outcomes are programmatic and verifiable. These tasks could include planning trips, responding to emails, shopping, or posting on social media. In all of these cases, containerized environments allow for horizontal scaling to learn online from thousands of interactions in parallel.
<h2>Environments Create Experience</h2>
Eventually, our AI systems will learn automatically from signals in the real world like pupils’ test scores increasing, sales closing, maybe even bridges being built. However, intermediate rewards will always remain critical. Similar to how humans learn from other people, models will need guidance on which styles of teaching and sales techniques are most effective. Humans will remain an integral part of the environments models learn from.
We will never escape the era of data; it must follow us to the frontier. That frontier is human-created environments that provide durable sources of experiential data. These environments can serve to train and evaluate models.
<h2>The Path Forward</h2>
Meeting today’s data demand requires rethinking the way we generate signal from human efforts. Creating evals and RL environments is the highest-leverage and most durable use of people’s time. Mercor has helped pioneer environment generation using autograders and continues to push the boundaries of RL data with simulated workspaces, multi-turn support, and multi-modality.
Knowledge work will quickly converge on building RL environments and evaluations for agents to learn from. As AI enters the workforce and operates over proprietary information and under unique professional contexts, these environments codify knowledge and goals for agents. Once individual steps of agentic workflows reach sufficient reliability, all that will be left will be RL training on the goals laid out by humankind.

Reinforcement Learning (RL) is driving the most exciting advancements in AI. RL is becoming so effective that models will be able to saturate any evaluation. This means that the primary barrier to applying agents to the entire economy is building evals for everything. However, AI labs are facing a dire shortage of relevant evaluations.

Welcome to The Era of Evals

Imagine a world where Jeff Bezos is a hedge fund investor, Howard Schultz is a salesman, and Reed Hastings is a teacher. That was the world we lived in, not so long ago. These are the jobs they were doing before they found the best use for their talents.
We founded Mercor because the labor market is the largest, most inefficient market in the world. Better matching people with the work they do every day is the largest lever for maximizing global utility. While we gained incredible traction with our initial focus on contracting experts to train AI models, this is only the first step in our plan to solve global labor allocation.
<h2>The Wedge</h2>
Marketplaces are hard to get off the ground, but if they do take off they become huge. The successful ones have a wedge into a large and pressing unmet need. For Uber, the wedge was black cars. For Airbnb, it was conferences.
We started 2024 in our apartment with no US employees, under $1M in annual revenue, and only seed companies as customers. Last year we grew 6400% and we now work with the most sophisticated technology companies in the world, making us one of the fastest growing companies in Silicon Valley history.
We believe we can create hundreds of thousands of opportunities with AI labs alone, but that pales in comparison to the billions of knowledge work opportunities in the world. The technology that we’ve been building is generally applicable.
<h2>Structural Inefficiency</h2>
Labor inefficiency stems from two structural challenges in the market:
<ol><li>Fragmentation–Candidates apply to a handful of opportunities and companies consider a fraction of a percent of candidates in the market. This is because matching supply and demand needs to be solved manually (and previously in person). Companies manually review resumes, conduct interviews, and predict who they believe will perform well. Human time is the limiting factor. However, if you can solve this matching problem at the cost of software it allows you to interview everyone, making way for a global, unified labor market that every candidate applies to and every company finds talent from.</li><li>Imperfect Information–When you order a ride on Uber, you know what you’re getting. When you book an Airbnb, the pictures usually do a pretty good job. When you’re hiring someone, it’s extremely difficult to accurately predict how well they will perform. Imperfect human judgement is embedded within every transaction. While LLMs are not perfect at talent assessment, models are quickly surpassing human capabilities. This trend will continue to make transactions more efficient.</li></ol>
Correspondingly, our main objectives are to attract high-caliber applicants to Mercor and to accurately predict candidates’ job performance. Achieving these objectives will address global labor inefficiency more broadly.
Contracting experts to train AI models is the perfect forcing function on these objectives. First, we collect performance data from AI labs within days, compared to the 3-month lag at a traditional enterprise. This allows us to immediately calibrate the effectiveness of our models and continuously experiment to find the features predictive of success. Second, we need to engage with a broad pool of candidates across all knowledge work roles (law, consulting, medicine, engineering, etc.). This builds the strength of our talent pool across every professional and academic domain. Third, we will service “unreasonable asks” from AI labs, like needing to find 300 people in two days. These high-volume requests for quality people on extremely short timelines can’t be fulfilled with a services operation. They force us to build automation at each layer of the engagement process to deliver.
Hiring for All Work
We have the largest comparative advantage from automating talent assessment when the ratio of time spent assessing someone to time spent working with them is highest. When hiring someone for 5 years, it’s easier to interview them manually. When hiring someone for 5 weeks, efficient matching automation creates a huge comparative advantage. Because of this, we’ve started with shorter-duration contract work, but will expand progressively towards longer-duration, full-time jobs as our technology matures.
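To make that ratio concrete, here is a rough back-of-the-envelope sketch. The 10 hours of manual assessment and the 40-hour work week are illustrative assumptions, not figures from our data, and `assessment_overhead` is a hypothetical helper named only for this example.

```python
HOURS_PER_WEEK = 40    # assumption: a standard full-time schedule
ASSESSMENT_HOURS = 10  # assumption: total manual screening and interview time per hire


def assessment_overhead(assessment_hours: float, engagement_weeks: float) -> float:
    """Return assessment time as a fraction of the total hours worked."""
    return assessment_hours / (engagement_weeks * HOURS_PER_WEEK)


# The two engagement lengths mentioned above: 5 years versus 5 weeks.
five_year = assessment_overhead(ASSESSMENT_HOURS, 5 * 52)
five_week = assessment_overhead(ASSESSMENT_HOURS, 5)

print(f"5-year hire: {five_year:.2%} of time worked")  # 0.10%
print(f"5-week hire: {five_week:.2%} of time worked")  # 5.00%
```

Under these assumptions, the same interview effort weighs 52 times more heavily on a 5-week contract than on a 5-year hire, which is why automating assessment pays off first on short engagements.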
So, in short, the master plan is:
<ol><li>Hire people to train AI models</li><li>Use those contracts to learn how to predict job performance</li><li>Expand to short-duration contract roles</li><li>Hire people for all jobs</li></ol>
Don’t tell anyone.

Imagine a world where Jeff Bezos is a hedge fund investor, Howard Schultz is a salesman, and Reed Hastings is a teacher. That was the world we lived in, not so long ago. These are the jobs they were doing before they found the best use for their talents.

The Secret Mercor Master Plan

Throughout history, many of humanity’s great minds took a while to find their calling. Jeff Bezos started as a hedge fund investor. Walt Disney was fired as a newspaper editor for “lacking creativity.” Before becoming a priest, Pope Francis worked as a nightclub bouncer. (Yes, really.) In each case, greatness was unlocked by the right person ending up in the right place at the right time. But, tragically, many never do.
The labor market has consistently been the largest, most inefficient market in the world. That unsolved problem becomes immediate and urgent now as we face the advent of AGI and prepare for the most transformative and challenging period in how human talent is allocated.
AI will change and even displace many traditional careers. At the same time, in other professions, AI will actually create much more need and opportunity for human ability.
We’re not the first to recognize these changes. But we’re the first to build a scalable solution. Bringing the right talent to these roles — defining “skilled labor” in the AI era — is where we focus our efforts.
Our current focus is on contracting experts to help advance the frontier of AI, but our sights are on every sector.
Mercor is two years old and growing more than 51% month over month. Our team includes the former Head of Human Data Operations at OpenAI and the previous Head of Growth at Scale. Over half of us are former founders, and our median age is 22.
Our latest round was led by Felicis, with participation from General Catalyst, DST Global Partners, Benchmark, and Menlo Ventures — and it will accelerate our ability to match billions of people with their calling, applying human talent to its highest potential.
If this mission resonates with you, we are looking for exceptional individuals — elite engineers, seasoned operators, and former founders — to join us. Because it takes talent to recognize talent. Reach out: <a href="mailto:careers@mercor.com" target="_blank" rel="noopener noreferrer" class="text-indigo-600">careers@mercor.com</a>.

Mercor is the global allocator for extraordinary human talent in the AI economy, and one of the fastest growing companies in Silicon Valley history. We’ve just raised a $100M Series B to solve the hardest problem in capitalism: matching human ability to its greatest use.

Announcing Mercor's Series B

Today, we're excited to announce $3.6M in funding and the launch of our fully automated platform, which uses AI to assess and match talent with companies. Our round is led by General Catalyst and includes participation from Scott Sandell (Chairman, CEO and CIO of NEA) and others.

Mercor Blog
