Dec 12, 2025

Expanding the AI Productivity Index (APEX)

Brendan Foody
Co-founder / CEO

We created the first version of the AI Productivity Index (APEX) to assess whether frontier models can perform high-value work in Investment banking, Management consulting, Law, and Medicine. We found that even the very best models struggle on complex real-world tasks, failing to meet the production bar.

Mercor has now doubled the size of the held-out evaluation set in APEX from n=200 to n=400. This larger eval set allows us to more consistently evaluate frontier models’ ability to perform tasks that create economic value. The design of the cases remains the same (each comprises a prompt, source documents, and a grading rubric), but we have increased their complexity and variety. On average, tasks in APEX take seasoned professionals over two and a half hours to complete in the real world. The contributors to APEX typically had over 7 years of experience.
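For readers unfamiliar with the format, the sketch below shows one plausible way a single case could be represented in code. The field names (`domain`, `prompt`, `source_documents`, `rubric`) and the toy content are illustrative assumptions, not the published schema.

```python
from dataclasses import dataclass, field

@dataclass
class ApexCase:
    """Illustrative shape of a single APEX case: a prompt, supporting
    source documents, and a rubric of graded criteria. Field names are
    assumptions for illustration, not the published schema."""
    domain: str                      # e.g. "Investment banking", "Law"
    prompt: str                      # the task given to the model
    source_documents: list[str]      # reference material the task relies on
    rubric: list[dict] = field(default_factory=list)  # e.g. {"criterion": ..., "weight": ...}

# A toy example (entirely hypothetical content):
example = ApexCase(
    domain="Management consulting",
    prompt="Draft a market-entry assessment for the attached company profile.",
    source_documents=["company_profile.pdf", "industry_report.pdf"],
    rubric=[{"criterion": "Identifies key market risks", "weight": 0.3}],
)
```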

We have also refined the evaluation methodology in APEX, increasing the number of runs we execute to 8, simplifying the grading process, and adding confidence intervals to our results. Read the technical report to find out more about our approach to evaluation and how the dataset was created.
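As a rough illustration of the aggregation step, the sketch below computes a mean score and a 95% confidence interval from repeated runs. The exact interval construction used in the report may differ; this version takes a t-interval over the eight per-run means, and the scores here are synthetic.

```python
import numpy as np
from scipy import stats

def mean_score_with_ci(scores: np.ndarray, confidence: float = 0.95):
    """Aggregate per-case scores from repeated runs into a mean and a
    confidence interval.

    `scores` is a (runs, cases) array of scores in [0, 1]; APEX executes
    8 runs per model. This sketch uses a t-interval over per-run means.
    """
    per_run_means = scores.mean(axis=1)   # one mean score per run
    mean = per_run_means.mean()
    sem = stats.sem(per_run_means)        # standard error of the mean
    half_width = sem * stats.t.ppf((1 + confidence) / 2, df=len(per_run_means) - 1)
    return mean, (mean - half_width, mean + half_width)

# Example: 8 runs over 400 held-out cases with synthetic scores.
rng = np.random.default_rng(0)
scores = rng.uniform(0.4, 0.9, size=(8, 400))
mean, (lo, hi) = mean_score_with_ci(scores)
print(f"mean = {mean:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```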

Results

GPT-5 (Thinking = High) is the best performing model, with a mean score of 67%, followed by Gemini 3 Pro (Thinking = High) at 64.3% and Grok 4 at 63.5%. Models score lowest on Investment banking (top score = 63.0%), followed by Management consulting (top score = 64%) and Medicine (top score = 65.5%), with substantially higher scores on Law (top score = 77.9%). APEX shows that models still struggle with the real-world tasks professionals undertake every day. The new leaderboard is available here.

Gemini 3 Pro (Thinking = High), released in November 2025, is a substantial improvement on Gemini 2.5 Pro, beating it by over 5 percentage points overall and by 9 percentage points on Investment banking. Similarly, Opus 4.5 (Thinking = On), also released in November, beats Sonnet 4.5 and Opus 4.1 by 6 and 12 percentage points respectively. These substantial improvements speak to a meaningful step forward in models’ ability to perform high-value tasks. GPT-5.1 is an exception and does not improve on GPT-5 (which remains the leaderboard champion), but this is perhaps not surprising given that 5.1 is primarily meant to be more conversational and explanatory, rather than better at complex reasoning. Finally, Grok 4 is remarkably strong, coming 4th overall despite being the only model without an explicit reasoning setting.

If you want to add your model to the leaderboard or run a loss analysis, contact the Mercor Applied AI Research team at [email protected].

Open source eval set

We are releasing n=100 open-source cases (APEX-v1-devset) on Hugging Face with a CC-BY licence for anyone to train on, evaluate against, and research. The open-source cases were created by the same annotators through the same production pipeline. We are also open-sourcing our evaluation harness so you can exactly match our grading approach. We are looking forward to seeing what the community builds and would love to hear from any researchers using our data.
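Loading the devset should look roughly like the snippet below. The repository id `mercor/APEX-v1-devset` and the split name are assumptions for illustration, so check the Hugging Face page for the actual identifiers and schema.

```python
from datasets import load_dataset

# Assumed repository id and split; verify against the Hugging Face page.
devset = load_dataset("mercor/APEX-v1-devset", split="train")

print(len(devset))        # expect 100 open-source cases
print(devset[0].keys())   # inspect the fields available for one case
```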