The AI Productivity Index (APEX) assesses whether frontier models are capable of performing economically valuable tasks across four jobs: investment banking associate, management consultant, big law associate, and primary care physician (MD).
We created APEX to bridge the gap between what professionals want from AI systems and what benchmarks test for. The prompts are realistic, challenging and diverse, and are provided with source documents. Each case was created by a veteran industry expert to capture their day-to-day work.
The leaderboard is based on a hidden, held-out set of 400 tasks (n=100 per job). For each task, we collect 8 responses from each model, grade them with a judge LM, and report the mean score.
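The scoring described above (8 responses per task, each graded by a judge LM, then averaged) can be sketched roughly as follows. This is a minimal illustration, not the actual eval harness. The function names, the 0-to-1 score scale, the task IDs, and the choice to average per task before averaging across tasks are all assumptions made for the example.

```python
from statistics import mean

RUNS_PER_TASK = 8  # from the text: each model answers each task 8 times


def task_score(run_scores):
    """Mean judge score over the repeated runs of a single task."""
    if len(run_scores) != RUNS_PER_TASK:
        raise ValueError(
            f"expected {RUNS_PER_TASK} runs, got {len(run_scores)}"
        )
    return mean(run_scores)


def leaderboard_score(scores_by_task):
    """Mean of per-task means, expressed as a percentage.

    Assumption: judge scores are on a 0-to-1 scale and every task
    carries equal weight.
    """
    return 100 * mean(task_score(s) for s in scores_by_task.values())


# Hypothetical judge scores for one model on two tasks (made-up values)
example = {
    "task_a": [0.5] * 8,
    "task_b": [1.0, 1.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.5],
}

print(leaderboard_score(example))
```

Because every task has the same number of runs, averaging per task first gives the same result as averaging over all responses at once. The two-step form just makes per-task scores easy to inspect.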
To support open research, we have open-sourced n=100 cases on Hugging Face that are in-distribution with APEX. We have also shared our eval harness for reproducibility.
Advised by Dominic Barton—former McKinsey Global Managing Director and Canadian Ambassador to China.
Experts from McKinsey, BCG, Deloitte, Accenture, EY
Gemini 3 Pro (High): 64%
GPT 5 (High): 63%
Grok 4: 60%
Experts from Goldman Sachs, Morgan Stanley, JPMorgan, Barclays, UBS, Bank of America, Evercore
Gemini 3 Pro (High): 63%
GPT 5 (High): 61%
Grok 4: 60%
Advised by Cass Sunstein—Harvard law professor, former White House Regulatory Administrator, and top-cited legal scholar.
Experts from Latham & Watkins, Skadden, Cravath
GPT 5 (High): 78%
GPT 5.1 (High): 77%
o3 (On): 76%
Advised by Eric Topol, cardiologist, geneticist, founder of the Scripps Research Translational Institute, and a leading voice in digital and precision medicine.
Experts from University of Pennsylvania, Northwestern, Cornell, Brigham & Women’s, Mount Sinai
GPT 5 (High): 66%
Opus 4.5 (On): 65%
Grok 4: 64%