Introducing the APEX-agents benchmark

Mercor is excited to release the APEX-agents benchmark, the most realistic and comprehensive benchmark for AI agents executing professional-services work.
APEX-agents covers three high-value domains: Management Consulting, Investment Banking, and Law. In total there are 315 tasks, with each domain contributing 105 tasks split across 7 worlds.
We are also open-sourcing 90 tasks on Hugging Face (30 for each domain) and making Archipelago, our infrastructure service for evaluation and inference, available as open source. To explore the data, see the open-source worlds, tasks, and sample eval results at opensource.studio.mercor.com.
AI agents promise a step change in how professional work is carried out, with the potential for a huge boom in productivity. However, despite their impressive capabilities and steady improvement on existing benchmarks, we see a large gap between benchmarks and the real world. The APEX Survey revealed how many professionals still…
Partnering with Box, Harvey, and <tbc>, we set out with APEX-agents to build an agentic eval that closes the Sim2Real gap, giving trustworthy signal about agents' true capabilities. Over XX experts worked on the benchmark, logging thousands of hours of work and drawing on tens of thousands of files, messages, and documents.
We tested 10 frontier models against APEX-agents. The results show <full results>.
The full APEX-agents leaderboard is available here. If you want to add your agent to the leaderboard contact the Mercor Applied AI Research team at [email protected].
Our methodology
We approach agentic evaluation in three steps, informed by the APEX Survey.
- Build a realistic world. We instruct experts to create a rich, complex world, complete with research materials, meeting notes, interim outputs, final deliverables, chat exchanges, calendar invites, and external-facing emails. When building the world, experts adopt in-world personas, such as the customer, partner, project manager, and juniors. They also keep a strict distinction between in-world activity and out-of-world activity (e.g., discussing their work with Mercor). Every world is seeded with a rich project scenario, outlining the key players (e.g., the customer and the delivery company) and the primary objectives and constraints.
To view an example, see the open-source worlds available here: <Consulting World 102>.
- Implement required apps. To perform work in the world, agents need access to the apps that experts use day to day. Based on our expert survey, we identified the most important apps in each domain and implemented them in RLStudio. In management consulting, for instance, the primary apps are Sheets, Slides, Docs, Email, Chat, Calendar, and File Explorer.
Each app is associated with a set of functions (e.g., “Edit text” or “Read Table”). Over time, we plan to add more apps and more functions, allowing our partners to run ablations on how changing tool access affects agent performance.
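To make the app-plus-functions structure concrete, here is a minimal sketch of how an app might expose a named function set to an agent. All names here (the `App` class, the `Sheets` example, `read_table`, `edit_text`) are hypothetical illustrations, not the actual RLStudio API:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class App:
    """A simulated workplace app exposing named functions to the agent."""
    name: str
    functions: Dict[str, Callable[..., str]] = field(default_factory=dict)

    def register(self, fn_name: str):
        """Decorator that adds a function to this app's callable set."""
        def decorator(fn):
            self.functions[fn_name] = fn
            return fn
        return decorator

    def call(self, fn_name: str, **kwargs) -> str:
        """Invoke one of the app's functions by name, as an agent would."""
        if fn_name not in self.functions:
            raise KeyError(f"{self.name} has no function {fn_name!r}")
        return self.functions[fn_name](**kwargs)

# A toy "Sheets" app with two functions, mirroring the examples above.
sheets = App(name="Sheets")

@sheets.register("read_table")
def read_table(sheet: str) -> str:
    # A real implementation would return the sheet's actual contents.
    return f"contents of {sheet}"

@sheets.register("edit_text")
def edit_text(sheet: str, cell: str, value: str) -> str:
    return f"set {sheet}!{cell} to {value}"
```

Structuring each app as a named function set is what makes the planned ablations straightforward: removing a function from the registry changes the agent's tool access without touching the world itself.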
- Create tasks and verifiers. Once the world has been built and the apps implemented, experts create realistic, challenging, and diverse tasks, each paired with a verifier that scores the agent's output against the world's ground truth. Every task can only be executed using information from within the world, and typically requires complex reasoning and advanced planning.
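A task-verifier pairing can be sketched as a prompt plus a programmatic check over the agent's final world state. This is an illustrative assumption about the shape of a verifier, not APEX-agents' actual scoring code; the file path and check here are invented for the example:

```python
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Task:
    """A task prompt paired with a programmatic verifier."""
    prompt: str
    verifier: Callable[[Dict[str, str]], bool]

def deck_contains_analysis(world_state: Dict[str, str]) -> bool:
    # Hypothetical check: the agent saved the deck and it includes
    # the required analysis drawn from in-world materials.
    deck = world_state.get("slides/q3_review.pptx", "")
    return "market sizing" in deck

task = Task(
    prompt="Prepare the Q3 review deck using the research folder.",
    verifier=deck_contains_analysis,
)

# Score a run by applying the verifier to the agent's final world state.
final_state = {"slides/q3_review.pptx": "market sizing summary ..."}
print(task.verifier(final_state))  # prints True
```

Because the verifier inspects world state rather than the agent's transcript, a task passes only when the deliverable actually exists in the world, regardless of how the agent got there.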
Closing remarks
Agents may be one of the most important technological advances in history, helping to translate the incredible potential of AI into real-world value. But without the right evals, it is hard to track progress and pull the right levers to improve performance. We hope that APEX-agents can support developers in creating useful models that enhance and improve the daily work of billions of professionals around the world.
