APEX Benchmarks

The APEX family of benchmarks assesses whether frontier AI models can perform economically valuable tasks across professional services, medicine, software engineering, and consumer activities.

APEX-Agents

The AI Productivity Index for Agents (APEX-Agents) measures whether frontier AI agents can execute long-horizon, cross-application tasks across three jobs in professional services.

GPT 5.4 (xHigh): 36.0% ± 3.8%
GPT 5.2 (xHigh): 34.4% ± 3.8%
Gemini 3.1 Pro (High): 33.5% ± 3.6%

APEX-SWE

The AI Productivity Index for Software Engineers (APEX-SWE) measures whether frontier AI agents can execute high-value engineering work, split across Observability and Integration tasks. It was created by Mercor in collaboration with Cognition.

GPT 5.3 Codex (High): 41.5% ± 6.3%
Opus 4.6 (High): 40.5% ± 6.3%
Opus 4.5 (High): 38.7% ± 6.3%

APEX

The AI Productivity Index (APEX) assesses whether frontier models can perform economically valuable tasks across four jobs: investment banking associate, management consultant, big law associate, and primary care physician (MD).

GPT 5.4 (High): 67.2% ± 2.4%
Opus 4.6 (Max): 65.7% ± 2.6%
Opus 4.6 (High): 65.3% ± 2.7%

ACE

The AI Consumer Index (ACE) assesses whether frontier AI models can perform everyday consumer tasks in shopping, food, gaming, and DIY.

GPT 5 (High): 56.1% ± 3.3%
o3 Pro (High): 55.2% ± 3.2%
GPT 5.1 (High): 55.1% ± 3.2%
