The AI Productivity Index for SWEs

The AI Productivity Index for Software Engineers (APEX-SWE) measures whether frontier AI systems can execute economically valuable software engineering work. It has two domains, covering Integration and Observability tasks.

The APEX-SWE leaderboard

We created APEX-SWE to evaluate the real day-to-day work of software engineers, in contrast to benchmarks built on unit-level tasks or single-repository bug fixes. It comprises n=200 cases across two complementary settings that mirror professional SWE work: (1) Integration tasks, which require end-to-end system construction and deployment across heterogeneous services, and (2) Observability tasks, which require debugging with production-style telemetry.

Each task includes unit tests and a human-authored rubric that grades agent outputs on functional requirements, robustness, and code style.

To support open research, we have open-sourced n=50 cases on Hugging Face, drawn from the same distribution as APEX-SWE and including all metadata labels. We have also shared our eval harness for reproducibility.

Domains covered in APEX-SWE