The AI Productivity Index for Software Engineers (APEX-SWE) measures whether frontier AI systems can execute economically valuable software engineering work. Unlike unit-level and single-repository bug-fix benchmarks, APEX-SWE evaluates the real day-to-day work of software engineers. It comprises n=200 cases spanning two complementary settings that mirror professional SWE work: (1) Integration tasks, which require end-to-end system construction and deployment across heterogeneous services, and (2) Observability tasks, which require debugging with production-style telemetry.
Each task pairs unit tests with a human-authored rubric that grades agent outputs on functional requirements, robustness, and code style.
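For illustration, here is a minimal sketch of how a case combining a rubric with unit tests might be represented. The schema, field names, and weighting scheme below are assumptions for exposition, not the actual APEX-SWE format.

```python
from dataclasses import dataclass, field

@dataclass
class RubricCriterion:
    # Hypothetical categories mirroring the three grading axes described above.
    category: str        # e.g. "functional", "robustness", "style"
    description: str     # what the grader checks for
    weight: float        # contribution to the overall task score

@dataclass
class TaskCase:
    task_id: str
    prompt: str
    rubric: list[RubricCriterion] = field(default_factory=list)
    unit_tests: list[str] = field(default_factory=list)  # paths to test files

def rubric_score(case: TaskCase, satisfied: set[int]) -> float:
    """Weighted fraction of rubric criteria judged as satisfied."""
    total = sum(c.weight for c in case.rubric)
    earned = sum(c.weight for i, c in enumerate(case.rubric) if i in satisfied)
    return earned / total if total else 0.0
```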
To support open research, we have open-sourced n=50 cases, drawn from the same distribution as APEX-SWE, on Hugging Face with all metadata labels. We have also shared our evaluation harness for reproducibility.
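As a sketch, the public cases could be pulled with the Hugging Face `datasets` library. The dataset ID below is a placeholder, not the actual repo path, and the field names are unverified assumptions.

```python
from datasets import load_dataset

# Placeholder dataset ID; substitute the actual APEX-SWE repo path on Hugging Face.
cases = load_dataset("your-org/apex-swe-public", split="train")

# Inspect the metadata labels shipped with the first few cases.
for case in cases.select(range(3)):
    print(case.keys())  # exact field names may differ from this sketch
```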
Integration (experts from Zapier, MuleSoft, Workato, and Segment):

| Model | Score |
| --- | --- |
| GPT-5 (High) | 78% |
| Sonnet 4.5 (On) | 75% |
| Gemini 2.5 Pro (On) | 73% |
Observability (experts from Datadog, New Relic, Splunk, and Grafana):

| Model | Score |
| --- | --- |
| GPT-5 (High) | 80% |
| Sonnet 4.5 (On) | 77% |
| Gemini 2.5 Pro (On) | 74% |