Introducing CRUX
Open-world evaluations for measuring frontier AI capabilities
The case for long, messy, real-world tasks to evaluate AI agents
What is CRUX?
CRUX (Collaborative Research for Updating AI eXpectations) is a project for systematically conducting open-world evaluations. We find tasks where people genuinely disagree about what agents can do, evaluate agents on those tasks, and use the results to update expectations about AI capabilities.
Each evaluation involves a long-horizon, real-world task; an agent scaffold that could, in principle, allow an agent to solve the task; detailed log analysis; and a write-up that includes interpretations from collaborators with diverse perspectives.
We plan to release new evaluations every 1–2 months. The next CRUX evaluation will focus on AI R&D tasks.
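To make "agent scaffold" concrete, here is a minimal sketch of the general shape such a harness can take: a model driven in a tool-use loop, with every step logged for later analysis. This is an illustrative sketch under stated assumptions, not CRUX's actual harness; `query_model` is a hypothetical placeholder for any model-provider API, and the single shell tool and fixed step budget are assumptions made for brevity.

```python
# Minimal agent-scaffold sketch. NOT CRUX's actual harness: `query_model`
# is a hypothetical placeholder, and the tool set and step budget are
# illustrative assumptions.
import subprocess

MAX_STEPS = 50  # long-horizon tasks need a generous step budget


def run_shell(command: str) -> str:
    """Run a shell command and return its combined output (a common agent tool)."""
    result = subprocess.run(
        command, shell=True, capture_output=True, text=True, timeout=120
    )
    return result.stdout + result.stderr


def query_model(transcript: list[dict]) -> dict:
    """Placeholder for a model API call returning the agent's next action,
    e.g. {"type": "shell", "command": ...} or {"type": "submit", "content": ...}."""
    raise NotImplementedError("plug in a model provider here")


def run_agent(task_prompt: str) -> list[dict]:
    """Drive the model in a tool-use loop, logging every step."""
    transcript = [{"role": "user", "content": task_prompt}]
    for _ in range(MAX_STEPS):
        action = query_model(transcript)
        transcript.append({"role": "assistant", "content": str(action)})
        if action.get("type") == "submit":  # the agent declares it is done
            break
        if action.get("type") == "shell":
            observation = run_shell(action["command"])
            transcript.append({"role": "tool", "content": observation})
    return transcript  # the full transcript is the raw material for log analysis
```

The design point that matters for open-world evaluations is the return value: the scaffold hands back the entire transcript, because the detailed log, not just a pass/fail score, is the primary evidence.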
The problem
Benchmarks saturate quickly and can’t capture the messiness of real-world tasks. Whatever is precise enough to benchmark is also precise enough to optimize for.
The approach
Open-world evaluations: small numbers of long-horizon tasks in real-world settings, with detailed log analysis and human intervention to elicit upper-bound capabilities.
CRUX
A collaborative project to systematically conduct open-world evaluations, with new experiments every 1–2 months across AI R&D, governance, and more.
As AI systems become more capable, evaluators must accept tradeoffs between evaluations that are constrained and scalable, and evaluations that are noisy and realistic. Open-world evaluations sit at the noisy-but-realistic end of this spectrum.
Survey of evaluations
Over the past year, researchers at AI labs, universities, non-profits, and independent groups have begun running open-world evaluations. Collecting and comparing them helps identify what makes an evaluation informative, surfaces common patterns in how agents succeed and fail, and builds toward a cumulative body of evidence about what agents can and cannot do.
We survey ten examples below; each entry includes harness details, costs, and what we learned.
Collaborators
- Sayash Kapoor, Princeton University
- Peter Kirgis, Princeton University
- Andrew Schwartz, Cornflower Labs
- J.J. Allaire, Meridian Labs
- Rishi Bommasani, Stanford University
- Magda Dubois, UK AI Safety Institute
- Gillian Hadfield, Johns Hopkins University
- Andy Hall, Stanford University
- Sara Hooker, Adaption Labs
- Seth Lazar, Australian National University and Johns Hopkins University
- Steve Newman, Golden Gate Institute for AI
- Dimitris Papailiopoulos, UW Madison
- Stephan Rabanser, Princeton University
- Shoshannah Tekofsky, AI Village
- Helen Toner, Georgetown University (CSET)
- Cozmin Ududec, UK AI Safety Institute
- Arvind Narayanan, Princeton University