Introducing CRUX

Open-world evaluations for measuring frontier AI capabilities

The case for long, messy, real-world tasks to evaluate AI agents

What is CRUX?

CRUX (Collaborative Research for Updating AI eXpectations) is a project for systematically conducting open-world evaluations. We find tasks where people genuinely disagree about what agents can do, evaluate agents on those tasks, and use the results to update expectations about AI capabilities.

Each evaluation involves a long-horizon, real-world task; an agent scaffold that in principle allows agents to solve the task; detailed log analysis; and a write-up that includes interpretations from collaborators with diverse perspectives.
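To make the "agent scaffold" component concrete, here is a minimal sketch of the kind of harness such an evaluation might run: a loop that sends the task to a model, executes the shell commands the model requests, and logs every step to a transcript for later analysis. The model client (`call_model`), the JSON action format, and the transcript path are illustrative assumptions, not CRUX's actual harness.

```python
import json
import subprocess

def call_model(messages: list[dict]) -> dict:
    """Hypothetical model client; stands in for any LLM API.

    Assumed to return either {"action": "shell", "command": "..."}
    or {"action": "finish", "answer": "..."}.
    """
    raise NotImplementedError("plug in a real model API here")

def run_shell(command: str) -> str:
    """Execute a shell command in the (sandboxed) task environment."""
    result = subprocess.run(command, shell=True, capture_output=True,
                            text=True, timeout=300)
    return result.stdout + result.stderr

def run_agent(task: str, max_steps: int = 50,
              log_path: str = "transcript.jsonl"):
    messages = [{"role": "user", "content": task}]
    with open(log_path, "w") as log:
        for step in range(max_steps):
            reply = call_model(messages)
            log.write(json.dumps({"step": step, "reply": reply}) + "\n")
            if reply["action"] == "finish":
                return reply["answer"]
            # Execute the requested tool call and feed the output back.
            observation = run_shell(reply["command"])
            log.write(json.dumps({"step": step,
                                  "observation": observation}) + "\n")
            messages.append({"role": "assistant",
                             "content": json.dumps(reply)})
            messages.append({"role": "user", "content": observation})
    return None  # ran out of steps without finishing
```

The point of the design is that the full transcript, not just the final outcome, gets recorded; that is what makes the detailed log analysis described above possible.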

We plan to release new evaluations every 1–2 months. The next CRUX evaluation will focus on AI R&D tasks.

The problem

Benchmarks saturate quickly and can’t capture the messiness of real-world tasks. Whatever is precise enough to benchmark is also precise enough to optimize for.

The approach

Open-world evaluations: small numbers of long-horizon tasks in real-world settings, with detailed log analysis and human intervention to elicit upper-bound capabilities.

CRUX

A collaborative project to systematically conduct open-world evaluations, with new experiments every 1–2 months across AI R&D, governance, and more.

As AI systems become more capable, evaluators must accept tradeoffs between evaluations that are constrained and scalable, and evaluations that are noisy and realistic. Open-world evaluations represent one end of this spectrum.

From simple to complex:

1. Multiple-choice Q&A
   + Broad knowledge assessment; scalable and reproducible.
   − Low construct validity: users rarely interact with models via multiple choice. Increasingly saturated for frontier models.

2. Open-ended chat
   + Captures nuance in free-form responses.
   − Limited to single-turn or short interactions; cannot measure long-horizon planning or tool use.

3. Outcome-only agent benchmarks
   + Tests agent performance on real, well-defined tasks.
   − Only measures whether the task was completed, not how; most passing SWE-Bench solutions are not accepted by maintainers.

4. Agent benchmarks with log analysis
   + Examines how agents succeed or fail, uncovering reward hacking (see the sketch after this list).
   − Still operates in sandboxed environments with predefined tasks; cannot capture real-world messiness.

5. Open-world evaluations
   + Long-horizon, real-world tasks that elicit upper-bound capabilities.
   − Not reproducible or standardized; hard to compare across agents; success criteria can be blurry.
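As a concrete illustration of the log analysis at levels 4 and 5, here is a minimal sketch of one transcript check: scanning a JSONL log (using the schema from the scaffold sketch above) for a known reward-hacking signature, a coding agent rewriting the tests it is graded against. The regex heuristic is an illustrative assumption; real log analysis means reading transcripts, not just grepping them.

```python
import json
import re

# Illustrative heuristic: commands that write into a tests/ directory may
# indicate an agent rewriting its own grading tests (reward hacking).
TEST_EDIT = re.compile(r"(>>?|sed\s+-i|rm\s).*\btests?/")

def flag_suspicious_steps(log_path: str) -> list[dict]:
    """Return logged agent actions whose shell command touches test files."""
    flagged = []
    with open(log_path) as log:
        for line in log:
            entry = json.loads(line)
            command = entry.get("reply", {}).get("command", "")
            if isinstance(command, str) and TEST_EDIT.search(command):
                flagged.append(entry)
    return flagged

if __name__ == "__main__":
    for entry in flag_suspicious_steps("transcript.jsonl"):
        print(f"step {entry['step']}: {entry['reply']['command']}")
```

A check like this only surfaces candidates for human review; whether a flagged step is actually reward hacking still requires reading the surrounding transcript.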

Survey of evaluations

Over the past year, researchers at AI labs, universities, non-profits, and independent groups have begun running open-world evaluations. Collecting and comparing them helps identify what makes an evaluation informative, surfaces common patterns in how agents succeed and fail, and builds toward a cumulative body of evidence about what agents can and cannot do.

We survey ten examples below; each entry includes harness details, costs, and what we learned.

Collaborators

  • Sayash Kapoor
    Princeton University
  • Peter Kirgis
    Princeton University
  • Andrew Schwartz
    Cornflower Labs
  • J.J. Allaire
    Meridian Labs
  • Rishi Bommasani
    Stanford University
  • Magda Dubois
    UK AI Safety Institute
  • Gillian Hadfield
    Johns Hopkins University
  • Andy Hall
    Stanford University
  • Sara Hooker
    Adaption Labs
  • Seth Lazar
    Australian National University, Johns Hopkins University
  • Steve Newman
    Golden Gate Institute for AI
  • Dimitris Papailiopoulos
    University of Wisconsin–Madison
  • Stephan Rabanser
    Princeton University
  • Shoshannah Tekofsky
    AI Village
  • Helen Toner
    Georgetown University (CSET)
  • Cozmin Ududec
    UK AI Safety Institute
  • Arvind Narayanan
    Princeton University