Overview

AI evaluations are at a critical inflection point. Static benchmarks such as MMLU, HumanEval, and SWE-Bench are saturated: as models grow increasingly familiar with public test data, and even capable of autonomously finding answer keys online, these benchmarks have reached their limits. The gap between