It’s useful for evals to be run reliably on every model and maintained for long periods. A lot of the point of safety-relevant evals is to be a building block people can use for other things: they can make forecasts or bets about what models will score on the eval, or about what will happen if a certain score is reached; they can make commitments about what to do if a model achieves a certain score; they can write legislation that applies only to models with specific scores; and they can advise the world to treat these scores as an indicator of whether risk is high.
Much of that falls apart if there’s FUD about whether a given eval will still exist and be run on the relevant models in a year’s time.
This didn’t use to be an issue, because evals used to be simple to run: just a script that asked the model a series of multiple-choice questions.
Agentic evals are complex. They require GPUs and containers and scripts that need to be maintained. You need to scaffold your agent and run it for days. Sometimes you need to build a vending machine.
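To make “scripts that need to be maintained” concrete, here’s a minimal sketch of what a containerized agentic eval loop tends to look like. This isn’t any particular team’s harness; `call_model`, the `eval-env:latest` image, and the `grade.sh` grader are all hypothetical placeholders.

```python
# Minimal sketch of an agentic eval loop: one container per task, an agent loop
# that proposes shell commands, and a task-specific grader. Every piece here is
# something someone has to keep working as images, APIs, and models change.
import subprocess

def call_model(prompt: str) -> str:
    """Placeholder for a real model API call; swap in your provider's client."""
    raise NotImplementedError

def run_task(task_id: str, max_steps: int = 50) -> float:
    # Start an isolated environment for the task (image name is made up).
    container = subprocess.run(
        ["docker", "run", "-d", "eval-env:latest", "sleep", "infinity"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    try:
        transcript = f"Task {task_id}: achieve the goal using shell commands."
        for _ in range(max_steps):
            action = call_model(transcript)  # agent proposes a shell command
            if action.strip() == "DONE":
                break
            result = subprocess.run(  # execute the command inside the container
                ["docker", "exec", container, "bash", "-lc", action],
                capture_output=True, text=True,
            )
            transcript += f"\n$ {action}\n{result.stdout}{result.stderr}"
        score = subprocess.run(  # task-specific grader, also maintained by hand
            ["docker", "exec", container, "bash", "-lc", "./grade.sh"],
            capture_output=True, text=True,
        )
        return float(score.stdout.strip() or 0.0)
    finally:
        subprocess.run(["docker", "rm", "-f", container], capture_output=True)
```

Multiply this by hundreds of tasks, GPU-backed environments, and multi-day runs, and keeping an eval alive starts to look like maintaining real infrastructure rather than rerunning a question file.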
I’m worried about a pattern where a shiny new eval is developed, run for a few months, then discarded in favor of newer, better evals. Or where the folks running the evals don’t get around to running them reliably for every model.
As a concrete example, the 2025 AI Forecasting Survey asked people to forecast what the best model’s score on RE-Bench would be by the end of 2025, but RE-Bench hasn’t been run on Claude Opus 4.5, or on many other recent models (METR focuses on their newer, larger time-horizon eval instead). It also asked for forecasted scores on OSWorld, but the original OSWorld isn’t run anymore (it’s been replaced by OSWorld-Verified).
There are real costs to running these evals, and when an eval is deprecated it’s usually because it’s been replaced with something better. But I think people sometimes act like deprecating an eval is completely costless, and I want to point out the costs.