Thanks, just watched a talk by Luxin that explained this. Two questions.
Currently the models’ ability scores seem to be a pretty straightforward average score as a function of the annotated level of the task on that ability. But it would be useful to be able to infer ability scores without needing much data from the actual tasks. Could you do something clever, like estimating them just from token probabilities (or a beam search) on a few proxy questions? Something in the spirit of the sketch below.
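To make that concrete, here is a minimal sketch of the kind of thing I mean, assuming an open-weights model reachable through Hugging Face transformers. The proxy items, the model name, and the scoring rule are all hypothetical; it scores candidate answers by summed token log-probabilities rather than running a full beam search, and uses the fraction of proxy items answered correctly as a crude ability estimate.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical proxy items: short prompts with a few candidate answers,
# each meant to probe one annotated ability level.
PROXY_ITEMS = [
    {"prompt": "Q: What is 17 * 23?\nA:", "candidates": [" 391", " 381", " 401"], "correct": 0},
    {"prompt": "Q: Which planet is closest to the Sun?\nA:", "candidates": [" Mercury", " Venus", " Mars"], "correct": 0},
]

def candidate_logprob(model, tokenizer, prompt, candidate):
    """Sum of token log-probabilities of `candidate` given `prompt`.
    Assumes the tokenization of `prompt` is a prefix of `prompt + candidate`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + candidate, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logprobs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    # Score only the candidate tokens (positions after the prompt).
    for pos in range(prompt_ids.shape[1], full_ids.shape[1]):
        token_id = full_ids[0, pos]
        total += logprobs[0, pos - 1, token_id].item()
    return total

def proxy_ability_estimate(model_name="gpt2"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    correct = 0
    for item in PROXY_ITEMS:
        scores = [candidate_logprob(model, tokenizer, item["prompt"], c)
                  for c in item["candidates"]]
        if max(range(len(scores)), key=scores.__getitem__) == item["correct"]:
            correct += 1
    # Crude ability estimate: fraction of proxy items the model prefers correctly.
    return correct / len(PROXY_ITEMS)

if __name__ == "__main__":
    print(proxy_ability_estimate())
```

With enough proxy items per ability level, something like this might give a usable ability estimate from a handful of forward passes rather than a full benchmark run.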
The headline number for predictiveness looks good. But does the predictiveness have systematic errors across benchmarks? I.e. for tasks from different benchmarks with the same annotated requirements, are some benchmarks systematically harder or easier than predicted?
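One simple way to check, assuming you have per-task predicted success probabilities, observed outcomes, and a benchmark label (the file and column names here are hypothetical): group the prediction residuals by benchmark and see whether any benchmark's mean residual sits far from zero.

```python
import pandas as pd

# Hypothetical columns: `benchmark`, `predicted_success` (probability implied by
# the ability scores plus annotated requirements), `observed_success` (0/1 outcome).
df = pd.read_csv("predictions.csv")  # assumed layout, not a real artifact

df["residual"] = df["observed_success"] - df["predicted_success"]

# Mean residual per benchmark: persistently positive means systematically easier
# than the annotations imply, negative means systematically harder.
per_benchmark = (
    df.groupby("benchmark")["residual"]
      .agg(["mean", "sem", "count"])
      .sort_values("mean")
)
print(per_benchmark)
```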