It goes without saying that METR didn’t simply stipulate “linear score improvement = exponentially increasing time horizons”: it arose from a lot of admirable empirical work demonstrating that human completion times are roughly log-distributed.
Not sure I agree with this: we constructed the benchmark to span a wide range, and the empirical work was mostly to show that the model success rate curve was logistic in human time.
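Schematically, the kind of fit I mean looks like this (a rough sketch with toy numbers, not our actual data or code, and a plain least-squares fit rather than a proper logistic regression): success rate is modelled as a logistic function of log human completion time, and the “time horizon” is just the task length at which the fitted curve crosses 50%.

```python
import numpy as np
from scipy.optimize import curve_fit

# Toy data: made-up (human minutes, model succeeded) pairs, for illustration only.
human_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480], dtype=float)
succeeded     = np.array([1, 1, 1, 1,  1,  0,  1,   0,   0,   0], dtype=float)

def success_curve(t, log2_horizon, slope):
    """P(success) as a logistic function of log2(human time); crosses 50% at t = 2**log2_horizon."""
    return 1.0 / (1.0 + np.exp(slope * (np.log2(t) - log2_horizon)))

# Fit the curve and read off the 50% time horizon.
(log2_horizon, slope), _ = curve_fit(success_curve, human_minutes, succeeded, p0=[5.0, 1.0])
print(f"estimated 50% time horizon ≈ {2 ** log2_horizon:.0f} minutes")
```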
But at least when taken as the colloquial byword for AI capabilities, this crucial contour feels a bit too mechanistic to me. I take it that the fact you can generalise the technique widely to other benchmarks deepens rather than alleviates this concern: if human benchmarking exercises would give log-distributed horizons across the items in many (most?) benchmarks, such that progressive linear increments in model performance would yield a finding of exponentially improving capabilities, then maybe too much is being proven.
This isn’t always the case. What matters for a benchmark is a range that spans many orders of magnitude, not an exactly log-shaped distribution, and in that report we saw that many benchmarks spanned narrower ranges than the METR task suite, so there isn’t enough data to establish how increments in model performance translate into time horizons. In others, task length was poorly correlated with difficulty for models. But taken together, the fact that models went from 0% to saturating MATH500, then the AIME, and now the IMO points to a dramatic increase in capabilities.
I also think our everyday experience points to something like an exponential relationship between an intuitive “task complexity rating” and human time. It’s natural to think of a level-(n+1) task as being decomposable into several level-n tasks (e.g. get groceries = drive to the store, shop, drive back from the store), which naturally gives you an exponential relationship.
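Spelling that out with an assumed constant branching factor k (my gloss, not a precise claim): if every level-(n+1) task decomposes into k level-n tasks, then

$$ T(n) = k\,T(n-1) \;\Rightarrow\; T(n) = k^{\,n}\,T(0), \qquad n = \log_k\!\bigl(T(n)/T(0)\bigr), $$

i.e. time is exponential in the intuitive complexity level, which is just another way of saying complexity is roughly linear in log time.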
It doesn’t seem to me the frontier has gotten ~3x more capable over this year, and although I’m no software engineer, it doesn’t look from the outside like e.g. Opus 4.5 is 2x better at SWE than Opus 4.1, etc.
Some of this is probably the benchmark being unrepresentative of real tasks, but it’s not clear why an agent with 2x the time horizon should feel 2x better. When using an assistant with double the time horizon on real tasks, you need to intervene half as often, but each intervention takes you longer, since the agent has written about double the code, fails in more complicated ways, and you have less understanding of what it’s doing. Combined with Amdahl’s law effects, I wouldn’t be surprised if doubling the time horizon only causes a speedup of 1.2x or so on average on tasks much longer than the agent’s time horizon, which are still most of them.
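As a rough back-of-envelope with made-up numbers: suppose some fraction of the work can’t be delegated to the agent at all, and that on the delegable part doubling the horizon halves the number of interventions but makes each one ~1.5x more costly to review. Then:

```python
# Purely illustrative numbers, not measurements.
NON_DELEGABLE = 0.4   # fraction of the work the agent can't help with (Amdahl's serial fraction)
BASE_SPEEDUP  = 1.5   # assumed speedup on the delegable part at the current time horizon

def overall_speedup(delegable_speedup: float, serial_fraction: float) -> float:
    """Amdahl's law: overall speedup when only (1 - serial_fraction) of the work is accelerated."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / delegable_speedup)

# Doubling the horizon: half as many interventions, each ~1.5x costlier to review,
# so the delegable-part speedup improves by roughly 2 / 1.5 ≈ 1.33x rather than 2x.
doubled = BASE_SPEEDUP * (2.0 / 1.5)

before = overall_speedup(BASE_SPEEDUP, NON_DELEGABLE)
after  = overall_speedup(doubled, NON_DELEGABLE)
print(f"overall speedup: {before:.2f}x -> {after:.2f}x (ratio {after / before:.2f}x)")
```

With those particular assumptions, doubling the time horizon moves the overall speedup from about 1.25x to about 1.43x, a gain of roughly 1.14x, which is the same ballpark as the 1.2x guess above.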