Playing a game of chess takes hours. LLMs are pretty bad at it, but we have had good chess engines for decades—why isn’t there a point way off in the top left for chess?
Answer: we’re only interested in highly general AI agents, which basically means LLMs. So we’re only looking at the performance of LLMs, right? But if you only look at LLM performance without scaffolding, it looks to me like the task length asymptotes at around 15 minutes. Only by throwing in systems that use a massive amount of inference-time compute do we recover a line with a consistent upward slope. So we’re allowed to use search, just not narrow search like chess engines. This feels a little forced to me—we’re putting two importantly different things on the same plot.
Here is an alternative explanation of that graph: LLMs have been working increasingly well on short tasks, but probably not doubling task length every seven months. Then after 2024, a massive amount of effort was poured into trying to make them do longer tasks by paying a high cost in inference-time compute and very carefully designed scaffolding, with very modest success. It’s not clear that anyone has another (good) idea.
With that said, if the claimed trend continues for another year (now that there are actually enough data points to usefully draw a line through), that would be enough for me to start finding this pretty convincing.
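For concreteness, here is the arithmetic behind “continues for another year”: a minimal Python sketch of what the claimed seven-month doubling implies. The doubling period comes from the trend under discussion; the starting horizon below is a hypothetical placeholder, not a value read off the graph.

```python
# Illustration only: exponential task-horizon growth at the claimed doubling rate.
# The starting horizon is a made-up placeholder, not a number from the plot.

DOUBLING_PERIOD_MONTHS = 7

def horizon_after(months_elapsed: float, starting_horizon_minutes: float) -> float:
    """Task-length horizon after `months_elapsed`, assuming clean exponential growth."""
    return starting_horizon_minutes * 2 ** (months_elapsed / DOUBLING_PERIOD_MONTHS)

if __name__ == "__main__":
    start = 60.0  # hypothetical: a one-hour horizon today
    for months in (0, 7, 12, 24):
        print(f"{months:>2} months: {horizon_after(months, start):7.1f} minutes")
```

Twelve more months at that rate multiplies the horizon by 2^(12/7), roughly 3.3×, which is the kind of jump one more year of data would need to show.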
But then again, it seems like we wouldn’t be able to create an accurate plot with any set of models, since the models are inherently different and each one has slight architectural variations. Even the 2024–2025 plot isn’t entirely accurate, as the models it includes also differ to some extent. Comparing LLMs to LRMs (Large Reasoning Models) is simply a natural step in their evolution; these models will always continue to develop.
The problem is deeper than that.