I found this tweet helpful: it does the same regression on another dataset (chess) and arrives at an absurd conclusion. For me, the takeaway is that LLMs may soon be able to handle very large software engineering tasks, but that this will likely not generalize to arbitrary tasks. Longer, more general tasks might still follow soon after, but you can’t reliably predict that from this single dataset alone.
I don’t think I get it. If I read this graph correctly, it says that if you let a human play chess against an engine and want the human to achieve equal performance, the amount of time the human needs to think grows exponentially as the engine gets stronger. That doesn’t make sense if extrapolated downward, but upward it’s about what I would expect: you can compensate for skill by applying more brute force, but it becomes exponentially costly, which fits the exponential graph.
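To make the shape of that argument concrete, here is a minimal sketch of the kind of fit I have in mind: an exponential relationship between engine strength and the human thinking time needed to match it, fitted in log-space and then extrapolated. Every number below is made up purely for illustration; none of it comes from the tweet or from the METR paper.

```python
import numpy as np

# Hypothetical, invented data points, only to illustrate the shape of the fit:
# engine strength (Elo) vs. the thinking time (hours) a strong human would
# supposedly need to play at that level.
elo = np.array([2200, 2400, 2600, 2800, 3000])
human_hours = np.array([0.1, 0.5, 3.0, 20.0, 150.0])

# An exponential relationship t = a * exp(b * elo) is linear in log-space,
# so fit log(t) against Elo with ordinary least squares.
b, log_a = np.polyfit(elo, np.log(human_hours), deg=1)

# Extrapolating upward: predicted thinking time at a higher rating.
elo_new = 3400
predicted_hours = np.exp(log_a + b * elo_new)
print(f"predicted ~{predicted_hours:.0f} hours at Elo {elo_new}")

# Extrapolating downward quickly gives near-zero "thinking times",
# which is the end of the curve that stops making sense.
```

The asymmetry is the point: downward the fit is nonsense, upward it just says that matching a stronger engine by brute force gets exponentially more expensive.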
It’s probably not perfect—I’d worry a lot about strategic mistakes in the opening—but it seems pretty good. So I don’t get how this is an argument against the metric.
It is a decent metric for chess, but a) it doesn’t generalize to other tasks (which is how people seem to interpret the METR paper), and, less importantly, b) I’m quite confident that people wouldn’t beat the chess engines by thinking for years.
What is the absurd conclusion?
That we would have had AIs performing year-long tasks in 2005. Chess is not the same as software engineering, but it is still a limited domain.
I mean, beating a chess engine in 2005 might be a “years-long task” for a human? The time METR is measuring is how long it would hypothetically take a human to do the task, not how long it takes the AI.
Yes, but it didn’t mean that AIs could do all kinds of long tasks in 2005. And that is the conclusion many people seem to draw from the METR paper.
No? It means you can’t beat the chess engine.
And even granting that, they try to argue in the other direction: from how long a task takes a human, they predict when an AI will be able to do it. That didn’t work for chess either.
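For contrast, here is a sketch of that other direction of argument, the METR-style extrapolation: fit the trend of how long a task (measured in human time) an AI can complete against the calendar year, then solve for when the trend crosses a “years-long” task. Again, all numbers are invented for illustration and are not METR’s data.

```python
import numpy as np

# Hypothetical, invented data: for each calendar year, the longest task
# (in human work hours) that an AI of that year could complete.
year = np.array([2019, 2021, 2023, 2025])
ai_task_hours = np.array([0.01, 0.1, 1.0, 8.0])

# Fit log(task length) against calendar year (an exponential trend in time).
slope, intercept = np.polyfit(year, np.log(ai_task_hours), deg=1)

# Invert the fit: in which year does the trend reach roughly a year of
# human work (~2000 hours)?
target_hours = 2000.0
predicted_year = (np.log(target_hours) - intercept) / slope
print(f"trend crosses {target_hours:.0f}h around {predicted_year:.0f}")

# Applied to the chess data, this same style of extrapolation is what
# yields the "year-long tasks in 2005" conclusion discussed above.
```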