Updates about LLM agency.
The AI 2027 forecast for mid-2025 scores on SWE-bench was not correct:
(From the footnotes here.)
As of December 2025, the SOTA is around 81% (Claude Opus 4.5), so this threshold probably will not be passed until 2026. Still, it does not seem far off.
Also, GPT-5.1-Codex-Max has a longer task length than I expected (perhaps because it is specialized for coding? But it seems there are always more tricks to maintain exponential growth; is this sustainable?).
On balance, I increasingly trust “straight lines” like METR task length to hold up in the short to medium term, simply because they have held up reliably without speeding up or slowing down (so perhaps I will lose my bet with @Daniel Kokotajlo). But even exponential growth is somewhat smooth, which seems consistent with my model’s prediction that agency is hard. The evidence is (subjectively) weird: we are too ignorant about how LLMs work to make principled predictions. And I seem to have an unhealthy (awareness of my) reputation as a LessWrong LLM skeptic, when in fact I am often confused and hold my beliefs on this rather weakly.
Thanks for following up! Yeah, at some point (perhaps January?) we should do a blog post retrospective enumerating all the forecasts we made in AI 2027 and comparing them to what actually happened. My general sense right now is that progress has been somewhat slower than AI 2027 expected, and even slower than I expected at the time (my median was 2028), but not dramatically slower. It would be good to quantify this.