Cole Wyeth comments on Cole Wyeth’s Shortform

Cole Wyeth 1 Dec 2025 21:17 UTC
8 points
0
Updates about LLM agency.
The AI 2027 forecast for mid-2025 scores on SWE-bench was not correct:
"For example, we think coding agents will move towards functioning like Devin. We forecast that mid-2025 agents will score 85% on SWEBench-Verified."
(From the footnotes here.)
As of December 2025, the SOTA is around 81% for Claude Opus 4.5, so this threshold will probably not be passed until 2026. Still, it does not seem far off.
Also, GPT-5.1-Codex-Max has a longer task length than I expected. Perhaps that is because it is built specifically for coding, but there always seem to be more tricks available to maintain exponential growth. Is this sustainable?
On balance, I increasingly trust "straight lines" like METR task length to hold up in the short to medium term, simply because they have held up reliably without speeding up or slowing down (so perhaps I will lose my bet with @Daniel Kokotajlo). But even exponential growth is somewhat smooth, which seems consistent with my model's prediction that agency is hard. The evidence is (subjectively) weird: we are too ignorant about how LLMs work to make principled predictions. And I seem to have an unhealthy (awareness of my) reputation as a LessWrong LLM skeptic, when in fact I am often confused and hold my beliefs on this rather weakly.
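To make the "straight line" idea concrete, here is a minimal sketch of how one might check whether a task-length trend is speeding up or slowing down: fit a line to log2(task length) against time and compare the implied doubling time on the early and late halves of the data. The function name fit_doubling_time and all the numbers are hypothetical. The data is synthetic and chosen to double exactly every six months; it is not METR's measurements.

```python
import math


def fit_doubling_time(years, task_minutes):
    """Least-squares fit of log2(task length) against time.

    Returns (doubling_time_in_years, intercept_in_log2_minutes).
    A straight line on a log plot means a constant doubling time.
    """
    logs = [math.log2(m) for m in task_minutes]
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    slope = sxy / sxx  # doublings per year
    intercept = mean_y - slope * mean_x
    return 1.0 / slope, intercept


# Synthetic, illustrative data only (NOT real benchmark results):
# years since an arbitrary start date, and task length in minutes.
years = [0.0, 0.5, 1.0, 1.5, 2.0]
task_minutes = [1.0, 2.0, 4.0, 8.0, 16.0]

overall, _ = fit_doubling_time(years, task_minutes)
early, _ = fit_doubling_time(years[:3], task_minutes[:3])
late, _ = fit_doubling_time(years[2:], task_minutes[2:])

print(f"overall doubling time: {overall:.2f} years")
print(f"early half: {early:.2f} years, late half: {late:.2f} years")
```

If the early and late doubling times roughly agree, the trend has "held up without speeding up or slowing down" in the sense above; a clearly shorter late doubling time would suggest acceleration, and a longer one would suggest a slowdown.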
- Daniel Kokotajlo 2 Dec 2025 1:05 UTC
  11 points
  4
  Thanks for following up! Yeah, at some point (perhaps January?) we should do a blog post retrospective enumerating all the forecasts we made in AI 2027 and comparing them to what actually happened. My general sense right now is that progress has been somewhat slower than AI 2027 expected, and even slower than I expected at the time (my median was 2028 at the time), but not dramatically slower. It would be good to quantify this.