It seems to me that AIs have remained stubbornly terrible at this from GPT-3 to GPT-4 to Sonnet 3.5.1 to o1[2]; that the improvement on this hard-to-specify quality has been ~0.
Huh, I disagree reasonably strongly with this. Possible that something along these lines is an empirically testable crux.
I expect this is the sort of thing that can be disproven (if LLM-based AI agents actually do start displacing nontrivial swathes of, e.g., non-entry-level SWE workers in 2025-2026), but only “proven” gradually (if “AI agents start displacing nontrivial swathes of some highly skilled cognitive-worker demographic” continually fails to happen year after year after year).
Overall, operationalizing bets/empirical tests about this has remained a cursed problem.
Edit:
As a potentially relevant factor: Were you ever surprised by how unbalanced the progress and the adoption have been? The unexpected mixes of capabilities and incapabilities that AI models have displayed?
My current model is centered on trying to explain this surprising mix (top-tier/superhuman benchmark performance vs. frequent falling-flat-on-its-face real-world performance). My current guess is basically that all capabilities progress has effectively been goodharting on legible performance (benchmarks and their equivalents) while producing ~0 improvement on everything else. Whatever it is that benchmarks and benchmark-like metrics are measuring, it’s not what we think it is.
So what we will always observe is AI getting better and better at any neat empirical test we can devise, always seeming on the cusp of being transformative, while continually and inexplicably failing to tip over into actually being transformative. (The actual performance of o3 and GPT-5/6 would be a decisive test of this model for me.)
top-tier/superhuman benchmark performance vs. frequent falling-flat-on-its-face real-world performance
Models are only just now getting to the point where they can complete 2-hour tasks 50% of the time on METR’s task suite (at least without scaffolding that uses much more inference compute).
This isn’t yet top-tier performance, so I don’t see the implication. The key claim is that progress here is very fast.
So, I don’t currently feel that strongly that there is a huge benchmark-vs-real-performance gap on at least autonomous SWE-ish tasks? (There might be in math, and I agree that if you just looked at math and exam-question benchmarks and compared to humans, the models seem much smarter than they are.)
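To make the “2-hour tasks, 50% of the time” figure above concrete: one way such a 50% time horizon can be estimated is to fit a logistic curve of success probability against log task length, then solve for the length at which the predicted probability crosses 0.5. The sketch below uses invented data purely for illustration and is not METR’s actual code.

```python
# Sketch: estimate the task length at which an agent succeeds 50% of the time.
# Data is hypothetical; this illustrates the idea, not METR's methodology or code.
import numpy as np
from sklearn.linear_model import LogisticRegression

# (task length in minutes, whether the agent succeeded) for a single model
task_minutes = np.array([2, 5, 10, 15, 30, 30, 60, 60, 120, 120, 240, 480])
succeeded    = np.array([1, 1,  1,  1,  1,  0,  1,  0,   1,   0,   0,   0])

# Fit success probability as a logistic function of log task length.
X = np.log(task_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, succeeded)

# P(success) = 0.5 where the logit is zero: coef * log(t) + intercept = 0.
log_t50 = -clf.intercept_[0] / clf.coef_[0, 0]
print(f"Estimated 50% time horizon: {np.exp(log_t50):.0f} minutes")
```

Rerunning this over successive model generations gives the horizon-over-time trend that the “progress here is very fast” claim refers to.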
Something interesting here: part of why AI companies won’t want to use agents is that their capabilities are good enough that being very reckless with them might actually cause small-scale misalignment issues. If that’s truly a big part of the problem in getting companies to adopt AI agents, this is good news for our future:
https://www.lesswrong.com/posts/K2D45BNxnZjdpSX2j/?commentId=qEkRqHtSJfoDA7zJX
FWIW my vibe is closer to Thane’s. Yesterday I commented that this discussion has been raising some topics that seem worthy of a systematic writeup as fodder for further discussion. I think here we’ve hit on another such topic: enumerating important dimensions of AI capability – such as generation of deep insights, or taking broader context into account – and then kicking off a discussion of the past trajectory / expected future progress on each dimension.
Some benchmarks got saturated across this range, so we can imagine “anti-saturated” benchmarks that haven’t yet noticeably moved from zero, operationalizing the intuition that there has been little progress. Performance on such benchmarks still has room to change significantly even with near-future pretraining scaling, from the 1e26 FLOPs of currently deployed models to 5e28 FLOPs by 2028, a 500x increase.
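As a quick arithmetic check on that scale-up (using the figures quoted above; the per-year breakdown assumes a roughly three-year window, which the comment doesn’t specify):

```python
# Sanity-check the quoted compute scale-up. Figures are as stated in the comment
# above; the three-year window is an assumption for the per-year breakdown.
current_flops = 1e26   # training compute of currently deployed models
flops_2028    = 5e28   # projected training compute by 2028

factor = flops_2028 / current_flops  # total scale-up
per_year = factor ** (1 / 3)         # assumed ~3-year window
print(f"Total scale-up: {factor:.0f}x (~{per_year:.1f}x per year over 3 years)")
```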