I think quite a lot of the disagreement about AGI timelines comes from differing attitudes towards benchmarks. It’s unclear to me how big the gap is between benchmarks and real-world applications. Benchmarks are increasingly trying to evaluate more real-world tasks and to incorporate subjective evaluation where purely automated evaluation is difficult. I’m now of the opinion that if you ignore benchmarks and think that improvement on benchmarks isn’t much indication of progress towards truly transformative AI, you’re making a mistake. (But I’m not sure.)
One reason for this is that if you start with that attitude, you can then think about what would make for better benchmarks that actually measure the thing you care about. And it turns out a bunch of other people have done exactly that and have been making better and better benchmarks, and there’s progress on most of them. And the things that I think really matter for AGI capabilities, like automating AI research, seem to be progressing. I still have some uncertainty about whether even these new benchmarks and evaluations cater too much to current capabilities; it’s not impossible to me that there are big missing pieces.
I agree that ignoring benchmarks is wrong (and I think our views are fairly close in absolute terms). However, benchmarks remain pretty bad, labs continue to hill-climb on them (how much is unclear, but it happens), and the authors of the most celebrated benchmarks are extremely modest about how their results ought to be interpreted.
Benchmarks show that models are getting better; how fast, and at what, is still pretty ambiguous once you include these considerations, imo.
I see strong post-deployment learning as a crux for AIs capable of true RSI (recursive self-improvement) or full automation of civilization, and benchmarks only start measuring this capability-aspect when they require very long contexts that must be used to learn deep skills rather than to look things up. But weaker benchmarks still give some signal, since sufficiently strong in-context learning could in principle do the job, and changing the architecture to enable arbitrarily long contexts seems more straightforward than figuring out how to train better in-context learning over very long contexts.
So benchmarks are currently only weakly informative about strong test-time adaptation. There also doesn’t seem to be much public info on how the quality of in-context learning might significantly improve with current methods as compute scales, so a priori this weakly observable capability-aspect is probably not improving very much, and won’t improve sufficiently any time soon, preventing full AGI without substantial algorithmic progress. But plausibly RLVR (reinforcement learning with verifiable rewards) hasn’t yet been seriously applied to training very long context comprehension (everyone was too busy applying it to the more obvious things), and something like next-token prediction RLVR could also have a significant effect. Not to mention LLMs trained via RLVR to run RLVR on themselves, though that is likely too fiddly to start working in practice soon.
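To make “next-token prediction RLVR” concrete, here’s a minimal sketch of one way the verifiable reward could be defined. Everything below is an assumption for illustration (the `LongContextModel` interface is hypothetical, not a real library API): the model reads a very long context, and the reward is its average log-probability on a held-out continuation, so the signal comes from ground-truth tokens rather than a human judge.

```python
# Hypothetical sketch of a next-token-prediction RLVR reward: score how well
# a model predicts a held-out continuation after reading a very long context.
# The LongContextModel interface is assumed for illustration only.

from typing import Protocol, Sequence


class LongContextModel(Protocol):
    def logprob(self, prefix: Sequence[int], next_token: int) -> float:
        """Log-probability of next_token given prefix (token ids)."""
        ...


def comprehension_reward(
    model: LongContextModel,
    long_context: Sequence[int],
    held_out: Sequence[int],
) -> float:
    """Average log-prob of the held-out tokens after the long context.

    A higher reward suggests the model extracted usable skills or knowledge
    from the context, rather than merely looking things up; the reward is
    verifiable because it only depends on the ground-truth continuation."""
    total = 0.0
    prefix = list(long_context)
    for token in held_out:
        total += model.logprob(prefix, token)
        prefix.append(token)  # teacher forcing: condition on the true token
    return total / max(len(held_out), 1)
```

An RL outer loop would then reinforce whatever processing of the long context raises this reward; the sketch only pins down the verifiable objective, not the training algorithm.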