I see strong post-deployment learning as a crux for AIs capable of true RSI or full automation of civilization. Benchmarks only start measuring this capability-aspect when they require very long contexts that are needed for learning deep skills, rather than merely for looking things up. But weaker benchmarks still give some signal: sufficiently strong in-context learning could in principle suffice, and changing the architecture to enable arbitrarily long contexts seems more straightforward than figuring out how to train better in-context learning over very long contexts.
So benchmarks are currently only weakly informative about strong test-time adaptation. There also doesn't seem to be much in the way of public info on how the quality of in-context learning could significantly improve with current methods as compute scales, so a priori this weakly observable capability-aspect is probably not improving very much (and won't improve sufficiently any time soon, preventing full AGI without substantial algorithmic progress). But plausibly RLVR hasn't yet been seriously applied to training very long context comprehension (everyone was too busy applying it to the more obvious things), and something like next-token-prediction RLVR could also have a significant effect. Not to mention LLMs RLVRed to RLVR themselves, though that is likely too fiddly to start working in practice soon.