I think jaggedness of RL (in modern LLMs) is an obstruction that would need to be addressed explicitly; it won’t fall to incremental improvements or scaffolding. There are two very different levels of capability, obtained in pretraining and in RLVR, but only pretraining is somewhat general. And even pretraining doesn’t adapt to novel situations other than through in-context learning, which only expresses capabilities at the level of pretraining, significantly weaker than RLVR-trained narrow capabilities.
Scaling will make pretraining stronger, but probably not sufficiently to matter for this issue, and natural text data will only last for another step of improvement similar to what happened in 2023-2025 (in pretraining only, ignoring RLVR). If RL doesn’t get more general, it’ll probably remain useless for improving general capabilities outside the skills trained with RLVR. Capabilities will remain jagged, with gaps that have to be addressed manually by changing the training data.
This could change within a few years, possibly even faster if LLMs can be RLVRed into becoming able to RLVR themselves, though that won’t necessarily work. Or it could change via next-token-prediction RLVR that makes pretraining stronger without requiring more natural text data, but this probably needs much more compute even if it works in principle, so it might also take 5-10 years, with uncertain capability-level results.