RL sample efficiency can be improved by both:
Better RL algorithms (including things that also improve pretraining sample efficiency like better architectures and optimizers).
Smarter models (in particular, smarter base models, though I’d also expect that RL in one domain makes RL in some other domain more sample efficient, at least after sufficient scale).
There isn’t great evidence that we’ve seen substantial recent improvements in RL algorithms, but AI companies are strongly incentivized to improve RL sample efficiency (as RL scaling appears to be yielding large returns), and there are a bunch of ML papers which claim to find substantial RL improvements (though it’s hard to be confident in these results for various reasons). So, we should probably infer that AI companies have made substantial gains in RL algorithms, even though we don’t have public numbers. Claude 3.7 Sonnet was much better than Claude 3.5 Sonnet, but it’s hard to know how much of that gain came from RL algorithms versus other areas.
Minimally, there is a long-running trend of improvement in pretraining algorithmic efficiency, and many of these improvements should also transfer somewhat to RL sample efficiency.
As for evidence that smarter models learn more sample-efficiently, I think the DeepSeek-R1 paper has some results on this. It’s probably also possible to find various pieces of support for this in the literature, but I’m more familiar with various anecdotes.