While there are various issues with it, one anchor for comparing the “degree to which LLMs are shaped by RL vs pretraining” is “how many distinct ‘tasks’ was the LLM given to complete under each?”.
In pretraining, each forward pass corresponds to one evaluatable and distinct ‘reward’-event. In RL you need many forward passes (my guess is usually on the order of ~1000 for common tasks in the RL training set) to get one such event. So naively, in order to get the same amount of mind-shaping between RL and pretraining, you would have needed to reach the stage where 99.9% of your training is RL, not just >50%.
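To make the arithmetic behind the 99.9% figure explicit, here is a minimal sketch. It assumes the simplified accounting above: one reward event per pretraining forward pass and one per ~1000 RL forward passes. The function name and the parameter `passes_per_reward_event` are made up for illustration, and ~1000 is the guess from the comment, not a measured value.

```python
def rl_fraction_for_parity(passes_per_reward_event: int) -> float:
    """Fraction of all forward passes that must be RL for RL and pretraining
    to produce the same number of reward events.

    Assumed accounting (taken from the comment, not measured):
      - pretraining yields 1 reward event per forward pass
      - RL yields 1 reward event per `passes_per_reward_event` forward passes
    """
    k = passes_per_reward_event
    # Parity needs rl_passes / k == pretrain_passes, so rl_passes = k * pretrain_passes.
    # The RL share of all passes is then k / (k + 1).
    return k / (k + 1)


if __name__ == "__main__":
    for k in (1, 10, 100, 1000):
        print(f"{k:>5} passes per RL reward event -> RL share at parity: "
              f"{rl_fraction_for_parity(k):.2%}")
```

With k = 1000 this gives 1000/1001, which is where the roughly 99.9% comes from.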
I think for various reasons this does overestimate how high the ratio would need to be, but I do think it suggests pretraining will play a larger role than naive compute comparisons would suggest in the resulting minds of the LLMs.
Ah, Claude helped me remember the historical parallel that serves as an intuition pump: in the early days of the deep learning revolution, Hinton and Bengio found it extremely useful to do unsupervised learning on a network first, before doing supervised learning. The post-unsupervised-learning network ended up in the basin of better local optima because it already represented key concepts.
Analogously, I expect that initializing an RL algorithm with a good predictive network makes it massively better and more efficient.
One bit of evidence here (and this is prior to the RL stage) is that you need a lot more compute to train the base model than you need for the fine-tuning step. Summoning a rich set of concepts from the ether takes the vast majority of the effort, compared to highlighting the important ones.
Before LLMs, RL had very unimpressive results in rich domains (because random flailing wouldn’t get you a meaningful amount of learning) and people kept talking about “model-based RL” but their handmade world-model architectures just didn’t work.
I’m arguing that the reason for this is that the vast majority of the effort needed for RL in a rich domain comes from assembling relevant concepts, and that shaping behavior once you have those concepts is a lot more efficient. (And hand-made world models just didn’t include enough important concepts.)
Humans also have massively more unsupervised learning than reinforcement learning, for similar reasons: unsupervised learning data is extremely cheap and predictive processing is always on. You get MB/s for initial vision, and I’d guesstimate kB/s for the highest-level compressed abstractions from the senses as input to consciousness (“scene graph” level while seeing moving objects, “parsed audio” level, etc). Conscious decision making has been estimated to be on the order of 10 bits/s (“The Unbearable Slowness of Being: Why do we live at 10 bits/s?”). But you only get maybe 3 bits per second of reward model feedback (dopamine is slower and usually doesn’t have something to say about every action), and bits per minute or hour for overall task success (the underlying thing dopamine is the predictor for). And yet humans end up extremely competent at advanced disciplines. Presumably unsupervised modeling of experience data generated by the agency is doing most of the work to get from microseconds to seconds, and the reward model closes the remaining gap from seconds to hours.
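For a sense of scale, the rates above can be put on a common footing. This is only a rough sketch: the exact values are assumptions chosen to match the orders of magnitude in the comment (1 MB/s for "MB/s", 1 kB/s for "kB/s", 1 bit per minute for "bits per minute or hour"), and the names are made up for illustration.

```python
import math

# Order-of-magnitude rates in bits per second. All values are illustrative
# assumptions picked to match the guesses in the comment, not measurements.
RATES_BITS_PER_S = {
    "early vision (~1 MB/s)": 8e6,
    "high-level abstractions (~1 kB/s)": 8e3,
    "conscious decisions": 10.0,
    "reward feedback": 3.0,
    "task success (~1 bit/min)": 1 / 60,
}


def orders_of_magnitude_above(channel: str, baseline: str = "reward feedback") -> float:
    """log10 of how much faster `channel` delivers bits than `baseline`."""
    return math.log10(RATES_BITS_PER_S[channel] / RATES_BITS_PER_S[baseline])


if __name__ == "__main__":
    for name in RATES_BITS_PER_S:
        print(f"{name:36s} {orders_of_magnitude_above(name):+.1f} "
              "orders of magnitude vs reward feedback")
```

Under these assumed numbers, raw vision sits about six orders of magnitude above reward feedback, and task-level success signals sit about two below it.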
Relatedly, I don’t buy the recent claims that continual learning is not a big deal. It might not be enough to massively exceed human level, but it seems likely that it will be qualitatively stronger than in-context learning, because it can actually move concepts around, saving superposition bandwidth in the residual stream for actually-dynamic things.
In pretraining, each forward pass corresponds to one evaluatable and distinct ‘reward’-event.
In pretraining, you get one loss signal for each token in the forward pass; a single batch typically contains 10-100M tokens. For RL, you get a few bits of reward for each trajectory, which consists of many forward passes. So the efficiency difference is even larger than you outline here.
For RL, the loss signal is spread across all tokens in a trajectory by either the reward model or just the policy gradient. Either way, there’s still a gradient passing into all the output tokens. That gradient contains less Shannon information, but might not contain as much less V-information as you’d think.
And yet, current LLMs have noticeably different personas from each other, as well as coding skills that significantly outstrip what you would expect from imitation of the corpus. So their post-training has a large impact.
The pre-training forms the foundation (LeCun: “Self-supervised learning: The dark matter of intelligence”, tailcalled: “At its most basic, unsupervised prediction forms a good foundation for later specializing the map to perform specific types of prediction”) which gives the model common sense and general abilities, while reinforcement learning adds something like goal orientation on top.