One bit of evidence here (and this is prior to the RL stage) is that you need a lot more compute to train the base model than you need for the fine-tuning step. Summoning a rich set of concepts from the ether takes the vast majority of the effort, compared to highlighting the important ones.
Before LLMs, RL had very unimpressive results in rich domains (because random flailing wouldn’t get you a meaningful amount of learning) and people kept talking about “model-based RL” but their handmade world-model architectures just didn’t work.
I’m arguing that the reason for this is that the vast majority of the effort needed for RL in a rich domain comes from assembling relevant concepts, and that shaping behavior once you have those concepts is a lot more efficient. (And hand-made world models just didn’t include enough important concepts.)
Humans also have massively more unsupervised learning than RL learning, for similar reasons: unsupervised learning data is extremely cheap and predictive processing is always on; you get MB/s for initial vision, I’d guesstimate kB/s for the highest level compressed abstractions from senses as input to consciousness (“scene graph” level while seeing moving objects, “parsed audio” level, etc), conscious decision making has been estimated to be on order 10b/s (“The Unbearable Slowness of Being: Why do we live at 10 bits/s?”), but you only get maybe a 3 bits per second of reward model feedback (dopamine is slower and usually doesn’t have something to say about every action), and bits per minute or hour for overall task success (the underlying thing dopamine is the predictor for). And yet humans end up extremely competent at advanced disciplines. Presumably unsupervised modeling of experience data generated by the agency is doing most of the work to get from microseconds to seconds, and the reward model closes the remaining gap from seconds to hours.
Relatedly, I don’t buy the recent claims that continual learning is not a big deal. It might not be enough to massively exceed human level, but it seems likely that it will be qualitatively stronger than in-context learning, because it can actually move concepts around, saving superposition bandwidth in the residual stream for actually-dynamic things.
One bit of evidence here (and this is prior to the RL stage) is that you need a lot more compute to train the base model than you need for the fine-tuning step. Summoning a rich set of concepts from the ether takes the vast majority of the effort, compared to highlighting the important ones.
Before LLMs, RL had very unimpressive results in rich domains (because random flailing wouldn’t get you a meaningful amount of learning) and people kept talking about “model-based RL” but their handmade world-model architectures just didn’t work.
I’m arguing that the reason for this is that the vast majority of the effort needed for RL in a rich domain comes from assembling relevant concepts, and that shaping behavior once you have those concepts is a lot more efficient. (And hand-made world models just didn’t include enough important concepts.)
Humans also have massively more unsupervised learning than RL learning, for similar reasons: unsupervised learning data is extremely cheap and predictive processing is always on; you get MB/s for initial vision, I’d guesstimate kB/s for the highest level compressed abstractions from senses as input to consciousness (“scene graph” level while seeing moving objects, “parsed audio” level, etc), conscious decision making has been estimated to be on order 10b/s (“The Unbearable Slowness of Being: Why do we live at 10 bits/s?”), but you only get maybe a 3 bits per second of reward model feedback (dopamine is slower and usually doesn’t have something to say about every action), and bits per minute or hour for overall task success (the underlying thing dopamine is the predictor for). And yet humans end up extremely competent at advanced disciplines. Presumably unsupervised modeling of experience data generated by the agency is doing most of the work to get from microseconds to seconds, and the reward model closes the remaining gap from seconds to hours.
Relatedly, I don’t buy the recent claims that continual learning is not a big deal. It might not be enough to massively exceed human level, but it seems likely that it will be qualitatively stronger than in-context learning, because it can actually move concepts around, saving superposition bandwidth in the residual stream for actually-dynamic things.