Humans also do massively more unsupervised learning than RL learning, for similar reasons: unsupervised learning data is extremely cheap, and predictive processing is always on. You get MB/s for early vision, and I'd guesstimate kB/s for the highest-level compressed abstractions the senses feed to consciousness ("scene graph" level while watching moving objects, "parsed audio" level, etc.). Conscious decision making has been estimated at on the order of 10 bits/s ("The Unbearable Slowness of Being: Why do we live at 10 bits/s?"). But you only get maybe 3 bits per second of reward-model feedback (dopamine is slower and usually has nothing to say about any given action), and only bits per minute or per hour for overall task success (the underlying quantity dopamine is a predictor of). And yet humans end up extremely competent at advanced disciplines. Presumably unsupervised modeling of the experience data generated by agency does most of the work of getting from microsecond to second timescales, and the reward model closes the remaining gap from seconds to hours.
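To make the scale of the gap concrete, here is a back-of-envelope sketch of how many bits each channel delivers over a waking day, using the guesstimated rates above (the rates themselves are the post's assumptions, not measurements):

```python
# Rough daily information budgets for each channel, at the guesstimated rates.
# All numbers are illustrative assumptions from the discussion above.

SECONDS_PER_DAY = 16 * 3600  # ~16 waking hours

rates_bits_per_s = {
    "raw vision (~1 MB/s)": 8e6,
    "compressed abstractions (~1 kB/s)": 8e3,
    "conscious decisions (~10 b/s)": 10,
    "reward-model feedback (~3 b/s)": 3,
    "task-level success (~1 bit/min)": 1 / 60,
}

for name, rate in rates_bits_per_s.items():
    print(f"{name:35s} -> {rate * SECONDS_PER_DAY:.2e} bits/day")
```

At these rates, the unsupervised stream outstrips reward feedback by roughly six orders of magnitude per day, which is the asymmetry the argument turns on.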
Relatedly, I don't buy the recent claims that continual learning is not a big deal. It might not be enough to massively exceed human level, but it seems likely to be qualitatively stronger than in-context learning, because it can actually move concepts around, freeing superposition bandwidth in the residual stream for things that are genuinely dynamic.