I still think this post is correct in spirit, and was part of my journey towards a good understanding of neuroscience, and towards promising ideas in AGI alignment / safety.
But there are a bunch of little things that I got wrong or explained poorly. Shall I list them?
First, my “neocortex vs subcortex” division eventually developed into “learning subsystem vs steering subsystem”, with the latter being mostly just the hypothalamus and brainstem, and the former being everything else, particularly the whole telencephalon and cerebellum. The main difference is that the “learning subsystem” does “learning-from-scratch” in the sense here.
Second, whenever I said “amygdala”, I probably should have said “anterior insula”, or better yet “some cortico-basal ganglia-thalamocortical loop involving anterior insula and ventral striatum”. Back when I wrote this, I thought that supervised learning was the unique realm of the cerebellum and amygdala, but now I think that it’s one aspect of the functioning of (parts of) the neocortex too. See here.
Third, I kinda mangled the description of what happens when the rat’s brainstem is craving salt and then learns to expect saltwater. Keep in mind that nibbling the lever is pointless. The lever doesn’t do anything. It never did! (This experiment is in the “Pavlovian” paradigm, not the “instrumental” paradigm.) So why does the rat run to it and nibble at it?
It seems to me that these Pavlovian experiments are just really weird. Under normal circumstances, saltwater winds up in a rat’s mouth because the rat was drinking it. Here, the rat is just doing whatever, and magically (from the rat’s perspective), saltwater appears in the rat’s mouth, thanks of course to the scientists’ crazy saltwater-squirting backpack contraption.
I think that when the brainstem thinks “oh wow I’m expecting very good things to happen imminently”, that turns into something like “hey cortex, whatever thought you happen to be thinking right now, do it right now, do it with great vigor!!!” Because, in the normal ecological situation, the thought that the rat is thinking is the cause of “expecting very good things to happen”.
But in these Pavlovian experiments, the good thing happens out of nowhere, so the behavior comes down to “whatever the rat happens to be thinking about at the key moment”. And this is actually underdetermined! Different rats’ minds tend to go to different places, and hence they wind up doing different things in these Pavlovian experiments. Thus, in the lingo, some rats are “sign-tracking rats”, while other rats are “goal-tracking rats”.
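(To make that a bit more concrete, here’s a deliberately crude toy sketch in Python. Everything in it, the names, the numbers, the threshold, is invented for illustration; it’s just the “amplify whatever thought is currently active” story above, plus the fact that where a given rat’s attention lands at the key moment is underdetermined.)

```python
def brainstem_vigor_signal(expected_reward: float, threshold: float = 0.8) -> bool:
    """Toy rule: when the brainstem expects something very good imminently,
    it broadcasts 'whatever you're doing right now, do it with great vigor'."""
    return expected_reward > threshold

def behavior_at_cue(attention_bias: str) -> str:
    # When the lever appears, different rats' minds go to different places.
    if attention_bias == "lever":
        current_thought = "approach and nibble the lever"   # sign-tracking
    else:
        current_thought = "go to where saltwater appears"   # goal-tracking

    expected_reward = 0.95  # lever onset predicts imminent saltwater
    if brainstem_vigor_signal(expected_reward):
        # The currently active thought gets executed with vigor, even
        # though (in the lever case) nibbling the lever does nothing.
        return "vigorously: " + current_thought
    return "carry on as before"

for rat, bias in [("rat A", "lever"), ("rat B", "dispenser")]:
    print(rat, "->", behavior_at_cue(bias))
```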
Anyway, in this case, we wind up with the rats looking at and attending to the lever (because it appeared at the right time), and the brainstem says “Yes whatever you’re thinking about, it’s awesome, do it with vigor”. Attending to the lever isn’t exactly an action proposal per se, i.e. it’s not something the rat can “do”, but it happens to overlap with the beginning stage of the salient action plan “go to the lever and nibble the lever”. So that’s what happens. And I just think maybe we shouldn’t think too hard about the details here.
By contrast, in the article, I told a story involving a time-derivative. I think that story is right in other contexts—see here. Just probably not here.
Fourth, my discussion of Hypotheses 2 & 3 wasn’t quite hitting the nail on the head. There are a few issues in play:
(A) Supervised learning vs RL.
Supervised learning is something like learning an N-dimensional output with an N-dimensional ground truth, so you get an error gradient “for free” with each query. Reinforcement learning is something like learning an N-dimensional output with a 1-dimensional “reward” ground truth, and tends to require trial-and-error. This is an important distinction in many contexts, but in retrospect it’s not so important for this post.
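To illustrate the difference in feedback bandwidth, here’s a minimal numpy sketch (my toy example, not anything from the original post): the supervised learner gets a full N-dimensional error vector per query, while the RL learner gets one scalar and has to probe by trial and error to recover a direction.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5
target = rng.normal(size=N)   # the ground truth the learner is trying to match
w = np.zeros(N)               # the N-dimensional output being learned

# Supervised learning: the ground truth is itself N-dimensional, so every
# query yields a full error vector, i.e. a gradient direction "for free".
supervised_feedback = target - w          # N numbers per query

# Reinforcement learning: the ground truth is one scalar "reward"...
def reward(output):
    return -np.sum((output - target) ** 2)

# ...so recovering a direction takes trial and error, e.g. comparing the
# reward of a random perturbation against the baseline (one crude probe).
eps = 1e-3
probe = rng.normal(size=N)
rl_feedback = (reward(w + eps * probe) - reward(w)) / eps * probe

print("supervised feedback per query:", np.round(supervised_feedback, 2))
print("one-probe RL estimate:        ", np.round(rl_feedback, 2))
```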
(B) One “system” vs two “systems”.
Let’s say I want to salivate profusely right now. I can’t just consciously decide to do that. It doesn’t work. I can try to vividly imagine eating a salty cracker. That works a little bit. Or I can go to the pantry and get an actual cracker. That works better.
What we’re seeing here is two systems: one that we associate with free will etc., and another that “decides” whether to salivate. The second system is not under the control of the first system. Both systems learn, but with different training signals. See Reward is Not Enough.
(C) Adversarial dynamics
…And thus, we can think of this as a kind of adversarial-ML-type thing. Every time I (the first system) trick the second system into salivating without later eating salt, there’s a training signal that helps the second system learn not to be fooled. That’s not to say they’re evenly matched; it’s also possible that, in equilibrium, the first system winds up consistently calling the shots.
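Here’s a toy simulation of that dynamic (all the names, numbers, and the update rule are invented for illustration): “system 2” salivates in proportion to its learned estimate of how often a cracker-thought is actually followed by salt, and every fooled trial trains that estimate downward.

```python
class SalivationSystem:
    """System 2: decides whether to salivate; not under system 1's control."""
    def __init__(self, p_salt_given_thought=0.9, lr=0.2):
        self.p = p_salt_given_thought   # learned P(salt | cracker-thought)
        self.lr = lr

    def react_and_learn(self, salt_actually_arrives: bool) -> float:
        salivation = self.p                        # involuntary response
        outcome = 1.0 if salt_actually_arrives else 0.0
        # System 2's own training signal: did salt actually follow the thought?
        self.p += self.lr * (outcome - self.p)
        return salivation

system2 = SalivationSystem()
# System 1 repeatedly "tricks" system 2: vivid imagining, no actual cracker.
for trial in range(1, 6):
    s = system2.react_and_learn(salt_actually_arrives=False)
    print(f"trial {trial}: salivation = {s:.2f}")
# Salivation decays trial by trial: system 2 is learning not to be fooled.
```

Nothing in the toy settles who wins in equilibrium, by the way; that depends on the relative learning speeds and on how reliably system 1 can keep generating novel tricks.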
Thanks to these adversarial dynamics, by the way, my story about why Hypothesis 2 is wrong isn’t as compelling as I had thought.
Also, the difference between Hypotheses 2 and 3 is less profound than it seems, because “two systems” maximizing A and B respectively is fundamentally not so different from “one system” maximizing A+B, for example. The implementation is still different, and the learning speed is different, and the corresponding bundle of intuitions is kinda different. So I still think Hypothesis 3 is the right way to think about it.
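For a toy version of the “A and B vs A+B” point above (my illustration, under the strong simplifying assumption that each objective depends only on its own system’s parameters; the quadratic objectives and all the numbers are made up): separate gradient steps on A and B then coincide exactly with one gradient step on A+B, so the difference really is in the implementation rather than in what gets optimized.

```python
import numpy as np

# Objective A depends only on system 1's parameters; B only on system 2's.
def A(th1): return -np.sum((th1 - 1.0) ** 2)
def B(th2): return -np.sum((th2 + 2.0) ** 2)

th1, th2, lr = np.zeros(3), np.zeros(3), 0.1

# "Two systems": each takes a gradient-ascent step on its own objective.
two_sys_step = lr * np.concatenate([-2.0 * (th1 - 1.0),    # d A / d th1
                                    -2.0 * (th2 + 2.0)])   # d B / d th2

# "One system": one gradient-ascent step on A + B over the concatenated
# parameter vector, with the gradient computed numerically.
def total(theta): return A(theta[:3]) + B(theta[3:])

theta, eps = np.concatenate([th1, th2]), 1e-6
num_grad = np.array([(total(theta + eps * e) - total(theta - eps * e)) / (2 * eps)
                     for e in np.eye(6)])
one_sys_step = lr * num_grad

assert np.allclose(two_sys_step, one_sys_step, atol=1e-6)
print("same update; the pictures diverge once A and B share parameters")
```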
Fifth, having learned more about the neocortex, I’m more confidently opposed to Hypothesis 1.
Sixth, I didn’t know anything about this at the time, but there’s an interesting connection to the “incentive learning” literature, which involves various other rat experiments that seem to contradict the Dead Sea Salt experiment—rats need to learn from experience in situations where (on a naive reading of this post) one might have expected them to do the task optimally on the first try, without learning. This is a fun topic and I have a draft about it that I’ll post at some point.