Inner Alignment in Salt-Starved Rats

Introduction: The Dead Sea Salt Experiment

In this 2014 paper by Mike Robinson and Kent Berridge at the University of Michigan (see also this more theoretical follow-up discussion by Berridge and Peter Dayan), rats were raised in an environment where they were well-nourished, and in particular, where they were never salt-deprived—not once in their lives. The rats were sometimes put into a test cage with a lever which, if pressed, would trigger a device to spray ridiculously salty water directly into their mouths. The rats pressed this lever once or twice, were disgusted and repulsed by the extreme salt taste, and quickly learned not to press the lever again. One of the rats went so far as to stay tight against the opposite wall—as far from the lever as possible!

Then the experimenters made the rats feel severely salt-deprived, by depriving them of salt. Haha, just kidding! They made the rats feel severely salt-deprived by injecting the rats with a pair of chemicals that are known to induce the sensation of severe salt-deprivation. Ah, the wonders of modern science!

...And wouldn’t you know it, almost instantly upon injection, the rats changed their behavior! When shown the lever, they now went right over to that lever and jumped on it and gnawed at it, obviously desperate for that super-salty water.

The end.

Aren’t you impressed? Aren’t you floored? You should be!!! I don’t think any standard ML algorithm would be able to do what these rats just did!

Think about it:

  • Is this Reinforcement Learning? No. RL would look like the rats randomly stumbling upon the behavior of “pressing the lever when salt-deprived”, finding it rewarding, and then adopting that as a goal via “credit assignment”. That’s not what happened. When the rats were nibbling at the lever, they had never in their lives had an experience where pressing the lever led to anything other than an utterly repulsive experience. And they had never in their lives had an experience where they were salt-deprived, tasted something extremely salty, and found it gratifying. I mean, they were clearly trying to press that lever—this is a foresighted plan we’re talking about—but that plan does not seem to have been reinforced by any experience in their lives.

  • Is this Imitation Learning? Obviously not; the rats had never seen any other rat press any lever for any reason.

  • Is this an innate, hardwired, stimulus-response behavior? No, the connection between a lever and saltwater was an arbitrary, learned connection. (I didn’t mention it, but the researchers also played a distinctive sound each time the lever appeared. Not sure how important that is. But anyway, that connection is arbitrary and learned, too.)

So what’s the algorithm here? How did their brains know that this was a good plan? That’s the subject of this post.

What does this have to do with inner alignment? What is inner alignment anyway? Why should we care about any of this?

With apologies to the regulars here who already know all this, the so-called “inner alignment problem” occurs when you, a programmer, build an intelligent, foresighted, goal-seeking agent. You want it to be trying to achieve a certain goal, like maybe “do whatever I, the programmer, want you to do” or something. The inner alignment problem is: how do you ensure that the agent you programmed is actually trying to pursue that goal? (Meanwhile, the “outer alignment problem” is about choosing a good goal in the first place.) The inner alignment problem is obviously an important safety issue, and will become increasingly important as our AI systems get more powerful in the future.

(See my earlier post mesa-optimizers vs “steered optimizers” for specifics about how I frame the inner alignment problem in the context of brain-like algorithms.)

Now, for the rats, there’s an evolutionarily adaptive goal of “when in a salt-deprived state, try to eat salt”. The genome is “trying” to install that goal in the rat’s brain. And apparently, it worked! That goal was installed! And remarkably, that goal was installed even before that situation was ever encountered! So it’s worth studying this example—perhaps we can learn from it!

Before we get going on that, one more boring but necessary thing:

Aside: Obligatory post-replication-crisis discussion

The dead sea salt experiment strikes me as trustworthy. Pretty much all the rats—and for key aspects literally every tested rat—displayed an obvious qualitative behavioral change almost instantaneously upon injection. There were sensible tests with control levers and with control rats. The authors seem to have tested exactly one hypothesis, and it’s a hypothesis that was a priori plausible and interesting. And so on. I can’t assess every aspect of the experiment, but from what I see, I believe this experiment, and I’m taking its results at face value. Please do comment if you see anything questionable.

Outline of the rest of the post

Next I’ll go through my hypothesis for how the rat brain works its magic here. Actually, I’ve come up with three variants of this hypothesis over the past year or so, and I’ll talk through all of them, in chronological order. Then I’ll speculate briefly on other possible explanations.

My hypothesis for how the rat brain did what it did

The overall story

As I discussed in My Computational Framework for the Brain, my starting-point assumption is that the rat brain has a “neocortex subsystem” (really the neocortex, hippocampus, parts of the thalamus and basal ganglia, maybe other things too). The neocortex subsystem takes inputs, builds a predictive model from scratch, and then chooses thoughts and actions that maximize reward. The reward, in turn, is issued by a different subsystem of the brain that I’ll call “subcortex”.

To grossly oversimplify the “neocortex builds a predictive model” part of that, let’s just say for present purposes that the neocortex subsystem memorizes patterns in the inputs, and then patterns in the patterns, and so on.

To grossly oversimplify the “neocortex chooses thoughts and actions that maximize reward” part, let’s just say for present purposes that different parts of the predictive model are associated with different reward predictions, the reward predictions are updated by a TD learning system that has something to do with dopamine and the basal ganglia, and parts of the model that predict higher reward are favored while parts of the model that predict lower reward are pushed out of mind.
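
To make that oversimplified picture a bit more concrete, here is a minimal TD-learning sketch in Python. Everything in it (the thought names, the learning rate, the discount factor) is invented for illustration; it’s just the textbook TD(0) update applied to “reward predictions attached to parts of the predictive model”, not a claim about how dopamine and the basal ganglia actually implement it.

```python
# Minimal sketch (illustrative only): each "thought" carries a reward prediction,
# the prediction is nudged by the TD error when reward arrives, and thoughts that
# predict higher reward win the competition for what to think about next.

ALPHA = 0.1   # learning rate
GAMMA = 0.9   # discount factor

reward_prediction = {"imagine_salt": 0.0, "imagine_far_wall": 0.0}

def td_update(current_thought, reward, next_thought):
    """Move the current thought's prediction toward reward + discounted next prediction."""
    td_error = (reward + GAMMA * reward_prediction[next_thought]
                - reward_prediction[current_thought])
    reward_prediction[current_thought] += ALPHA * td_error

def choose_thought(candidates):
    """Thoughts predicting higher reward are favored; the rest get pushed out of mind."""
    return max(candidates, key=lambda t: reward_prediction[t])

# Example: a rewarding experience while entertaining "imagine_salt" raises its
# reward prediction, so that thought gets favored from then on.
td_update("imagine_salt", reward=1.0, next_thought="imagine_far_wall")
print(choose_thought(["imagine_salt", "imagine_far_wall"]))   # -> imagine_salt
```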

Since the “predictive model” part is invoked for the “reward-maximization” part, we can say that the neocortex does model-based RL.

(Aside: It’s sometimes claimed in the literature that brains do both model-based and model-free RL. I disagree that this is a fundamental distinction; I think “model-free” = “model-based with a dead-simple model”. See my old comment here.)

Why is this important? Because that brings us to imagination! The neocortex can activate parts of the predictive model not just to anticipate what is about to happen, but also to imagine what may happen, and (relatedly) to remember what has happened.

Now we get a crucial ingredient: I hypothesize that the subcortex somehow knows when the neocortex is imagining the taste of salt. How? This is the part where I have three versions of the story, which I’ll go through shortly. For now, let’s just assume that there is a wire going into the subcortex, and when it’s firing, that means the neocortex is activating the parts of the predictive model that correspond (semantically) to tasting salt.

Basic setup. The subcortex has an incoming signal that tells it that the neocortex is imagining / expecting / remembering the taste of salt. I’ll talk about several possible sources of this signal (here marked “???”) in the next section. Then the subcortex has a hardwired circuit that, whenever the rat is salt-deprived, issues a reward to the neocortex for starting to activate this signal (and negative reward for stopping). The neocortex now finds it pleasing to imagine walking over and drinking the saltwater, and it does so!

And once we have that, the last ingredient is simple: The subcortex has an innate, hardwired circuit that says “If the neocortex is imagining tasting salt, and I am currently salt-deprived, then send a reward to the neocortex.”

OK! So now the experiment begins. The rat is salt-deprived, and it sees the lever appear. That naturally evokes its previous memory of tasting salt, and that thought is rewarded! When the rat imagines walking over and nibbling the lever, it finds that to be a very pleasing (high-reward-prediction) thought indeed! So it goes and does it!

(UPDATE: Commenters point out that this description isn’t quite right—it doesn’t make sense to say that the idea of tasting salt is rewarding per se. Rather, I propose that the subcortex sends a reward related to the time-derivative of how strongly the neocortex is imagining / expecting to taste salt. So the neocortex gets a reward for first entertaining the idea of tasting salt, and another incremental reward for growing that idea into a definite plan. But then it would get a negative reward for dropping that idea. Sorry for the mistake / confusion. Thanks commenters!)
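
For concreteness, here is a toy version of that time-derivative idea. The function, its gain, and the numbers are all made up for illustration; the point is just that the reward tracks the change in the “imagining salt” signal, gated on being salt-deprived, so ramping the thought up into a plan is rewarded and dropping it is penalized.

```python
# Toy sketch of time-derivative reward shaping (illustrative assumptions only).

GAIN = 1.0

def subcortex_reward(imagining_salt_now, imagining_salt_before, salt_deprived):
    """Reward proportional to the change in the 'imagining salt' signal, only while salt-deprived."""
    if not salt_deprived:
        return 0.0
    return GAIN * (imagining_salt_now - imagining_salt_before)

# Entertaining the idea of salt (0.0 -> 0.6) earns a reward, firming it up into a
# definite plan (0.6 -> 1.0) earns a bit more, and dropping the plan is penalized.
print(subcortex_reward(0.6, 0.0, salt_deprived=True))   # 0.6
print(subcortex_reward(1.0, 0.6, salt_deprived=True))   # 0.4
print(subcortex_reward(0.0, 1.0, salt_deprived=True))   # -1.0
```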

Hypothesis 1 for the “imagining taste of salt” signal: The neocortex API enables outputting a prediction for any given input channel

This was my first theory, I guess from last year. As argued by the “predictive coding” people, Jeff Hawkins, Yann LeCun, and many others, the neocortex is constantly predicting what input signals it will receive next, and updating its models when the predictions are wrong. This suggests that it should be possible to stick an arbitrary input line into the neocortex, and then pull out a signal carrying the neocortex’s predictions for that input line. (It would look like a slightly-earlier copy of the input line, with sporadic errors for when the neocortex is surprised.) I can imagine, for example, that if you put an input signal into cortical mini-column #592843 layer 4, and then look at a certain neuron in the same mini-column, you’ll find those predictions.

If this is the case, then the rest is pretty straightforward. The genome wires the salt taste bud signal to wherever it likes in the neocortex, pulls out the corresponding prediction, and we’re done! For the reason described above, that prediction line will also fire when the neocortex is merely imagining the taste of salt.
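
As a cartoon of what that “API” might look like, here is a tiny sketch. The class and its trivial last-value “predictor” are my own illustration, not a model of cortical circuitry; the point is only the interface: wire an input channel in, read the corresponding prediction line out, and note that the same readout is active whether the expectation comes from real input or from top-down imagination.

```python
# Cartoon of Hypothesis 1 (illustrative only): one input line into the neocortex
# plus a readout of the neocortex's prediction for that line.

class ChannelPredictor:
    def __init__(self):
        self.expectation = 0.0   # current prediction for this input channel

    def observe(self, value):
        # Real input arrives; update the (trivial) model, report whether we were surprised.
        surprised = (value != self.expectation)
        self.expectation = value
        return surprised

    def imagine(self, value):
        # The same internal variable can also be driven top-down by imagining or remembering.
        self.expectation = value

    def readout(self):
        # The line the genome would tap under Hypothesis 1: "is salt taste expected right now?"
        return self.expectation

salt_line = ChannelPredictor()
salt_line.observe(1.0)       # the rat actually tastes the salty spray
salt_line.imagine(1.0)       # later: merely imagining that salty taste
print(salt_line.readout())   # 1.0 either way, which is all the subcortex needs
```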

Commentary on hypothesis 1: I have mixed feelings.

On the one hand, I haven’t really come across any independent evidence that this mechanism exists. And, having learned more about the nitty-gritty of neocortex algorithms (the outputs come from layer 5, blah blah blah), I don’t think the neocortex outputs carry this type of data.

On the other hand, I have a strong prior belief that if there are ten ways for the brain to do a certain calculation, and each is biologically and computationally plausible without dramatic architectural change, the brain will do all ten! (Probably in ten different areas of the brain.) After all, evolution doesn’t care much about keeping things elegant and simple. I mean, there is a predictive signal for each input—it has to be there somewhere! And I don’t currently see any reason that this signal couldn’t be extracted from the neocortex. So I feel sorta obligated to believe that this mechanism probably exists.

So anyway, all things considered, I don’t put much weight on this hypothesis, but I also won’t strongly reject it.

With that, let’s move on to the later ideas that I like better.

Hypothesis 2 for the “neocortex is imagining the taste of salt” signal: The neocortex is rewarded for “communicating its thoughts”

This was my second guess, I guess dating to several months ago.

The neocortex subsystem has a bunch of output lines for motor control and whatever else, and it has a special output line S (S for salt).

Meanwhile, the subcortex sends rewards under various circumstances, and one of those circumstances is this: the neocortex is rewarded for sending a signal into S whenever salt is tasted. (The subcortex knows when salt is tasted, because it gets a copy of that same input.)

So now, as the rat lives its life, it stumbles upon the behavior of outputting a signal into S when eating a bite of saltier-than-usual food. This is reinforced, and gradually becomes routine.

The rest is as before: when the rat imagines a salty taste, it reuses the same model, so the same signal goes into S. We did it!

Commentary on hypothesis 2: A minor problem (from the point of view of evolution) is that it would take a while for the neocortex to learn to send a signal into S when eating salt. Maybe that’s OK.

A much bigger potential problem is that the neocortex could learn one pattern where it sends a signal into S when tasting salt, and also learn a different pattern where it sends a signal into S whenever it is salt-deprived, whether it’s thinking about salt or not. That second pattern would, after all, be rewarded too, and I can’t immediately see how to stop it from developing.
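
To spell that failure mode out, here is a toy illustration (mine, not from the paper). It assumes the two reward rules in play under this hypothesis: S-firing is rewarded when salt is tasted (the training signal from this section) and also when the rat is salt-deprived (the steering signal from the basic setup). A neocortex that simply fires S whenever it is salt-deprived collects the second reward without ever thinking about salt.

```python
# Toy illustration of the Hypothesis 2 failure mode (assumptions labeled above).

def subcortex_reward(s_fired, salt_tasted, salt_deprived):
    reward = 0.0
    if s_fired and salt_tasted:
        reward += 1.0   # intended: teach the neocortex to report salt tastes via S
    if s_fired and salt_deprived:
        reward += 1.0   # intended: make salt-related thoughts appealing when salt is needed
    return reward

def degenerate_policy(salt_deprived):
    # Ignore thoughts entirely; just fire S whenever salt-deprived.
    return salt_deprived

# This policy gets rewarded on every salt-deprived timestep without any salt-thinking at all.
print(subcortex_reward(degenerate_policy(True), salt_tasted=False, salt_deprived=True))   # 1.0
```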

So I’m pretty skeptical about this hypothesis now.

Hypothesis 3 for the “neocortex is imagining the taste of salt” signal (my favorite!): Sorta an “interpretability” approach, probably involving the amygdala

This one comes out of my last post, Supervised Learning of Outputs in the Brain. Now we have a separate brain module that I labeled “supervised learning algorithm”, and which I suspect is primarily located in the amygdala. This module does supervised learning: the salt signal (from the taste buds) functions as the supervisory signal, and a random assortment of neurons in the neocortex subsystem (describing latent variables in the neocortex’s predictive model) function as the inputs to the learned model. Then the supervised learning module learns which patterns in those latent variables tend to reliably predict that salt is about to be tasted. Having done that, when it sees those patterns recur, that’s our signal that the neocortex is probably expecting the taste of salt … and as described above, it will also see those same patterns when the neocortex is merely imagining or remembering the taste of salt. So we have our signal!
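
Here is a minimal sketch of what that supervised learning module could amount to computationally: an online logistic-regression probe from neocortex latent variables to “salt is about to be tasted”. The class name, dimensions, and learning rate are illustrative assumptions, and nothing here is meant as a claim about the amygdala’s actual circuitry.

```python
import numpy as np

class SaltDetector:
    """Online logistic-regression probe: latent variables in, 'expecting salt' score out."""

    def __init__(self, n_latents, lr=0.05):
        self.w = np.zeros(n_latents)
        self.b = 0.0
        self.lr = lr

    def predict(self, latents):
        # Probability that salt is about to be tasted, given the current neocortex state.
        return 1.0 / (1.0 + np.exp(-(self.w @ latents + self.b)))

    def update(self, latents, salt_tasted):
        # One supervised step: the taste-bud signal (0 or 1) is the supervisory signal.
        error = salt_tasted - self.predict(latents)
        self.w += self.lr * error * latents
        self.b += self.lr * error

# Each timestep, the module reads the latents and the taste buds, updates itself,
# and its prediction doubles as the "neocortex is thinking about salt" signal
# (the "???" wire from the basic-setup figure), since imagining salt reuses the
# same latent variables as actually tasting it.
detector = SaltDetector(n_latents=100)
latents = np.random.rand(100)               # stand-in for neocortex latent variables
detector.update(latents, salt_tasted=1.0)
thinking_about_salt = detector.predict(latents)
```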

Commentary on Hypothesis 3: There’s a lot I really like about this. It seems to at least vaguely match various things I’ve seen in the literature about the functionality and connectivity of the amygdala. It makes a lot of sense from a design perspective—the patterns would be learned quickly and reliably, etc., as far as I can tell. I find it satisfyingly obvious and natural (in retrospect). So I would put this forward as my favorite hypothesis by far.

It also transfers in an obvious way to AGI programming, where it would correspond to something like an automated “interpretability” module that tries to make sense of the AGI’s latent variables by correlating them with some other labeled properties of the AGI’s inputs. The interpretability module’s output is then used to reward the AGI for “thinking about the right things”, which in turn helps turn those thoughts into the AGI’s goals, using the time-derivative reward-shaping trick described above.
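
As a purely speculative sketch of that AGI-side setup (not something I’m claiming anyone has built): fit a simple probe from the AGI’s latent variables to a labeled property of its inputs, then use the time-derivative of the probe’s score as a shaping reward. All of the names and shapes below are invented for illustration.

```python
import numpy as np

def fit_probe(latents, labels, lr=0.05, epochs=50):
    """Fit a logistic-regression probe; latents is (N, D), labels is (N,) of 0/1."""
    w = np.zeros(latents.shape[1])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(latents, labels):
            p = 1.0 / (1.0 + np.exp(-(w @ x + b)))
            w += lr * (y - p) * x
            b += lr * (y - p)
    return lambda x: 1.0 / (1.0 + np.exp(-(w @ x + b)))

def shaping_reward(probe, latents_now, latents_before, gain=1.0):
    """Reward the agent for ramping up the probed 'thought'; penalize dropping it."""
    return gain * (probe(latents_now) - probe(latents_before))
```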

(Is this a good design idea that AGI programmers should adopt? I don’t know, but I find it interesting, and at least worthy of further thought. I don’t recall coming across this idea before in the context of inner alignment.)

What would other possible explanations for the rat experiment look like?

The theoretical follow-up by Dayan & Berridge is worth reading, but I don’t think they propose any real answers, just lots of literature and interesting ideas at a somewhat-more-vague level.

Next: What would Steven Pinker say? He is my representative advocate of a certain branch of cognitive neuroscience—a branch to which I do not subscribe. Of course I don’t know what he would say, but maybe it’s a worthwhile exercise for me to at least try. Well, first, I think he would reject the idea that there’s a “neocortex subsystem”. And I think he would more generally reject the idea that there is any interesting question along the lines of “how does the reward system know that the rat is thinking about salt?”. Of course I want to pose that question, because I come from a perspective of “things need to be learned from scratch” (again see My Computational Framework for the Brain). But Pinker would not be coming from that perspective. I think he wants to assume that a comparatively elaborate world-modeling infrastructure is already in place, having been hardcoded by the genome. So maybe he would say there’s a built-in “diet module” which can model and understand food, taste, satiety, etc., and he would say there’s a built-in “navigation module” which can plan a route to walk over to the lever, and he would say there’s a built-in “3D modeling module” which can make sense of the room and lever, etc. etc.

OK, now that possibly-strawman-Steven-Pinker has had his say in the previous paragraph, I can respond. I don’t think this is so far off as a description of the calculations done by an adult brain. In ML we talk about “how the learning algorithm works” (SGD, BatchNorm, etc.), and separately (and much less frequently!) we talk about “how the trained model works” (OpenAI Microscope, etc.). I want to put all that infrastructure in the previous paragraph at the “trained model” level, not the “learning algorithm” level. Why? First, because I think there’s pretty good evidence for cortical uniformity. Second—and I know this sounds stupid—because I personally am unable to imagine how this setup would work in detail. How exactly do you insert learned content into the innate framework? How exactly do you interface the different modules with each other? And so on. Obviously, yes I know, it’s possible that answers exist, even if I can’t figure them out. But that’s where I’m at right now.