Inner alignment in the brain

Abstract: We can think of the brain crudely as (1) a neocortex which runs an amazingly capable quasi-general-purpose learning-and-planning algorithm, and (2) subcortical structures (midbrain, etc.), one of whose functions is to calculate rewards that get sent up to the neocortex to direct it. But the relationship is actually more complicated than that. “Reward” is not the only informational signal sent up to the neocortex; meanwhile information is also flowing back down in the opposite direction. What’s going on? How does all this work? Where do emotions fit in? Well, I’m still confused on many points, but I think I’m making progress. In this post I will describe my current picture of this system.

Back­ground & motivation

I’m in­ter­ested in helping en­sure a good post-AGI fu­ture. But how do we think con­cretely about AGI, when AGI doesn’t ex­ist and we don’t know how to build it? Three paths:

  1. We can think gen­er­ally about the na­ture of in­tel­li­gence and agency—a re­search pro­gram fa­mously as­so­ci­ated with MIRI, Mar­cus Hut­ter, etc.;

  2. We can think about to­day’s AI sys­tems—a re­search pro­gram fa­mously as­so­ci­ated with OpenAI, Deep­Mind, CHAI, etc.;

  3. We can start from the one “gen­eral in­tel­li­gence” we know about, i.e. the hu­man brain, and try to go from there to les­sons about how AGI might be built, what it might look like, and how it might be safely and benefi­cially used and con­trol­led.

I like this 3rd re­search pro­gram; it seems to be al­most com­pletely ne­glected,[1] and I think there’s a ton of low-hang­ing fruit there. Also, this pro­gram will be es­pe­cially im­por­tant if we build AGI in part by re­verse-en­g­ineer­ing (or rein­vent­ing) high-level neo­cor­ti­cal al­gorithms, which (as dis­cussed be­low) I think is very plau­si­ble, maybe even likely—for bet­ter or worse.

Now, the brain is di­vided into the neo­cor­tex (75% of the brain by weight), and the sub­cor­tex (the other 25%).

Start with the neocortex.[2] The neocortex does essentially all the cool exciting intelligent things that humans do, like building an intelligent world-model involving composition and hierarchies and counterfactuals and analogies and meta-cognition etc., and using that thing to cure diseases and build rocket ships and create culture etc. Thus, both neuroscientists and AI researchers focus a lot of attention onto the neocortex, and on understanding and reverse-engineering its algorithms. Textbooks divide the neocortex into lots of functional regions like “motor cortex” and “visual cortex” and “frontal lobe” etc., but microscopically it’s all a pretty uniform 6-layer structure, and I currently believe that all parts of the neocortex are performing more-or-less the same algorithm, but with different input and output connections. These connections are seeded by an innate gross wiring diagram and then edited by the algorithm itself. See Human Instincts, Symbol Grounding, and the Blank-Slate Neocortex for discussion and (heavy!) caveats on that claim. And what is this algorithm? I outline some of (what I think are) the high-level specifications at Predictive coding = RL + SL + Bayes + MPC. In terms of how the algorithm actually works, I think that researchers are making fast progress towards figuring this out, and that a complete answer is already starting to crystallize into view on the horizon. For a crash course on what’s known today on how the neocortex does its thing, maybe a good starting point would be to read On Intelligence and then every paper ever written by Dileep George (and citations therein).

The sub­cor­tex, by con­trast, is not a sin­gle con­figu­ra­tion of neu­rons tiled over a huge vol­ume, but rather it is a col­lec­tion of quite di­verse struc­tures like the amyg­dala, cere­bel­lum, tec­tum, and so on. Un­like the neo­cor­tex, this stuff does not perform some mirac­u­lous com­pu­ta­tion light-years be­yond to­day’s tech­nol­ogy; as far as I can tell, it ac­com­plishes the same sorts of things as AlphaS­tar does. And the most im­por­tant thing to un­der­stand (for AGI safety) is this:

The sub­cor­tex pro­vides the train­ing sig­nals that guide the neo­cor­tex to do biolog­i­cally-use­ful things.[3]

Now, if peo­ple build AGI that uses al­gorithms similar to the neo­cor­tex, we will need to provide it with train­ing sig­nals. What ex­actly are these train­ing sig­nals? What in­ner al­ign­ment is­sues might they pre­sent? Sup­pose we wanted to make an AGI that was pro-so­cial for the same un­der­ly­ing rea­son as hu­mans are (some­times) pro-so­cial (i.e., thanks to the same com­pu­ta­tion); is that pos­si­ble, how would we do it, and would it work re­li­ably? Th­ese are ques­tions we should an­swer well be­fore we finish re­verse-en­g­ineer­ing the neo­cor­tex. I mean, re­ally these ques­tions should have been an­swered be­fore we even started re­verse-en­g­ineer­ing the neo­cor­tex!! I don’t have an­swers to those ques­tions, but I’m try­ing to lay ground­work in that di­rec­tion. Bet­ter late than never…

Things to keep in mind

Be­fore we get into the weeds, here are some ad­di­tional men­tal pic­tures we’ll need go­ing for­ward:

Sim­ple ex­am­ple: Fear of spiders

My go-to ex­am­ple for the re­la­tion be­tween sub­cor­tex and neo­cor­tex is fear of spi­ders.[4] Be­sides the vi­sual cor­tex, hu­mans have a lit­tle-known sec­ond vi­sion sys­tem in the mid­brain (su­pe­rior col­licu­lus). When you see a black scut­tling thing in your field of view, the mid­brain vi­sion sys­tem de­tects that and sends out a re­ac­tion that makes us look in that di­rec­tion and in­crease our heart rate and flinch away from it. Mean­while, the neo­cor­tex is si­mul­ta­neously see­ing the spi­der with its vi­sion sys­tem, and it’s see­ing the hor­mones and bod­ily re­ac­tion go­ing on, and it con­nects the dots to learn that “spi­ders are scary”. In the fu­ture, if the neo­cor­tex merely imag­ines a spi­der, it might cause your heart to race and body to flinch. On the other hand, af­ter ex­po­sure ther­apy, we might be able to re­main calm when imag­in­ing or even see­ing a spi­der. How does all this work?

(Note again the differ­ent ca­pa­bil­ities of the mid­brain and neo­cor­tex: The mid­brain has cir­cuitry to rec­og­nize black scut­tling things—kinda like to­day’s CNNs can—whereas the neo­cor­tex is able to con­struct and use a rich se­man­tic cat­e­gory like “spi­ders”.)

We’ll be re­turn­ing to this ex­am­ple over and over in the post, try­ing to work through how it might be im­ple­mented and what the con­se­quences are.

The neo­cor­tex is a black box from the per­spec­tive of the subcortex

The neo­cor­tex’s al­gorithm, as I un­der­stand it, sorta learns pat­terns, and pat­terns in the pat­terns, etc., and each pat­tern is rep­re­sented as an es­sen­tially ran­domly-gen­er­ated[5] set of neu­rons in the neo­cor­tex. So, if X is a con­cept in your neo­cor­ti­cal world-model, there is no straight­for­ward way for an in­nate in­stinct to re­fer di­rectly to X—say, by wiring ax­ons from the neu­rons rep­re­sent­ing X to the re­ward cen­ter—be­cause X’s neu­rons are not at pre­de­ter­mined lo­ca­tions. X is in­side the black box. An in­stinct can in­cen­tivize X, at least to some ex­tent, but it has to be done in­di­rectly.
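To make the "black box" point concrete, here is a toy Python sketch. It is only an illustration under made-up assumptions: the function `learn_concepts`, the neuron counts, and the idea of a fixed `hardwired_tap` are all hypothetical stand-ins, not claims about real neural coding. It shows why a wire laid down in advance at fixed neuron locations cannot reliably point at a learned concept whose neurons are assigned at random.

```python
import random

def learn_concepts(concepts, n_neurons, seed):
    """Toy model: each learned concept ends up on a random set of 20 neurons."""
    rng = random.Random(seed)
    return {c: frozenset(rng.sample(range(n_neurons), 20)) for c in concepts}

CONCEPTS = ["spider", "coffee", "justice"]

# Two hypothetical brains learn the same concepts with different random histories.
alice = learn_concepts(CONCEPTS, n_neurons=10_000, seed=1)
bob = learn_concepts(CONCEPTS, n_neurons=10_000, seed=2)

# A hypothetical innate wire listens to the same fixed neuron locations in everyone.
# Here it happens to coincide with Alice's "spider" neurons.
hardwired_tap = alice["spider"]

overlap_alice = len(hardwired_tap & alice["spider"])  # all 20 neurons match
overlap_bob = len(hardwired_tap & bob["spider"])      # matches only by chance
```

In this toy, the tap that works for one brain is close to useless for the other, which is the sense in which an instinct has to reach a concept indirectly.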

I made a list of var­i­ous ways that we can have uni­ver­sal in­stincts de­spite the neo­cor­tex be­ing a black-box learn­ing al­gorithm: See Hu­man In­stincts, Sym­bol Ground­ing, and the Blank-Slate Neo­cor­tex for my list.

This blog post is a much deeper dive into how a cou­ple of these mechanisms might be ac­tu­ally im­ple­mented.

Gen­eral picture

Fi­nally, here is the cur­rent pic­ture in my head:

There’s a lot here. Let’s go through it bit by bit.

Emo­tions, “emo­tion con­cepts”, and “re­ac­tions”

One as­pect of this pic­ture is emo­tions. There’s a school of thought, pop­u­larized by Paul Ek­man and the movie In­side Out, that there are ex­actly six emo­tions (anger, dis­gust, fear, hap­piness, sad­ness, sur­prise), each with its own uni­ver­sal fa­cial ex­pres­sion. (I’ve seen other lists of emo­tions too, and some­times there’s also a list of so­cial emo­tions like em­bar­rass­ment, jeal­ousy, guilt, shame, pride, etc.) That was my be­lief too, un­til I read the book How Emo­tions Are Made by Lisa Feld­man Bar­rett, which con­vinc­ingly ar­gues against it. Bar­rett ar­gues that a word like “anger” lumps to­gether a lot of very differ­ent bod­ily re­sponses in­volv­ing differ­ent fa­cial ex­pres­sions, hor­mones, etc. Ba­si­cally, emo­tional con­cepts, like other con­cepts, are ar­bi­trary cat­e­gories de­scribing things that we find use­ful to lump to­gether. Sure, they might be lumped to­gether be­cause they share a com­mon hor­mone change or a com­mon fa­cial ex­pres­sion, but they might just as likely be lumped to­gether be­cause they share a com­mon situ­a­tional con­text, or a com­mon set of as­so­ci­ated so­cial norms, or what­ever else. And an emo­tion con­cept with an English-lan­guage name like “anger” is not fun­da­men­tally differ­ent from an idiosyn­cratic emo­tion con­cept like “How Alice must have felt in that TV epi­sode where...”.

(In­ci­den­tally, while I think Bar­rett’s book is right about that, I am definitely not blan­ket-en­dors­ing the whole book—there are a lot of other claims in it that I don’t agree with, or per­haps don’t un­der­stand.[6] I think Bar­rett would strongly dis­agree with most of this blog post, though I could be wrong.)

So in­stead of putting “emo­tions” in the sub­cor­tex, I in­stead put there a bunch of things I’m call­ing “re­ac­tions” for clar­ity.[7] I imag­ine that there are dozens to hun­dreds of these (...and sep­a­rat­ing them into a dis­crete list is prob­a­bly an over­sim­plifi­ca­tion of a more com­pli­cated com­pu­ta­tional ar­chi­tec­ture, but I will any­way). There’s the re­ac­tion that gets trig­gered when your mid­brain vi­sion sys­tem sees a spi­der mov­ing to­wards you out of the cor­ner of your eye, as dis­cussed above. And there’s a differ­ent re­ac­tion that gets trig­gered when you stand at the edge of a precipice and peer over the edge. Both of those re­ac­tions might be cat­e­go­rized as “fear” in the neo­cor­tex, but they’re re­ally differ­ent re­ac­tions, in­volv­ing (I pre­sume) differ­ent changes to heart rate, differ­ent bod­ily mo­tions, differ­ent fa­cial ex­pres­sions, differ­ent quan­tities of (nega­tive) re­ward, etc. (Re­ac­tions where periph­eral vi­sion is helpful will sum­mon a wide-eyed fa­cial ex­pres­sion; re­ac­tions where vi­sual acu­ity is helpful will sum­mon a nar­row-eyed fa­cial ex­pres­sion; and so on.)

As de­scribed above for the spi­der ex­am­ple, the neo­cor­tex can see what the sub­cor­tex does to our hor­mones, body, face, etc., and it can learn to pre­dict that, and build those ex­pec­ta­tions into its pre­dic­tive world-model, and cre­ate con­cepts around that.

(I also put “pain con­cept” in the neo­cor­tex, again fol­low­ing Bar­rett. A gi­ant part of the pain con­cept is no­ci­cep­tion—de­tect­ing the in­com­ing nerve sig­nals we might call “pain sen­sa­tions”. But at the end of the day, the neo­cor­tex gets to de­cide whether or not to clas­sify a situ­a­tion as “pain”, based on not only no­ci­cep­tion but also things like con­text and valence.)

The neo­cor­tex’s non-mo­tor outputs

From the above, our neo­cor­tex comes to ex­pect that if we see a scut­tling spi­der out of the cor­ner of our eye, our heart will race and we’ll turn to­wards it and flinch away. What’s miss­ing from this pic­ture? The neo­cor­tex caus­ing our heart to race by an­ti­ci­pat­ing a spi­der. It’s easy to see why this would be evolu­tion­ar­ily use­ful: If I know (with my neo­cor­tex) that a poi­sonous spi­der is ap­proach­ing, it’s ap­pro­pri­ate for my heart to start rac­ing even be­fore my mid­brain sees the black scut­tling blob.

Now we’re at the top-left ar­row in the di­a­gram above: the neo­cor­tex caus­ing (in this case) re­lease of stress hor­mones. How does the neo­cor­tex learn to do that?

There are two parts of this “how” ques­tion: (1) what are the ac­tual out­put knobs that the neo­cor­tex can use, and (2) how does the neo­cor­tex de­cide to use them? For (1), I have no idea. For the pur­pose of this blog post, let us as­sume that there is a set of out­go­ing ax­ons from the neo­cor­tex that (di­rectly or in­di­rectly) cause hor­mone re­lease, and also as­sume that “hor­mone re­lease” is the right thing to be talk­ing about in terms of con­trol­ling valence, arousal, and so on. I have very low con­fi­dence in all this, but I don’t think it mat­ters much for what I want to say in this post.

I mainly want to dis­cuss ques­tion (2): given these out­put knobs, how does the neo­cor­tex de­cide to use them?

Re­call again that in pre­dic­tive cod­ing, the neo­cor­tex finds gen­er­a­tive mod­els which are con­sis­tent with each other, which have not been re­peat­edly falsified, and which pre­dict that re­ward will hap­pen.

My first thought was: No additional ingredients, beyond that normal predictive coding picture, are needed to get the neocortex to imitate the subcortical hormone outputs. Remember, just like in my post on predictive coding and motor control, the neocortex will discover and store generative models that entail “self-fulfilling prophecies”, where a single generative model in the neocortex simultaneously codes for a prediction of stress hormone and the neocortical output signals that actually cause the release of this stress hormone. Thus (...I initially thought...), after seeing spiders and stress hormones a few times, the neocortex will predict stress hormones when it sees a spider, which incidentally creates stress hormones.

But I don’t think that’s the right an­swer, at least not by it­self. After all, the neo­cor­tex will also learn a gen­er­a­tive model where stress hor­mone is gen­er­ated ex­oge­nously (e.g. by the sub­cor­ti­cal spi­der re­ac­tion) and where the neo­cor­tex’s own stress hor­mone gen­er­a­tion knob is left un­touched. This lat­ter model is is­su­ing perfectly good pre­dic­tions, so there is no rea­son that the neo­cor­tex would spon­ta­neously throw it out and start us­ing in­stead the self-fulfilling-prophecy model. (By the same to­ken, in the mo­tor con­trol case, if I think you are go­ing to take my limp arm and lift it up, I have no prob­lem pre­dict­ing that my arm will move due to that ex­oge­nous force; my neo­cor­tex doesn’t get con­fused and start is­su­ing mo­tor com­mands.)

So here’s my sec­ond, bet­ter story:

Re­ward crite­rion (one among many): when the sub­cor­tex calls for a re­ac­tion (e.g. cor­ti­sol re­lease, eyes widen­ing, etc.), it re­wards the neo­cor­tex with dopamine if it sees that those com­mands have some­how already been is­sued.

(Up­date 2020/​10: I now think this isn’t quite how it’s done … see here.)

So if the sub­cor­tex com­putes that a situ­a­tion calls for cor­ti­sol, the neo­cor­tex is re­warded if the sub­cor­tex sees that cor­ti­sol is already flow­ing. This ex­am­ple seems in­tro­spec­tively rea­son­able: See­ing a spi­der out of the cor­ner of your eye is bad, but be­ing sur­prised to see a spi­der when you were feel­ing safe and re­laxed is even worse (worse in terms of dopamine, not nec­es­sar­ily worse in terms of valence—re­mem­ber want­ing ≠ lik­ing). Pre­sum­ably the same prin­ci­ple can ap­ply to eye-widen­ing and other things.

To be clear, this is one re­ward crite­rion among many—the sub­cor­tex is­sues pos­i­tive and nega­tive re­wards ac­cord­ing to other crite­ria too (as in the di­a­gram above, I think differ­ent re­ac­tions in­her­ently is­sue pos­i­tive or nega­tive re­wards to the neo­cor­tex, just like they in­her­ently is­sue mo­tor com­mands and hor­mone com­mands). But as long as this “re­ward crite­rion” above is per­ma­nently in place, then thanks to the laws of the neo­cor­tex’s gen­er­a­tive model econ­omy, the neo­cor­tex will drop those gen­er­a­tive mod­els that pas­sively an­ti­ci­pate the sub­cor­tex’s re­ac­tions, in fa­vor of mod­els that ac­tively an­ti­ci­pate /​ imi­tate the sub­cor­ti­cal re­ac­tions, in­so­far as that’s pos­si­ble (the neo­cor­tex doesn’t have out­put knobs for ev­ery­thing).
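Here is a toy Python sketch of the argument in this section. It is a cartoon under stated assumptions, not a model of real circuitry: the dictionaries, the function names, and the scoring rule "reward minus prediction error" are all invented for illustration (the last one is a crude stand-in for the "generative model economy"). The point it shows is that the passive model and the self-fulfilling model predict equally well, so only the reward criterion separates them.

```python
def make_model(name, predicts_cortisol, turns_cortisol_knob):
    """A generative model: what it expects to see, and whether it also drives the output knob."""
    return {"name": name,
            "predicts_cortisol": predicts_cortisol,
            "turns_cortisol_knob": turns_cortisol_knob}

def prediction_error(model, cortisol_observed):
    # Zero error when the model's expectation matches what actually happened.
    return 0.0 if model["predicts_cortisol"] == cortisol_observed else 1.0

def anticipation_reward(model, subcortex_calls_for_cortisol):
    """Toy reward criterion: pay out if cortisol was already flowing
    (because the neocortex turned its own knob) when the subcortex called for it."""
    already_flowing = model["turns_cortisol_knob"]
    return 1.0 if (subcortex_calls_for_cortisol and already_flowing) else 0.0

passive = make_model("passive", predicts_cortisol=True, turns_cortisol_knob=False)
active = make_model("self-fulfilling", predicts_cortisol=True, turns_cortisol_knob=True)
models = (passive, active)

# A spider appears: the subcortex calls for cortisol, and cortisol flows either way.
errors = {m["name"]: prediction_error(m, cortisol_observed=True) for m in models}
rewards = {m["name"]: anticipation_reward(m, subcortex_calls_for_cortisol=True) for m in models}

# Assumed scoring rule: reward minus prediction error.
winner = max(models, key=lambda m: rewards[m["name"]] - errors[m["name"]])["name"]
```

Both models have zero prediction error here, so without the reward term there would be nothing to choose between them.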

Pre­dict­ing, imag­in­ing, re­mem­ber­ing, empathizing

The neo­cor­tex’s gen­er­a­tive mod­els ap­pear in the con­text of (1) pre­dic­tion (in­clud­ing pre­dict­ing the im­me­di­ate fu­ture as it hap­pens), (2) imag­i­na­tion, (3) mem­ory, and (4) em­pa­thetic simu­la­tion (when we imag­ine some­one else re­act­ing to a spi­der, pre­dict­ing a spi­der, etc.). I think all four of these pro­cesses rely on fun­da­men­tally the same mechanism in the neo­cor­tex, so by de­fault the same gen­er­a­tive mod­els will be used for all four. Thus, we get the same hor­mone out­puts in all four of these situ­a­tions.

Hang on, you say: That doesn’t seem right! If it were the ex­act same gen­er­a­tive mod­els, then when we re­mem­ber danc­ing, we would ac­tu­ally is­sue the mo­tor com­mands to start danc­ing! Well, I an­swer, we do ac­tu­ally some­times move a lit­tle bit when we re­mem­ber a mo­tion! I think the rule is, loosely speak­ing, the top-down in­for­ma­tion flow is much stronger (more con­fi­dent) when pre­dict­ing, and much weaker for imag­i­na­tion, mem­ory, and em­pa­thy. Thus, the neo­cor­ti­cal out­put sig­nals are weaker too, and this ap­plies to both mo­tor con­trol out­puts and hor­mone out­puts. (In­ci­den­tally, I think mo­tor con­trol out­puts are fur­ther sub­ject to thresh­old­ing pro­cesses, down­stream of the neo­cor­tex, and there­fore a suffi­ciently weak mo­tor com­mand causes no mo­tion at all.)
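The strength-plus-threshold idea can be sketched in a few lines of Python. All numbers here are made up, and treating hormone outputs as unthresholded is an assumption of this toy rather than something established above.

```python
MOTOR_THRESHOLD = 0.5  # made-up value standing in for thresholding downstream of the neocortex

# Made-up top-down signal strengths for the four ways a generative model gets used.
TOP_DOWN_STRENGTH = {
    "predicting": 1.0,
    "imagining": 0.3,
    "remembering": 0.3,
    "empathizing": 0.3,
}

def outputs(mode):
    """Return the toy motor and hormone outputs for a given mode."""
    strength = TOP_DOWN_STRENGTH[mode]
    # Motor commands below the threshold cause no motion at all.
    motor = strength if strength >= MOTOR_THRESHOLD else 0.0
    # Assumption of this sketch: hormone output scales with strength, with no threshold.
    hormone = strength
    return {"motor": motor, "hormone": hormone}
```

In this toy, `outputs("predicting")` produces full motor and hormone output, while `outputs("remembering")` produces a weak hormone signal and no movement.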

As dis­cussed more be­low, the sub­cor­tex re­lies on the neo­cor­tex’s out­puts to guess what the neo­cor­tex is think­ing about, and is­sue evolu­tion­ar­ily-ap­pro­pri­ate guidance in re­sponse. Pre­sum­ably, to do this job well, the sub­cor­tex needs to know whether a given neo­cor­ti­cal out­put is part of a pre­dic­tion, or mem­ory, or imag­i­na­tion, or em­pa­thetic simu­la­tion. From the above para­graph, I think it can dis­t­in­guish pre­dic­tions from the other three by the neo­cor­ti­cal out­put strength. But how does it tell mem­ory, imag­i­na­tion, and em­pa­thetic simu­la­tion apart from each other? I don’t know! Then that sug­gests to me an in­ter­est­ing hy­poth­e­sis: maybe it can’t! What if some of our weirder in­stincts re­lated to mem­ory or coun­ter­fac­tual imag­i­na­tion are not adap­tive at all, but rather crosstalk from so­cial in­stincts, or vice-versa? For ex­am­ple, I think there’s a re­ac­tion in the sub­cor­tex that listens for a strong pre­dic­tion of lower re­ward, al­ter­nat­ing with a weak pre­dic­tion of higher re­ward; when it sees this com­bi­na­tion, it is­sues nega­tive re­ward and nega­tive valence. Think about what this sub­cor­ti­cal re­ac­tion would do in the three differ­ent cases: If the weak pre­dic­tion it sees is an em­pa­thetic simu­la­tion, well, that’s the core of jeal­ousy! If the weak pre­dic­tion it sees is a mem­ory, well, that’s the core of loss aver­sion! If the weak pre­dic­tion it sees is a coun­ter­fac­tual imag­i­na­tion, well, that’s the core of, I guess, that an­noy­ing feel­ing of hav­ing missed out on some­thing good. Seems to fit to­gether pretty well, right? I’m not su­per con­fi­dent, but at least it’s food for thought.
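A toy Python version of that hypothesized reaction is below. The function names, the label table, and the choice to scale the penalty by the size of the gap are all assumptions made for illustration. The key feature is that the reaction itself never sees where the weak prediction came from; the source is only used afterwards to name the result.

```python
def contrast_reaction(strong_prediction_reward, weak_prediction_reward):
    """Toy subcortical reaction: fires when a strong prediction of lower reward
    alternates with a weak prediction of higher reward."""
    if weak_prediction_reward > strong_prediction_reward:
        gap = weak_prediction_reward - strong_prediction_reward
        # Assumption: the penalty scales with the gap.
        return {"reward": -gap, "valence": -gap}
    return {"reward": 0.0, "valence": 0.0}

# The same output gets a different everyday name depending on the weak prediction's source.
LABELS = {
    "empathetic simulation": "jealousy",
    "memory": "loss aversion",
    "counterfactual imagination": "feeling of having missed out",
}

def describe(source, strong_prediction_reward, weak_prediction_reward):
    out = contrast_reaction(strong_prediction_reward, weak_prediction_reward)
    return LABELS[source] if out["reward"] < 0 else "no reaction"
```

For example, `describe("memory", 1.0, 3.0)` and `describe("empathetic simulation", 1.0, 3.0)` run the identical computation and differ only in the label.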

Var­i­ous implications

Open­ing a win­dow into the black-box neocortex

Each sub­cor­ti­cal re­ac­tion has its own pro­file of fa­cial, body, and hor­mone changes. The “re­ward crite­rion” above en­sures that the neo­cor­tex will learn to imi­tate the char­ac­ter­is­tic con­se­quences of re­ac­tion X when­ever it is ex­pect­ing, imag­in­ing, re­mem­ber­ing, or em­pa­thet­i­cally simu­lat­ing re­ac­tion X. This is then a win­dow for the sub­cor­tex to get a glimpse into the go­ings-on in­side the black-box neo­cor­tex.

In our run­ning ex­am­ple, if the spi­der re­ac­tion cre­ates a cer­tain com­bi­na­tion of fa­cial, body, and hor­mone changes, then the sub­cor­tex can watch for this set of changes to hap­pen ex­oge­nously (from its per­spec­tive), and if it does, the sub­cor­tex can in­fer that the neo­cor­tex was maybe think­ing about spi­ders. Per­haps the sub­cor­tex might then is­sue its own spi­der re­ac­tion, flesh­ing out the neo­cor­tex’s weak imi­ta­tion. Or per­haps it could do some­thing en­tirely differ­ent.
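One way to picture this inference is as profile matching. The Python sketch below is purely illustrative: the profile numbers, the three-feature encoding, and the use of cosine similarity with a cutoff are all assumptions, not a claim about how the subcortex actually does it.

```python
# Made-up reaction profiles: (heart rate change, eye widening, cortisol change).
PROFILES = {
    "spider": (0.8, 0.9, 0.7),
    "precipice": (0.6, 0.2, 0.9),
    "safe_and_calm": (-0.5, -0.2, -0.6),
}

def cosine(a, b):
    """Cosine similarity, so a weak imitation still matches the shape of a profile."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def guess_neocortical_topic(observed, min_similarity=0.9):
    """Guess which reaction the neocortex is imitating, or None if nothing matches well."""
    best = max(PROFILES, key=lambda name: cosine(observed, PROFILES[name]))
    return best if cosine(observed, PROFILES[best]) >= min_similarity else None

# A weak (30% strength) imitation of the spider profile, arriving exogenously.
weak_spider_imitation = (0.24, 0.27, 0.21)
guess = guess_neocortical_topic(weak_spider_imitation)
```

Because the match depends on the shape of the changes rather than their size, even a faint imitation is enough for the toy subcortex to guess the topic.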

I have a hunch that so­cial emo­tions rely on this. With this mechanism, it seems that the sub­cor­tex can build a hi­er­ar­chy of in­creas­ingly com­pli­cated so­cial re­ac­tions: “if I’m ac­ti­vat­ing re­ac­tion A, and I think you’re ac­ti­vat­ing re­ac­tion B, then that trig­gers me to feel re­ac­tion C”, “if I’m ac­ti­vat­ing re­ac­tion C, and I think you’re ac­ti­vat­ing re­ac­tion A, then that trig­gers me to feel re­ac­tion D”, and so on. Well, maybe. I’m still hazy on the de­tails here and want to think about it more.
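As a very rough illustration of what such a hierarchy could look like, here is a toy lookup table in Python. The rule table and the `step` function are hypothetical; the letters A to D are just the placeholders from the paragraph above.

```python
# (my current reaction, the reaction I infer in you) -> the reaction triggered in me
RULES = {
    ("A", "B"): "C",
    ("C", "A"): "D",
}

def step(my_reaction, your_inferred_reaction):
    """Apply one social rule; if no rule matches, my reaction stays the same."""
    return RULES.get((my_reaction, your_inferred_reaction), my_reaction)

# Chaining two rules: A plus your B gives C, then C plus your A gives D.
first = step("A", "B")
second = step(first, "A")
```

Each new rule can take an earlier rule's output as its input, which is how a small set of simple rules could stack into more complicated social reactions.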

Com­plex back-and-forth be­tween neo­cor­tex and subcortex

The neo­cor­tex can al­ter the hor­mones and body, which are among the in­puts into the sub­cor­ti­cal cir­cuits. The sub­cor­ti­cal cir­cuits then also al­ter the hor­mones and body, which are among the in­puts into the neo­cor­tex! Around and around it goes! So for ex­am­ple, if you tell your­self to calm down, your neo­cor­tex changes your hor­mones, which in turn in­creases the ac­ti­va­tion of the sub­cor­ti­cal “I am safe and calm” re­ac­tion, which re­in­forces and aug­ments that change, which in turn makes it eas­ier for the neo­cor­tex to con­tinue feel­ing safe and calm! … Un­til, of course, that pleas­ant cy­cle is bro­ken by other sub­cor­ti­cal re­ac­tions or other neo­cor­ti­cal gen­er­a­tive mod­els butting in.

“Over­com­ing” sub­cor­ti­cal reactions

Em­piri­cally, we know it’s pos­si­ble to “over­come” fear of spi­ders, and other sub­cor­ti­cal re­ac­tions. I’m think­ing there are two ways this might work. I think both are hap­pen­ing, but I’m not re­ally sure.

First, there’s sub­cor­ti­cal learn­ing … well, “learn­ing” isn’t the right word here, be­cause it’s not try­ing to match some ground truth. (The only “ground truth” for sub­cor­ti­cal re­ac­tion speci­fi­ca­tions is nat­u­ral se­lec­tion!) I think it’s more akin to the self-mod­ify­ing code in Linux than to the weight up­dates in ML. So let’s call it sub­cor­ti­cal in­put-de­pen­dent dy­namic rewiring rules.

(By the way, el­se­where in the sub­cor­tex, like the cere­bel­lum, there is also real stereo­typ­i­cal “learn­ing” go­ing on, akin to the weight up­dates in ML. That does hap­pen, but it’s not what I’m talk­ing about here.)

Maybe one sub­cor­ti­cal dy­namic rewiring rule says: If the spi­der-de­tec­tion re­ac­tion trig­gers, and then within 3 sec­onds the “I am safe and calm” re­ac­tion trig­gers, then next time the spi­der re­ac­tion should trig­ger more weakly.
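That rule is simple enough to write down directly. The Python sketch below is a toy under stated assumptions: the event format, the `decay` factor of 0.8, and the choice to count each spider trigger at most once are all made up; only the 3-second window comes from the rule as stated.

```python
def apply_rewiring_rule(events, spider_gain=1.0, window_s=3.0, decay=0.8):
    """events: list of (time_in_seconds, reaction_name), sorted by time.
    Returns the spider reaction's gain after applying the toy rule."""
    last_spider_time = None
    for t, name in events:
        if name == "spider":
            last_spider_time = t
        elif name == "safe_and_calm" and last_spider_time is not None:
            if t - last_spider_time <= window_s:
                # Calm followed soon enough: the spider reaction triggers more weakly next time.
                spider_gain *= decay
            # Assumption: each spider trigger is counted at most once.
            last_spider_time = None
    return spider_gain

gain_after_quick_calm = apply_rewiring_rule([(0.0, "spider"), (2.0, "safe_and_calm")])
gain_after_slow_calm = apply_rewiring_rule([(0.0, "spider"), (5.0, "safe_and_calm")])
```

In this toy, calming down within the window weakens the reaction, while calming down too late leaves it unchanged.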

Second, there’s neocortical learning, i.e., the neocortex developing new generative models. Let’s say again that we’re doing exposure therapy for fear of spiders, and let’s say the two relevant subcortical reactions are the spider-detection reaction (which rewards the neocortex for producing anxiety hormones before it triggers) and the “I am safe and calm” reaction (which rewards the neocortex for producing calming hormones before it triggers). (I’m obviously oversimplifying here.) The neocortex could learn generative models that summon the “I am safe and calm” reaction whenever the spider-detection reaction is just starting to trigger. That generative model could potentially get entrenched and rewarded, as the spider-detection reaction is sorta preempted and thus can’t issue a penalty for the lack of anxiety hormones, whereas the “I am safe and calm” reaction does issue a reward for the presence of calm hormones. Something like that?

I have no doubt that the sec­ond of these two pro­cesses—neo­cor­ti­cal learn­ing—re­ally hap­pens. The first might or might not hap­pen, I don’t know. It does seem like some­thing that plau­si­bly could hap­pen, on both evolu­tion­ary and neu­rolog­i­cal grounds. So I guess my de­fault as­sump­tion is that dy­namic rewiring rules for sub­cor­ti­cal re­ac­tions do in fact ex­ist, but again, I’m not sure, I haven’t thought about it much.

Things I still don’t understand

I lumped to­gether the sub­cor­tex into a mono­lithic unit. I ac­tu­ally un­der­stand very lit­tle about the func­tional de­com­po­si­tion be­yond that. The tec­tum and tegmen­tum seem to be do­ing a lot of the calcu­la­tions for what I’m call­ing “re­ac­tions”, in­clud­ing the col­li­culi, which seem to house the sub­cor­ti­cal sen­sory pro­cess­ing. What com­pu­ta­tions does the amyg­dala do, for ex­am­ple? It has 10 mil­lion neu­rons, they have to be calcu­lat­ing some­thing!!! I re­ally don’t know.

As dis­cussed above, I don’t un­der­stand what the non-mo­tor out­put sig­nals from the neo­cor­tex are, or whether things like valence and arousal cor­re­spond to hor­mones or some­thing else.

I’m more gen­er­ally un­cer­tain about ev­ery­thing I wrote here, even where I used a con­fi­dent tone. Hon­estly, I haven’t found much in the sys­tems neu­ro­science liter­a­ture that’s ad­dress­ing the ques­tions I’m in­ter­ested in, al­though I imag­ine it’s there some­where and I’m rein­vent­ing lots of wheels (or re-mak­ing clas­sic mis­takes). As always, please let me know any thoughts, ideas, things you find con­fus­ing, etc. Thanks in ad­vance!

  1. A few peo­ple on this fo­rum are think­ing hard about the brain, and I’ve learned a lot from their writ­ings—es­pe­cially Kaj’s multi-agent se­quence—but my im­pres­sion is that they’re mostly work­ing on the pro­ject of “Let’s un­der­stand the brain so we can an­swer nor­ma­tive ques­tions of what we want AGI to do and how value learn­ing might work”, whereas here I’m talk­ing about “Let’s un­der­stand the brain as a model of a pos­si­ble AGI al­gorithm, and think about whether such an AGI al­gorithm can be used safely and benefi­cially”. Y’all can cor­rect me if I’m wrong :) ↩︎

  2. I will slop­pily use the term “neo­cor­tex” as short­hand for “neo­cor­tex plus other struc­tures that are in­ti­mately con­nected to the neo­cor­tex and are best thought of as part of the same al­gorithm”—this es­pe­cially in­cludes the hip­pocam­pus and tha­la­mus. ↩︎

  3. For what it’s worth, Elon Musk men­tioned in a re­cent in­ter­view about Neu­ral­ink that he is think­ing about the brain this way as well: “We’ve got like a mon­key brain with a com­puter on top of it, that’s the hu­man brain, and a lot of our im­pulses and ev­ery­thing are driven by the mon­key brain, and the com­puter, the cor­tex, is con­stantly try­ing to make the mon­key brain happy. It’s not the cor­tex that’s steer­ing the mon­key brain, it’s the mon­key brain steer­ing the cor­tex.” (14:45). Nor­mally peo­ple would say “lizard brain” rather than “mon­key brain” here, al­though even that ter­minol­ogy is un­fair to lizards, who do in fact have some­thing ho­molo­gous to a neo­cor­tex. ↩︎

  4. Un­for­tu­nately I don’t have good ev­i­dence that this spi­der story is ac­tu­ally true. Does the mid­brain re­ally have spe­cial­ized cir­cuitry to de­tect spi­ders? There was a study that showed pic­tures of spi­ders to a blind­sighted per­son (i.e., a per­son who had an in­tact mid­brain vi­sual pro­cess­ing sys­tem but no vi­sual cor­tex). It didn’t work; noth­ing hap­pened. But I think they did the ex­per­i­ment wrong—I think it has to be a video of a mov­ing spi­der, not a sta­tion­ary pic­ture of a spi­der, to trig­ger the sub­cor­ti­cal cir­cuitry. (Source: in­tro­spec­tion. Also, I think I read that the sub­cor­ti­cal vi­sion sys­tem has pretty low spa­tial re­s­olu­tion, so watch­ing for a char­ac­ter­is­tic mo­tion would seem a sen­si­ble de­sign.) Any­way, it has to work this way, noth­ing else makes sense to me. I’m com­fortable us­ing this ex­am­ple promi­nently be­cause if it turns out that this ex­am­ple is wrong, then I’m so very con­fused that this whole ar­ti­cle is prob­a­bly garbage any­way. For the record, I am ba­si­cally de­scribing Mark John­son’s “two-pro­cess” model, which is I think well es­tab­lished in the case of at­tend­ing-to-faces in hu­mans and filial im­print­ing in chicks (more here), even if it’s spec­u­la­tive when ap­plied to fear-of-spi­ders. ↩︎

  5. I am pretty confident that neocortical patterns are effectively random at a microscopic, neuron-by-neuron level, and that’s what matters when we talk about why it’s impossible for evolution to directly create hardwired instincts that refer to a semantic concept in the neocortex. However, to be clear, at the level of gross anatomy, you can more-or-less predict in advance where different concepts will wind up getting stored in the neocortex, based on the large-scale patterns of information flow and the inputs it gets in a typical human life environment. To take an obvious example, low-level visual patterns are likely to be stored in the parts of the neocortex that receive low-level visual information from the retina! ↩︎

  6. When I say “I didn’t un­der­stand” some­thing Bar­rett wrote, I mean more speci­fi­cally that I can’t see how to turn her words into a gears-level model of a com­pu­ta­tion that the brain might be do­ing. This cat­e­gory of “things I didn’t un­der­stand” in­cludes, in par­tic­u­lar, al­most ev­ery­thing she wrote about “body bud­gets”, which was a ma­jor theme of the book that came up on al­most ev­ery page… ↩︎

  7. If you want to call the sub­cor­ti­cal things “emo­tions” in­stead of “re­ac­tions”, that’s fine with me, as long as you dis­t­in­guish them from “emo­tion con­cepts” in the neo­cor­tex. Bar­rett is re­ally adamant that the word “emo­tion” must re­fer to the neo­cor­ti­cal emo­tion con­cepts, not the sub­cor­ti­cal re­ac­tions (I’m not even sure if she thinks the sub­cor­ti­cal re­ac­tions ex­ist), but for my part, I think rea­son­able peo­ple could differ, and it’s ul­ti­mately a ter­minolog­i­cal ques­tion with no right an­swers any­way. ↩︎