Multi-agent predictive minds and AI alignment

Ab­stract: An at­tempt to map a best-guess model of how hu­man val­ues and mo­ti­va­tions work to sev­eral more tech­ni­cal re­search ques­tions. The mind-model is in­spired by pre­dic­tive pro­cess­ing /​ ac­tive in­fer­ence frame­work and multi-agent mod­els of the mind.

The text has a slightly unusual epistemic structure:

1st part: my cur­rent best-guess model of how hu­man minds work.

2nd part: explores various problems which such a mind architecture would pose for some approaches to value learning. The argument is: if such a model seems at least plausible, we should probably extend the space of active research directions.

3rd part: a list of spe­cific re­search agen­das, some­times spe­cific re­search ques­tions, mo­ti­vated by the pre­vi­ous.

I put more credence in the usefulness of the research questions suggested in the third part than in the specifics of the model described in the first part. Also, you should be warned I have no formal training in cognitive neuroscience and similar fields, and it is completely possible I’m making some basic mistakes. Still, my feeling is that even if the model described in the first part is wrong, something from the broad class of “motivational systems not naturally described by utility functions” is close to reality, and understanding the problems from the 3rd part can be useful.

How minds work

As noted, this is a “best guess model”. I have large un­cer­tainty about how hu­man minds ac­tu­ally work. But if I could place just one bet, I would bet on this.

The model has two prerequisite ideas: predictive processing and the active inference framework. I’ll give brief summaries here and point elsewhere for details.

In the predictive processing / active inference framework, brains constantly predict sensory inputs, in a hierarchical generative way. As a dual, action is also “generated” by the same machinery (changing the environment to match “predicted” desirable inputs and generating action which can lead to them). The “currency” on which the whole system is running is prediction error (or something in the style of free energy, in that language).

Another important ingredient is bounded rationality, i.e. a limited amount of resources being available for cognition. Indeed, the specifics of hierarchical modelling, neural architectures, and the principle of reusing and repurposing everything all seem to be related to quite brutal optimization pressure, likely related to the brain’s enormous energy consumption. (It is unclear to me if this can also be reduced to the same “currency”. Karl Friston would probably answer “yes”.)

Assuming this whole picture, how do motivations and “values” arise? The guess is that in many cases something like a “subprogram” is modelling/tracking some variable, “predicting” its desirable state, and creating the need for action by “signalling” prediction error. Note that such subprograms can work on variables on very different hierarchical layers of modelling: e.g. tracking a simple variable like “feeling hungry” vs. tracking a variable like “social status”. Such sub-systems can be large: for example, tracking “social status” seems to require a lot of computation.
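As a minimal sketch of the “subprogram” idea, assuming nothing beyond the description above: each hypothetical sub-agent holds a “predicted” desirable value for one tracked variable and emits a weighted prediction error as its demand for action. The class name, the `precision` weight, the variable names and the numbers are all illustrative assumptions, not part of any established model.

```python
from dataclasses import dataclass


@dataclass
class SubAgent:
    """Hypothetical sub-program tracking a single variable (illustrative only)."""
    name: str
    predicted: float   # the "desirable" state this sub-agent predicts
    precision: float   # assumed weight on this sub-agent's errors

    def error_signal(self, observed: float) -> float:
        # Weighted prediction error, read here as the "need for action".
        return self.precision * abs(self.predicted - observed)


def most_urgent(agents, observations):
    """Return the sub-agent currently signalling the largest error."""
    return max(agents, key=lambda a: a.error_signal(observations[a.name]))


# Two sub-agents on very different layers of modelling.
agents = [
    SubAgent("satiety", predicted=1.0, precision=2.0),
    SubAgent("social_status", predicted=0.8, precision=1.0),
]
observations = {"satiety": 0.4, "social_status": 0.5}

winner = most_urgent(agents, observations)
print(winner.name)  # the sub-agent whose error dominates right now
```

Picking the single loudest sub-agent is only one possible aggregation rule; the later sections discuss why the real aggregation may be much messier.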

How does this relate to emotions? Emotions could be quite complex processes, where some higher-level modelling (“I see a lion”) leads to a response in lower levels connected to body states, some chemicals are released, and this interoceptive sensation is re-integrated in the higher levels in the form of an emotional state, eventually reaching consciousness. Note that the emotional signal from the body is more similar to “sensory” data: the guess is that body/low-level responses are a way for genes to insert a reward signal into the whole system.

How does this relate to our conscious experience, and stuff like Kahneman’s System 1/System 2? It seems for most people the light of consciousness is illuminating only a tiny part of the computation, and most stuff is happening in the background. Also, S1 has much larger computing power. On the other hand it seems relatively easy to “spawn background processes” from the conscious part, and it seems possible, through specialized techniques and efforts (for example, some meditation techniques), to illuminate a larger part of the background processing than is usually visible.

Another ingredient is the observation that a big part of what the conscious self is doing is interacting with other people, and rationalizing our behaviour. (Cf. press secretary theory, elephant in the brain.) It is also quite possible the relation between acting rationally and the ability to rationalize what we did is bidirectional, and a significant part of the motivation for some rational behaviour is that it is easy to rationalize it.

Also, it seems important to appreciate that the most important part of the human “environment” is other people, and what human minds are often doing is likely simulating other human minds (even simulating how other people would be simulating someone else!).

Prob­lems with pre­vailing value learn­ing approaches

While the above sketched pic­ture is just a best guess, it seems to me at least com­pel­ling. At the same time, there are no­table points of ten­sion be­tween it and at least some ap­proaches to AI al­ign­ment.

No clear dis­tinc­tion be­tween goals and beliefs

In this model, it is hardly possible to disentangle “beliefs” and “motivations” (or values). “Motivations” interface with the world only via a complex machinery of hierarchical generative models containing all other sorts of “beliefs”.

To appreciate the problems for the value learning program, consider the case of someone whose predictive/generative model strongly predicts failure and suffering. Such a person may take actions which actually lead to this outcome, minimizing the prediction error.

A less extreme but also important problem is that extrapolating “values” outside of the area of validity of generative models is problematic and could be fundamentally ill-defined. (This is related to “ontological crisis”.)

No clear self-alignment

It seems plausible the common formalism of agents with utility functions is more adequate for describing the individual “subsystems” than whole human minds. Decisions on the whole-mind level are more like results of interactions between the sub-agents; results of multi-agent interaction are not in general an object which is naturally represented by a utility function. For example, consider the sequence of game outcomes in a repeated Prisoner’s Dilemma (PD) game. If you take the sequence of game outcomes (e.g. 1: defect-defect, 2: cooperate-defect, …) as a sequence of actions, the actions do not represent some well-behaved preferences, and in general do not maximize any utility function.
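The repeated PD example can be made concrete with a toy simulation. This is a sketch under stated assumptions: the two strategies (tit-for-tat and a “suspicious” variant that opens with defection) and the consistency check are illustrative choices, and the check only covers the simplest case of a fixed strict ranking over single-round outcomes. A utility defined over whole histories could still rationalize any sequence, at the cost of being uninformative.

```python
def tit_for_tat(own_history, other_history):
    # Cooperate first, then copy the other player's previous move.
    return "C" if not other_history else other_history[-1]


def suspicious_tit_for_tat(own_history, other_history):
    # Defect first, then copy the other player's previous move.
    return "D" if not other_history else other_history[-1]


def play(strategy_a, strategy_b, rounds):
    """Run a repeated game and return the list of joint outcomes."""
    hist_a, hist_b, outcomes = [], [], []
    for _ in range(rounds):
        a = strategy_a(hist_a, hist_b)
        b = strategy_b(hist_b, hist_a)
        hist_a.append(a)
        hist_b.append(b)
        outcomes.append((a, b))
    return outcomes


def fits_fixed_strict_ranking(choices):
    # The "whole system" faces the same menu of four joint outcomes every
    # round. A chooser maximizing one fixed strict ranking over that menu
    # would pick the same outcome each time.
    return len(set(choices)) == 1


outcomes = play(tit_for_tat, suspicious_tit_for_tat, rounds=6)
print(outcomes)
print(fits_fixed_strict_ranking(outcomes))
```

Read as the “actions” of a single agent, the outcomes keep switching between cooperate-defect and defect-cooperate, even though each sub-agent individually follows a simple, fixed rule.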

Note: This is not to claim VNM ra­tio­nal­ity is use­less—it still has the nor­ma­tive power—and some types of in­ter­ac­tion lead hu­mans to ap­prox­i­mate SEU op­ti­miz­ing agents bet­ter.

One case is when mainly one specific subsystem (sub-agent) is in control, and the decision does not pass through overly complex generative modelling. So, we should expect more VNM-like behaviour in experiments in narrow domains than in cases where very different sub-agents are engaged and disagree.

Another case is when sub-agents are able to do some “social welfare function” style aggregation, bargain, or trade. The result could be more VNM-like, at least at specific points in time, with the caveat that such a “point” aggregate function may not be preserved over time.

Conversely, cases where the resulting behaviour is very different from VNM-like may be caused by sub-agents locked in some non-cooperative Nash equilibria.

What we are al­ign­ing AI with

Given this dis­tinc­tion be­tween the whole mind and sub-agents, there are at least four some­what differ­ent no­tions of what al­ign­ment can mean.

1. Align­ment with the out­puts of the gen­er­a­tive mod­els, with­out query­ing the hu­man. This in­cludes for ex­am­ple pro­pos­als cen­tered around ap­proval. In this case, gen­er­ally only the out­put of the in­ter­nal ag­gre­ga­tion has some voice.

2. Alignment with the outputs of the generative models, with querying the human. This includes for example CIRL and similar approaches. The problematic part is that, by carefully crafted queries, it is possible to give voice to different sub-agent-like systems (or with more nuance, give them very different power in the aggregation process). If the internal human system is not self-aligned, the results could be quite arbitrary (and the AI agent has a lot of power to manipulate).

3. Alignment with the whole system, including the human aggregation process itself. This could include for example some deep NN based black-box trained on a large amount of human data, predicting what the human would want (or approve).

4. Adding layers of indirection to the question, such as defining alignment as a state where “A is trying to do what H wants it to do.”

In prac­tice, op­tions 1. and 2. can col­lapse into one, as far as there is some feed­back loop be­tween the AI agent ac­tions and the hu­man re­ward sig­nal. (Even in case 1, the agent can take an ac­tion with the in­ten­tion to elicit feed­back from some sub­part.)

We can con­struct a rich space of var­i­ous mean­ings of “al­ign­ment” by com­bin­ing ba­sic di­rec­tions.

Now, we can an­a­lyze how these op­tions in­ter­act with var­i­ous al­ign­ment re­search pro­grams.

Prob­a­bly the most in­ter­est­ing case is IDA. IDA-like schemes can prob­a­bly carry for­ward ar­bi­trary prop­er­ties to more pow­er­ful sys­tems, as long as we are able to con­struct the in­di­vi­d­ual step pre­serv­ing the prop­erty. (I.e. one full cy­cle of dis­til­la­tion and am­plifi­ca­tion, which can be ar­bi­trar­ily small).

Distilling and amplifying alignment in sense #1 (what the human will actually approve) is conceptually easiest, but, unfortunately, brings some of the problems of a potentially superhuman system optimizing for manipulating the human for approval.

Alignment in sense #3 creates a very different set of problems. One obvious risk is mind-crime. A more subtle risk is related to the fact that as the implicit model of human “wants” scales (becomes less bounded), (i) the parts may scale at different rates, and (ii) the outcome equilibria may change even if the sub-parts scale at the same rate.

Align­ment in sense #4 seems more vague, and moves the bur­den of un­der­stand­ing the prob­lem in part to the side of the AI. We can imag­ine that at the end the AI will be al­igned with some part of the hu­man mind in a self-con­sis­tent way (the part will be a fixed point of the al­ign­ment struc­ture). Un­for­tu­nately, it is a pri­ori un­clear if a unique fixed point ex­ists. If not, the prob­lems be­come similar to case #2. Also, it seems in­evitable the AI will need to con­tain some struc­ture rep­re­sent­ing what the hu­man wants the AI to do, which may cause prob­lems similar to #3.

Also, in com­par­i­son with other mean­ings, it is much less clear to me how to even es­tab­lish some sys­tem has this prop­erty.

Rider-cen­tric and meme-cen­tric alignment

Many alignment proposals seem to focus on interacting just with the conscious, narrating and rationalizing part of the mind. If this is just one part entangled in some complex interaction with other parts, there are specific reasons why this may be problematic.

One: the “rider” (from the rider/elephant metaphor) is the part highly engaged with tracking societal rules, interactions and memes. If so, it seems plausible the “values” learned from it will be mostly aligned with societal norms and the interests of memeplexes, and not “fully human”.

This is worrisome: from a meme-centric perspective, humans are just a substrate, and not necessarily the best one. Also, a more speculative problem: schemes learning the human memetic landscape and “supercharging” it with superhuman performance may create some hard-to-predict evolutionary optimization processes.

Me­taprefer­ences and multi-agent alignment

In­di­vi­d­ual “prefer­ences” can of­ten in fact be mostly a meta-prefer­ence to have prefer­ences com­pat­i­ble with other peo­ple, based on simu­la­tions of such peo­ple.

This may make it surprisingly hard to infer human values by trying to learn what individual humans want without the social context (necessitating inverting several layers of simulation). If this is the case, the whole approach of extracting individual preferences from a single human could be problematic. (This is probably more relevant to some “prosaic” alignment problems.)


Some of the above-mentioned points of disagreement point toward specific ways in which some of the existing approaches to value alignment may fail. Several illustrative examples:

  • Internal conflict may lead to inaction (and also to not expressing approval or disapproval). While many existing approaches represent such a situation only by the outcome of the conflict, the internal experience of the human seems to be quite different with and without the conflict.

  • Difficulty with split­ting “be­liefs” and “mo­ti­va­tions”.

  • Learn­ing in­ad­e­quate so­cietal equil­ibria and op­ti­miz­ing on them.


On the pos­i­tive side, it could be ex­pected the sub-agents still eas­ily agree on things like “it is bet­ter not to die a hor­rible death”.

Also, the mind-model with bounded sub-agents which in­ter­act only with their lo­cal neigh­bor­hood and do not ac­tu­ally care about the world may be a vi­able de­sign from the safety per­spec­tive.

Suggested tech­ni­cal re­search directions

While the pre­vi­ous parts are more in back­ward-chain­ing mode, here I at­tempt to point to­ward more con­crete re­search agen­das and ques­tions where we can plau­si­bly im­prove our un­der­stand­ing ei­ther by de­vel­op­ing the­ory, or ex­per­i­ment­ing with toy mod­els based on cur­rent ML tech­niques.

Often it may be the case that some research was already done on the topic, just not with AI alignment in mind, and high-value work could consist of “importing the knowledge” into the safety community.

Un­der­stand­ing hi­er­ar­chi­cal mod­el­ling.

It seems plausible the human hierarchical models of the world optimize some “boundedly rational” function. (Remembering all details is too expensive; too much coarse-graining decreases usefulness. A good bounded rationality model can work as a principle for how to select models, in a similar way to the minimum description length principle, just taking some more “human” (energy?) costs as the cost function.)

In­verse Game The­ory.

In­vert­ing agent mo­ti­va­tions in MDPs is a differ­ent prob­lem from in­vert­ing mo­ti­va­tions in multi-agent situ­a­tions where game-the­ory style in­ter­ac­tions oc­cur. This leads to the in­verse game the­ory prob­lem: ob­serve the in­ter­ac­tions, learn the ob­jec­tives.
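A toy version of the inverse problem shows why it is hard. The sketch below assumes a deliberately tiny setting that is not from any particular paper: two players, two actions each, payoffs restricted to 0 or 1, and a single observed action profile that is assumed to be a pure-strategy Nash equilibrium. It then counts how many candidate games are consistent with that one observation. All function names are illustrative.

```python
from itertools import product

ACTIONS = (0, 1)


def is_pure_nash(payoff_a, payoff_b, profile):
    """Check a pure Nash equilibrium.

    payoff_x[i][j] is the payoff to player x when A plays i and B plays j.
    """
    i, j = profile
    a_cannot_improve = all(payoff_a[i][j] >= payoff_a[k][j] for k in ACTIONS)
    b_cannot_improve = all(payoff_b[i][j] >= payoff_b[i][k] for k in ACTIONS)
    return a_cannot_improve and b_cannot_improve


def candidate_games(values=(0, 1)):
    """Enumerate every 2x2 two-player game with payoffs drawn from `values`."""
    for flat in product(values, repeat=8):
        payoff_a = (flat[0:2], flat[2:4])
        payoff_b = (flat[4:6], flat[6:8])
        yield payoff_a, payoff_b


# Observation: both players were seen choosing action 1.
observed = (1, 1)
consistent = [g for g in candidate_games() if is_pure_nash(*g, observed)]
print(len(consistent), "of", 2 ** 8, "candidate games fit the observation")
```

Even in this minimal setting most candidate objectives remain consistent with the observed behaviour, so additional observations or stronger priors over payoffs would be needed to narrow down what the players actually want.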

Learn­ing from mul­ti­ple agents.

Imagine a group of five closely interacting humans. Learning values just from person A may run into the problem that a big part of A’s motivation is based on A simulating B, C, D and E (on the same “human” hardware, just incorporating individual differences). In that case, learning the “values” just from A’s actions could in principle be more difficult than observing the whole group and trying to learn some “human universals” and some “human specifics”. A different way of thinking about this could be to draw a parallel with meta-learning algorithms (e.g. REPTILE), but in an IRL frame.

What hap­pens if you put a sys­tem com­posed of sub-agents un­der op­ti­miza­tion pres­sure?

It is not clear to me what would hap­pen if you, for ex­am­ple, suc­cess­fully “learn” such a sys­tem of “mo­ti­va­tions” from a hu­man, and then put it in­side of some op­ti­miza­tion pro­cess se­lect­ing for VNM-like ra­tio­nal be­havi­our.

It seems plausible the somewhat messy system will be forced to become more internally aligned; for example, one way this can happen is that one of the sub-agent systems takes control and “wipes out the opposition”.

What hap­pens if you make a sys­tem com­posed of sub-agents less com­pu­ta­tion­ally bounded?

It is not clear that the relative powers of sub-agents will scale the same way as the whole system becomes less computationally bounded. (This is related to MIRI’s sub-agents agenda.)

Suggested non-tech­ni­cal re­search directions

Hu­man self-al­ign­ment.

All other things being equal, it seems safer to try to align AI with humans who are self-aligned.

Notes & Dis­cus­sion


Part of my motivation for writing this was an annoyance: there are plenty of reasons to believe that the view

  • hu­man mind is a unified whole,

  • at first ap­prox­i­ma­tion op­ti­miz­ing some util­ity func­tion,

  • this util­ity is over world-states,

is neither a good model of humans, nor the best model for thinking about AI. Yet, it is the paradigm shaping a lot of thoughts and research. I hope that if the annoyance surfaced in the text, it is not too distracting.

Multi-part minds in literature

There are dozens of schemes describing the mind as some sort of multi-part system, so there is nothing original about this claim. Based on a very shallow review, it seems psychologists often conceptualize the sub-agents as subpersonalities, which are almost fully human. This seems to err on the side of sub-agents being too complex, and of anthropomorphising instead of trying to describe them formally. (Explaining humans as a composition of humans is not very useful for AI alignment.) On the other hand, Minsky’s Society of Mind has sub-agents which often seem to be too simple (e.g. similar in complexity to individual logic gates). If there is some literature that gets sub-agent complexity right, and places sub-agents inside predictive processing, I’d be really excited about it!


When discussing the draft, several friends noted something along the lines of: “It is overdetermined that approaches like IRL are doomed. There are many reasons for that and the research community is aware of them”. To some extent, I agree this is the case. On the other hand, 1. the described model of mind may pose problems even for more sophisticated approaches, and 2. my impression is that many people still have something like a utility-maximizing agent as the central example.

The complementary objection is that while interacting sub-agents may be a more precise model, in practice it is often good enough to think about humans as unified agents, and this may be good enough even for the purpose of AI alignment. My intuition on this is based on the connection of rationality to exploitability: it seems humans are usually more rational and less exploitable when thinking about narrow domains, but can be quite bad when vastly different subsystems are in play. (Imagine on one side a person exchanging stock and money, on the other side a person exchanging units of money, free time, friendship, etc. In the second case, many people are willing to trade at very different rates in different situations.)
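The link between inconsistent exchange rates and exploitability can be shown with a few lines of arithmetic. This is a hedged toy example: the person, the two “contexts” and the specific prices are hypothetical, chosen only to show the standard money-pump pattern.

```python
def run_cycle(money, hours, sell_price, buy_price):
    """One cycle against a hypothetical person with context-dependent rates.

    In context 1 the person sells one hour of free time for `sell_price`;
    in context 2 the person buys one hour back for `buy_price`.
    """
    money, hours = money + sell_price, hours - 1   # context 1: sells an hour
    money, hours = money - buy_price, hours + 1    # context 2: buys it back
    return money, hours


# Inconsistent rates: an hour is "worth" 10 in one context and 30 in another.
money, hours = 100, 10
for _ in range(3):
    money, hours = run_cycle(money, hours, sell_price=10, buy_price=30)
print(money, hours)  # same free time as at the start, less money

# Consistent rates: the same cycle leaves the person where they started.
print(run_cycle(100, 10, sell_price=20, buy_price=20))
```

With a single consistent rate the cycle is harmless; the loss appears only when different subsystems set the rate in different situations.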

I’d like to thank Linda Linsefors, Alexey Turchin, Tomáš Gavenčiak, Max Daniel, Ryan Carey, Rohin Shah, Owen Cotton-Barratt and others for helpful discussions. Part of this originated in the efforts of the “Hidden Assumptions” team at the 2nd AI Safety Camp, and my thoughts about how minds work are inspired by CFAR.