Humans Are Embedded Agents Too

Most models of agency (in game theory, decision theory, etc.) implicitly assume that the agent is separate from the environment: there is a “Cartesian boundary” between agent and environment. The embedded agency sequence goes through a long list of theoretical/conceptual problems which arise when an agent is instead embedded in its environment. Some examples:

  • No defined input/output channels over which to optimize

  • Agent might ac­ci­den­tally self-mod­ify, e.g. drop a rock on its head

  • Agent might in­ten­tion­ally self-mod­ify, e.g. change its own source code

  • Hard to define hy­po­thet­i­cals which don’t ac­tu­ally hap­pen, e.g. “I will kill the hostages if you don’t pay the ran­som”

  • Agent may con­tain sub­com­po­nents which op­ti­mize for differ­ent things

  • Agent is made of parts (e.g. atoms) whose be­hav­ior can be pre­dicted with­out think­ing of the agent as agenty—e.g. with­out think­ing of the agent as mak­ing choices or hav­ing beliefs

  • Agent is not log­i­cally om­ni­scient: it can­not know all the im­pli­ca­tions of its own beliefs

The em­bed­ded agency se­quence mostly dis­cusses how these is­sues cre­ate prob­lems for de­sign­ing re­li­able AI. Less dis­cussed is how these same is­sues show up when mod­el­ling hu­mans—and, in par­tic­u­lar, when try­ing to define hu­man val­ues (i.e. “what hu­mans want”). Many—ar­guably most—of the prob­lems al­ign­ment re­searchers run into when try­ing to cre­ate ro­bust poin­t­ers to hu­man val­ues are the same prob­lems we en­counter when talk­ing about em­bed­ded agents in gen­eral.

I’ll run through a bunch of ex­am­ples be­low, and tie each to a cor­re­spond­ing prob­lem-class in em­bed­ded agency. While read­ing, bear in mind that di­rectly an­swer­ing the ques­tions posed is not the point. The point is that each of these prob­lems is a symp­tom of the un­der­ly­ing is­sue: hu­mans are em­bed­ded agents. Patch­ing over each prob­lem one-by-one will pro­duce a spaghetti tower; ideally we’d tackle the prob­lem closer to the root.

The Key­board is Not The Human

Let’s imag­ine that we have an AI which com­mu­ni­cates with its hu­man op­er­a­tor via screen and key­board. It tries to figure out what the hu­man wants based on what’s typed at the key­board.

A few pos­si­ble failure modes in this setup:

  • The AI wire­heads by seiz­ing con­trol of the key­board (ei­ther in­ten­tion­ally or ac­ci­den­tally)

  • A cat walks across the key­board ev­ery now and then, and the AI doesn’t re­al­ize that this in­put isn’t from the human

  • After a code patch, the AI filters out cat-in­put, but also filters out some con­fus­ing (but im­por­tant) in­put from the human

Embed­ded agency prob­lem: hu­mans do not have well-defined out­put chan­nels. We can­not just point to a key­board and say “any in­for­ma­tion from that key­board is di­rect out­put from the hu­man”. Of course we can come up with marginally bet­ter solu­tions than a key­board—e.g. voice recog­ni­tion—but even­tu­ally we’ll run into similar is­sues. There is noth­ing in the world we can point to and say “that’s the hu­man’s out­put chan­nel, the en­tire out­put chan­nel, and noth­ing but the out­put chan­nel”. Nor does any such out­put chan­nel ex­ist, so e.g. we won’t solve the prob­lem just by hav­ing un­cer­tainty over where ex­actly the out­put chan­nel is.
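A toy sketch of the third failure mode, under loudly stated assumptions: the `Keystrokes` record, the word list, and the `looks_like_cat` heuristic are all hypothetical, invented here for illustration. The point is only that any filter defined on the keyboard stream is a guess about which bytes came from the human, and the guess can be wrong in both directions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Keystrokes:
    text: str
    source: str  # ground truth, hidden from the AI: "human" or "cat"


# Hypothetical vocabulary the patched AI treats as evidence of a human typist.
KNOWN_WORDS = {"please", "stop", "the", "reactor", "yes", "no"}


def looks_like_cat(text: str) -> bool:
    """Hypothetical patch: input with no recognizable words is treated as cat noise."""
    return not any(word in KNOWN_WORDS for word in text.lower().split())


def accepted_inputs(stream):
    """Return the inputs the AI keeps after the cat filter is applied."""
    return [k for k in stream if not looks_like_cat(k.text)]


stream = [
    Keystrokes("please stop the reactor", "human"),
    Keystrokes("asdfjkl;;;", "cat"),
    Keystrokes("SCRAM!!1", "human"),  # confusing but important
]

kept = accepted_inputs(stream)
dropped = [k for k in stream if k not in kept]

for k in dropped:
    print(f"dropped {k.text!r} (actually from: {k.source})")
```

The filter removes the cat's input, but it also removes the human's urgent and oddly typed message. Tuning the heuristic moves the error around without removing it, because the "source" field the filter is trying to recover is not part of the keyboard stream.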

Mod­ified Humans

Be­cause hu­mans are em­bed­ded in the phys­i­cal world, there is no fun­da­men­tal block to an AI mod­ify­ing us (ei­ther in­ten­tion­ally or un­in­ten­tion­ally). Define what a “hu­man” is based on some neu­ral net­work which rec­og­nizes hu­mans in images, and we risk an AI mod­ify­ing the hu­man by ex­ter­nally-in­visi­ble means rang­ing from drugs to whole­sale re­place­ment.

Embed­ded agency prob­lem: no Carte­sian bound­ary. All the hu­man-parts can be ma­nipu­lated/​mod­ified; the AI is not in a differ­ent phys­i­cal uni­verse from us.


Off-Equilibrium

Human choices can depend on off-equilibrium behavior: what we or someone else would do in a scenario which never actually happens. Game theory is full of examples, especially threats: we don’t launch our nukes because we expect our enemies would launch their nukes in response, yet what we actually expect to happen is for nobody to launch any nukes. Our own behavior is determined by “possibilities” which we don’t actually expect to happen, and which may not even be possible. Embedded agency problem: counterfactuals.
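The deterrence example can be sketched as a two-move game. This is a minimal illustration, not a model of any real situation: the payoff numbers and the names `PAYOFF_A`, `best_move_for_a`, and `realized_path` are assumptions made up for the sketch. Player A picks a best response to what B would do after an attack, and we then look at which moves actually get played.

```python
# Hypothetical payoffs to player A only; the numbers are illustrative.
PAYOFF_A = {
    ("hold", None): 0,
    ("attack", "retaliate"): -100,
    ("attack", "yield"): 10,
}


def best_move_for_a(b_response_to_attack: str) -> str:
    """A's best response, given what B WOULD do if attacked."""
    attack_value = PAYOFF_A[("attack", b_response_to_attack)]
    hold_value = PAYOFF_A[("hold", None)]
    return "attack" if attack_value > hold_value else "hold"


def realized_path(b_response_to_attack: str) -> tuple:
    """The sequence of moves that actually happens."""
    a_move = best_move_for_a(b_response_to_attack)
    if a_move == "hold":
        return ("hold",)  # B's response node is never reached
    return ("attack", b_response_to_attack)


print(realized_path("retaliate"))
print(realized_path("yield"))
```

When B would retaliate, the realized path contains no retaliation at all, yet changing only that never-executed response flips A's move. An observer who sees only the realized path cannot explain A's choice without reasoning about the unreached branch.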

Go­ing even fur­ther: our val­ues them­selves can de­pend on coun­ter­fac­tu­als. My en­joy­ment of a meal some­times de­pends on what the al­ter­na­tives were, even when the meal is my top pick—I’m hap­pier if I didn’t pass up some­thing nearly-as-good. We’re of­ten un­happy to be forced into a choice, even if it’s a choice we would have made any­way. What does it mean to “have a choice”, in the sense that mat­ters for hu­man val­ues? How do we phys­i­cally ground that con­cept? If we want a friendly AI to al­low us choices, rather than force us to do what’s best for us, then we need an­swers to ques­tions like these.
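One way to see why this is awkward for standard models: a utility function over outcomes alone cannot express it, because the same outcome gets different values depending on the menu. The sketch below assumes a made-up regret rule and made-up numbers (`BASE_ENJOYMENT`, `regret_weight`, `near_margin` are all hypothetical); it is not a claim about how human enjoyment actually works.

```python
# Hypothetical stand-alone enjoyment of each meal.
BASE_ENJOYMENT = {"ramen": 10.0, "pizza": 9.5, "salad": 4.0}


def enjoyment(choice, menu, regret_weight=0.5, near_margin=1.0):
    """Enjoyment of `choice`, reduced if a nearly-as-good option was passed up.

    The value depends on the menu (the counterfactual alternatives),
    not only on the meal that was actually eaten.
    """
    alternatives = [BASE_ENJOYMENT[m] for m in menu if m != choice]
    penalty = 0.0
    if alternatives:
        best_alternative = max(alternatives)
        if BASE_ENJOYMENT[choice] - best_alternative <= near_margin:
            penalty = regret_weight * best_alternative
    return BASE_ENJOYMENT[choice] - penalty


easy_choice = enjoyment("ramen", ["ramen", "salad"])
hard_choice = enjoyment("ramen", ["ramen", "pizza"])
print(easy_choice, hard_choice)
```

The meal eaten is identical in both calls; only the unchosen alternative differs. Any AI that scores outcomes without representing "what else was on offer" has no place to put this difference, which is one concrete form of the grounding question above.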


Drinking

Humans have different preferences while drunk than while sober [CITATION NEEDED]. When pointing an AI at “human values”, it’s tempting to simply say “don’t count decisions made while drunk”. But on the other hand, people often drink to intentionally lower their own inhibitions, suggesting that, at a meta-level, they want to self-modify into making low-inhibition decisions (at least temporarily, and within some context, e.g. at a party).

Embed­ded agency prob­lem: self-mod­ifi­ca­tion and ro­bust del­e­ga­tion. When a hu­man in­ten­tion­ally self-mod­ifies, to what ex­tent should their pre­vi­ous val­ues be hon­ored, to what ex­tent their new val­ues, and to what ex­tent their fu­ture val­ues?

Value Drift

Hu­mans gen­er­ally have differ­ent val­ues in child­hood, mid­dle age, and old age. Heck, hu­mans have differ­ent val­ues just from be­ing hangry! Sup­pose a hu­man makes a pre­com­mit­ment, and then later on, their val­ues drift—the pre­com­mit­ment be­comes a non­triv­ial con­straint, push­ing them to do some­thing they no longer wish to do. How should a friendly AI han­dle that pre­com­mit­ment?

Embed­ded agency prob­lem: tiling & del­e­ga­tion failures. As hu­mans prop­a­gate through time, our val­ues are not sta­ble, even in the ab­sence of in­ten­tional self-mod­ifi­ca­tion. Un­like in the AI case, we can’t just de­sign hu­mans to have more sta­ble val­ues. (Or can we? Would that even be de­sir­able?)


Akrasia

Humans have subsystems. Those subsystems do not always want the same things. Stated preferences and revealed preferences do not generally match. Akrasia exists; many people indulge in clicker games no matter how much some other part of themselves wishes they could be more productive.

Embedded agency problem: subsystem alignment. Human subsystems are not all aligned all the time. Unlike in the AI case, we can’t just design humans to have better-aligned subsystems: first we’d need to decide what to align them to, and it’s not obvious that any one particular subsystem contains the human’s “true” values.

Prefer­ences Over Quan­tum Fields

Hu­mans gen­er­ally don’t have prefer­ences over quan­tum fields di­rectly. The things we value are ab­stract, high-level ob­jects and no­tions. Embed­ded agency prob­lem: multi-level world mod­els. How do we take the ab­stract ob­jects/​no­tions over which hu­man val­ues op­er­ate, and tie them back to phys­i­cal ob­serv­ables?

At the same time, our values ultimately need to be grounded in quantum fields, because that’s what the world is made of. Human values should not cease to exist just because the world turned out to be quantum when we thought it was classical. It all adds up to normality. Embedded agency problem: ontological crises. How do we ensure that a friendly AI can still point to human values even if its model of the world fundamentally shifts?

Un­re­al­ized Implications

I have, on at least one oc­ca­sion, com­pletely switched a poli­ti­cal po­si­tion in about half an hour af­ter hear­ing an ar­gu­ment I had not pre­vi­ously con­sid­ered. More gen­er­ally, we hu­mans tend to up­date our be­liefs, our strate­gies, and what-we-be­lieve-to-be-our-val­ues as new im­pli­ca­tions are re­al­ized.

Embed­ded agency prob­lem: log­i­cal non-om­ni­science. We do not un­der­stand the full im­pli­ca­tions of what we know, and some­times we base our de­ci­sions/​strate­gies/​what-we-be­lieve-to-be-our-val­ues on flawed logic. How is a friendly AI to rec­og­nize and han­dle such cases?

So­cially Strate­gic Self-Modification

Be­cause hu­mans are all em­bed­ded in one phys­i­cal world, ly­ing is hard. There are side-chan­nels which leak in­for­ma­tion, and hu­mans have long since evolved to pay at­ten­tion to those side-chan­nels. One side effect: the eas­iest way to “de­ceive” oth­ers is to de­ceive one­self, via self-mod­ifi­ca­tion. Embed­ded agency prob­lem: co­or­di­na­tion with visi­ble source code, plus self-mod­ifi­ca­tion.

We earnestly adopt both the be­liefs and val­ues of those around us. Are those our “true” val­ues? How should a friendly AI treat val­ues adopted due to so­cial pres­sure? More gen­er­ally, how should a friendly AI han­dle hu­man self-mod­ifi­ca­tions driven by so­cial pres­sure?

Com­bin­ing this with ear­lier ex­am­ples: per­haps we spend an evening drunk be­cause it gives us a so­cially-vi­able ex­cuse to do what­ever we wanted to do any­way. Then the next day, we bow to so­cial pres­sure and earnestly re­gret our ac­tions of the pre­vi­ous night—or at least some of our sub­sys­tems do. Other sub­sys­tems still had fun while drunk, and we do the same thing the next week­end. What is a friendly AI to make of this? Where, in this mess, are the hu­mans’ “val­ues”?

These are the sorts of shenanigans one has to handle when dealing with embedded agents, and I expect that a better understanding of embedded agents in general will lead to substantial insights about the nature of human values.