# Embedded Agency (full-text version)

Sup­pose you want to build a robot to achieve some real-world goal for you—a goal that re­quires the robot to learn for it­self and figure out a lot of things that you don’t already know.

There’s a com­pli­cated en­g­ineer­ing prob­lem here. But there’s also a prob­lem of figur­ing out what it even means to build a learn­ing agent like that. What is it to op­ti­mize re­al­is­tic goals in phys­i­cal en­vi­ron­ments? In broad terms, how does it work?

In this post, I’ll point to four ways we don’t cur­rently know how it works, and four ar­eas of ac­tive re­search aimed at figur­ing it out.

### 1. Embed­ded agents

This is Alexei, and Alexei is play­ing a video game.

Like most games, this game has clear in­put and out­put chan­nels. Alexei only ob­serves the game through the com­puter screen, and only ma­nipu­lates the game through the con­trol­ler.

The game can be thought of as a func­tion which takes in a se­quence of but­ton presses and out­puts a se­quence of pix­els on the screen.

Alexei is also very smart, and ca­pa­ble of hold­ing the en­tire video game in­side his mind. If Alexei has any un­cer­tainty, it is only over em­piri­cal facts like what game he is play­ing, and not over log­i­cal facts like which in­puts (for a given de­ter­minis­tic game) will yield which out­puts. This means that Alexei must also store in­side his mind ev­ery pos­si­ble game he could be play­ing.

Alexei does not, how­ever, have to think about him­self. He is only op­ti­miz­ing the game he is play­ing, and not op­ti­miz­ing the brain he is us­ing to think about the game. He may still choose ac­tions based off of value of in­for­ma­tion, but this is only to help him rule out pos­si­ble games he is play­ing, and not to change the way in which he thinks.

In fact, Alexei can treat him­self as an un­chang­ing in­di­visi­ble atom. Since he doesn’t ex­ist in the en­vi­ron­ment he’s think­ing about, Alexei doesn’t worry about whether he’ll change over time, or about any sub­rou­tines he might have to run.

No­tice that all the prop­er­ties I talked about are par­tially made pos­si­ble by the fact that Alexei is cleanly sep­a­rated from the en­vi­ron­ment that he is op­ti­miz­ing.

This is Emmy. Emmy is play­ing real life.

Real life is not like a video game. The differ­ences largely come from the fact that Emmy is within the en­vi­ron­ment that she is try­ing to op­ti­mize.

Alexei sees the uni­verse as a func­tion, and he op­ti­mizes by choos­ing in­puts to that func­tion that lead to greater re­ward than any of the other pos­si­ble in­puts he might choose. Emmy, on the other hand, doesn’t have a func­tion. She just has an en­vi­ron­ment, and this en­vi­ron­ment con­tains her.

Emmy wants to choose the best pos­si­ble ac­tion, but which ac­tion Emmy chooses to take is just an­other fact about the en­vi­ron­ment. Emmy can rea­son about the part of the en­vi­ron­ment that is her de­ci­sion, but since there’s only one ac­tion that Emmy ends up ac­tu­ally tak­ing, it’s not clear what it even means for Emmy to “choose” an ac­tion that is bet­ter than the rest.

Alexei can poke the uni­verse and see what hap­pens. Emmy is the uni­verse pok­ing it­self. In Emmy’s case, how do we for­mal­ize the idea of “choos­ing” at all?

To make mat­ters worse, since Emmy is con­tained within the en­vi­ron­ment, Emmy must also be smaller than the en­vi­ron­ment. This means that Emmy is in­ca­pable of stor­ing ac­cu­rate de­tailed mod­els of the en­vi­ron­ment within her mind.

This causes a prob­lem: Bayesian rea­son­ing works by start­ing with a large col­lec­tion of pos­si­ble en­vi­ron­ments, and as you ob­serve facts that are in­con­sis­tent with some of those en­vi­ron­ments, you rule them out. What does rea­son­ing look like when you’re not even ca­pa­ble of stor­ing a sin­gle valid hy­poth­e­sis for the way the world works? Emmy is go­ing to have to use a differ­ent type of rea­son­ing, and make up­dates that don’t fit into the stan­dard Bayesian frame­work.

Since Emmy is within the en­vi­ron­ment that she is ma­nipu­lat­ing, she is also go­ing to be ca­pa­ble of self-im­prove­ment. But how can Emmy be sure that as she learns more and finds more and more ways to im­prove her­self, she only changes her­self in ways that are ac­tu­ally helpful? How can she be sure that she won’t mod­ify her origi­nal goals in un­de­sir­able ways?

Fi­nally, since Emmy is con­tained within the en­vi­ron­ment, she can’t treat her­self like an atom. She is made out of the same pieces that the rest of the en­vi­ron­ment is made out of, which is what causes her to be able to think about her­self.

In ad­di­tion to haz­ards in her ex­ter­nal en­vi­ron­ment, Emmy is go­ing to have to worry about threats com­ing from within. While op­ti­miz­ing, Emmy might spin up other op­ti­miz­ers as sub­rou­tines, ei­ther in­ten­tion­ally or un­in­ten­tion­ally. Th­ese sub­sys­tems can cause prob­lems if they get too pow­er­ful and are un­al­igned with Emmy’s goals. Emmy must figure out how to rea­son with­out spin­ning up in­tel­li­gent sub­sys­tems, or oth­er­wise figure out how to keep them weak, con­tained, or al­igned fully with her goals.

#### 1.1. Dual­is­tic agents

Emmy is con­fus­ing, so let’s go back to Alexei. Mar­cus Hut­ter’s AIXI frame­work gives a good the­o­ret­i­cal model for how agents like Alexei work:

The model has an agent and an environment that interact using actions, observations, and rewards. The agent sends out an action $a$, and then the environment sends out both an observation $o$ and a reward $r$. This process repeats at each time $k$.

Each ac­tion is a func­tion of all the pre­vi­ous ac­tion-ob­ser­va­tion-re­ward triples. And each ob­ser­va­tion and re­ward is similarly a func­tion of these triples and the im­me­di­ately pre­ced­ing ac­tion.
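
As a minimal sketch of this interaction loop (not Hutter's formalism itself), the agent and the environment can be written as two separate functions that only communicate through the history of triples. The names `run_interaction`, `toy_agent`, and `toy_environment` are hypothetical, chosen only for illustration.

```python
from typing import Callable, List, Tuple

# A history is the list of (action, observation, reward) triples so far.
Triple = Tuple[str, str, float]
History = List[Triple]


def run_interaction(
    agent: Callable[[History], str],
    environment: Callable[[History, str], Tuple[str, float]],
    steps: int,
) -> History:
    """Run the dualistic loop: the agent acts, then the environment responds."""
    history: History = []
    for _ in range(steps):
        # The action depends only on the previous triples.
        action = agent(history)
        # The observation and reward depend on the previous triples
        # and on the immediately preceding action.
        observation, reward = environment(history, action)
        history.append((action, observation, reward))
    return history


def toy_environment(history: History, action: str) -> Tuple[str, float]:
    """Hypothetical game: pressing "a" pays 1, anything else pays 0."""
    reward = 1.0 if action == "a" else 0.0
    return ("screen-%d" % len(history), reward)


def toy_agent(history: History) -> str:
    """Hypothetical agent: try "b", then "a", then repeat the best-paying action."""
    if not history:
        return "b"
    if len(history) == 1:
        return "a"
    best = max(history, key=lambda triple: triple[2])
    return best[0]


trace = run_interaction(toy_agent, toy_environment, steps=4)
total_reward = sum(reward for _, _, reward in trace)
```

Notice that nothing in `toy_environment` can reach inside `toy_agent`. That clean separation is exactly the dualistic assumption.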

You can imagine an agent in this framework that has full knowledge of the environment that it’s interacting with. However, AIXI is used to model optimization under uncertainty about the environment. AIXI has a distribution over all possible computable environments $q$, and chooses actions that lead to a high expected reward under this distribution. Since it also cares about future reward, this may lead to exploring for value of information.
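
Here is a rough one-step sketch of "choose the action with high expected reward under a distribution over environments". It is only a finite toy: the three candidate environments and their weights are made-up assumptions, whereas AIXI's actual distribution ranges over all computable environments and its planning looks many steps ahead.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical finite stand-in for "all computable environments":
# each candidate maps an action to a reward and carries a prior weight.
Environment = Callable[[str], float]

candidates: List[Tuple[float, Environment]] = [
    (0.6, lambda a: 1.0 if a == "left" else 0.0),
    (0.3, lambda a: 0.0 if a == "left" else 1.0),
    (0.1, lambda a: 0.0 if a == "left" else 5.0),
]


def expected_reward(action: str) -> float:
    """Average the reward over candidate environments, weighted by the prior."""
    return sum(weight * env(action) for weight, env in candidates)


actions = ["left", "right"]
scores: Dict[str, float] = {a: expected_reward(a) for a in actions}
choice = max(actions, key=expected_reward)
```

Even though the single most likely environment favors "left", the weighted average favors "right", because an unlikely environment pays a lot for it.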

Un­der some as­sump­tions, we can show that AIXI does rea­son­ably well in all com­putable en­vi­ron­ments, in spite of its un­cer­tainty. How­ever, while the en­vi­ron­ments that AIXI is in­ter­act­ing with are com­putable, AIXI it­self is un­com­putable. The agent is made out of a differ­ent sort of stuff, a more pow­er­ful sort of stuff, than the en­vi­ron­ment.

We will call agents like AIXI and Alexei “du­al­is­tic.” They ex­ist out­side of their en­vi­ron­ment, with only set in­ter­ac­tions be­tween agent-stuff and en­vi­ron­ment-stuff. They re­quire the agent to be larger than the en­vi­ron­ment, and don’t tend to model self-refer­en­tial rea­son­ing, be­cause the agent is made of differ­ent stuff than what the agent rea­sons about.

AIXI is not alone. Th­ese du­al­is­tic as­sump­tions show up all over our cur­rent best the­o­ries of ra­tio­nal agency.

I set up AIXI as a bit of a foil, but AIXI can also be used as in­spira­tion. When I look at AIXI, I feel like I re­ally un­der­stand how Alexei works. This is the kind of un­der­stand­ing that I want to also have for Emmy.

Un­for­tu­nately, Emmy is con­fus­ing. When I talk about want­ing to have a the­ory of “em­bed­ded agency,” I mean I want to be able to un­der­stand the­o­ret­i­cally how agents like Emmy work. That is, agents that are em­bed­ded within their en­vi­ron­ment and thus:

• do not have well-defined i/​o chan­nels;

• are smaller than their en­vi­ron­ment;

• are able to rea­son about them­selves and self-im­prove;

• and are made of parts similar to the en­vi­ron­ment.

You shouldn’t think of these four com­pli­ca­tions as a par­ti­tion. They are very en­tan­gled with each other.

For ex­am­ple, the rea­son the agent is able to self-im­prove is be­cause it is made of parts. And any time the en­vi­ron­ment is suffi­ciently larger than the agent, it might con­tain other copies of the agent, and thus de­stroy any well-defined i/​o chan­nels.

How­ever, I will use these four com­pli­ca­tions to in­spire a split of the topic of em­bed­ded agency into four sub­prob­lems. Th­ese are: de­ci­sion the­ory, em­bed­ded world-mod­els, ro­bust del­e­ga­tion, and sub­sys­tem al­ign­ment.

#### 1.2. Embed­ded subproblems

De­ci­sion the­ory is all about em­bed­ded op­ti­miza­tion.

The simplest model of dualistic optimization is $\mathrm{argmax}$. $\mathrm{argmax}$ takes in a function from actions to rewards, and returns the action which leads to the highest reward under this function. Most optimization can be thought of as some variant on this. You have some space; you have a function from this space to some score, like a reward or utility; and you want to choose an input that scores highly under this function.
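
In code, this is about as small as an optimizer gets. The following is a sketch with hypothetical names (`argmax`, `bills`); the point is only that the reward function is handed to the optimizer from outside.

```python
from typing import Callable, Iterable, TypeVar

A = TypeVar("A")


def argmax(actions: Iterable[A], reward: Callable[[A], float]) -> A:
    """Return the action with the highest reward under the given function."""
    return max(actions, key=reward)


# Hypothetical example: choose which bill to take.
bills = [5, 10]
best = argmax(bills, reward=lambda dollars: float(dollars))
```

The type signature already carries the dualistic assumption: `reward` is a function of the action alone, and the optimizer is not one of the things it depends on.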

But we just said that a large part of what it means to be an em­bed­ded agent is that you don’t have a func­tional en­vi­ron­ment. So now what do we do? Op­ti­miza­tion is clearly an im­por­tant part of agency, but we can’t cur­rently say what it is even in the­ory with­out mak­ing ma­jor type er­rors.

Some ma­jor open prob­lems in de­ci­sion the­ory in­clude:

• log­i­cal coun­ter­fac­tu­als: how do you rea­son about what would hap­pen if you take ac­tion B, given that you can prove that you will in­stead take ac­tion A?

• en­vi­ron­ments that in­clude mul­ti­ple copies of the agent, or trust­wor­thy pre­dic­tions of the agent.

• log­i­cal up­date­less­ness, which is about how to com­bine the very nice but very Bayesian world of Wei Dai’s up­date­less de­ci­sion the­ory, with the much less Bayesian world of log­i­cal un­cer­tainty.

Embedded world-models is about how you can make good models of the world that are able to fit within an agent that is much smaller than the world.

This has proven to be very difficult—first, be­cause it means that the true uni­verse is not in your hy­poth­e­sis space, which ru­ins a lot of the­o­ret­i­cal guaran­tees; and sec­ond, be­cause it means we’re go­ing to have to make non-Bayesian up­dates as we learn, which also ru­ins a bunch of the­o­ret­i­cal guaran­tees.

It is also about how to make world-mod­els from the point of view of an ob­server on the in­side, and re­sult­ing prob­lems such as an­throp­ics. Some ma­jor open prob­lems in em­bed­ded world-mod­els in­clude:

• log­i­cal un­cer­tainty, which is about how to com­bine the world of logic with the world of prob­a­bil­ity.

• multi-level mod­el­ing, which is about how to have mul­ti­ple mod­els of the same world at differ­ent lev­els of de­scrip­tion, and tran­si­tion nicely be­tween them.

• on­tolog­i­cal crises, which is what to do when you re­al­ize that your model, or even your goal, was speci­fied us­ing a differ­ent on­tol­ogy than the real world.

Ro­bust del­e­ga­tion is all about a spe­cial type of prin­ci­pal-agent prob­lem. You have an ini­tial agent that wants to make a more in­tel­li­gent suc­ces­sor agent to help it op­ti­mize its goals. The ini­tial agent has all of the power, be­cause it gets to de­cide ex­actly what suc­ces­sor agent to make. But in an­other sense, the suc­ces­sor agent has all of the power, be­cause it is much, much more in­tel­li­gent.

From the point of view of the ini­tial agent, the ques­tion is about cre­at­ing a suc­ces­sor that will ro­bustly not use its in­tel­li­gence against you. From the point of view of the suc­ces­sor agent, the ques­tion is about, “How do you ro­bustly learn or re­spect the goals of some­thing that is stupid, ma­nipu­la­ble, and not even us­ing the right on­tol­ogy?”

There are ex­tra prob­lems com­ing from the Löbian ob­sta­cle mak­ing it im­pos­si­ble to con­sis­tently trust things that are more pow­er­ful than you.

You can think about these prob­lems in the con­text of an agent that’s just learn­ing over time, or in the con­text of an agent mak­ing a sig­nifi­cant self-im­prove­ment, or in the con­text of an agent that’s just try­ing to make a pow­er­ful tool.

The ma­jor open prob­lems in ro­bust del­e­ga­tion in­clude:

• Vingean re­flec­tion, which is about how to rea­son about and trust agents that are much smarter than you, in spite of the Löbian ob­sta­cle to trust.

• value learn­ing, which is how the suc­ces­sor agent can learn the goals of the ini­tial agent in spite of that agent’s stu­pidity and in­con­sis­ten­cies.

• cor­rigi­bil­ity, which is about how an ini­tial agent can get a suc­ces­sor agent to al­low (or even help with) mod­ifi­ca­tions, in spite of an in­stru­men­tal in­cen­tive not to.

Sub­sys­tem al­ign­ment is about how to be one unified agent that doesn’t have sub­sys­tems that are fight­ing against ei­ther you or each other.

When an agent has a goal, like “sav­ing the world,” it might end up spend­ing a large amount of its time think­ing about a sub­goal, like “mak­ing money.” If the agent spins up a sub-agent that is only try­ing to make money, there are now two agents that have differ­ent goals, and this leads to a con­flict. The sub-agent might sug­gest plans that look like they only make money, but ac­tu­ally de­stroy the world in or­der to make even more money.

The prob­lem is: you don’t just have to worry about sub-agents that you in­ten­tion­ally spin up. You also have to worry about spin­ning up sub-agents by ac­ci­dent. Any time you perform a search or an op­ti­miza­tion over a suffi­ciently rich space that’s able to con­tain agents, you have to worry about the space it­self do­ing op­ti­miza­tion. This op­ti­miza­tion may not be ex­actly in line with the op­ti­miza­tion the outer sys­tem was try­ing to do, but it will have an in­stru­men­tal in­cen­tive to look like it’s al­igned.

A lot of op­ti­miza­tion in prac­tice uses this kind of pass­ing the buck. You don’t just find a solu­tion; you find a thing that is able to it­self search for a solu­tion.

In the­ory, I don’t un­der­stand how to do op­ti­miza­tion at all—other than meth­ods that look like find­ing a bunch of stuff that I don’t un­der­stand, and see­ing if it ac­com­plishes my goal. But this is ex­actly the kind of thing that’s most prone to spin­ning up ad­ver­sar­ial sub­sys­tems.

The big open prob­lem in sub­sys­tem al­ign­ment is about how to have an outer op­ti­mizer that doesn’t spin up ad­ver­sar­ial in­ner op­ti­miz­ers. You can break this prob­lem up fur­ther by con­sid­er­ing cases where the in­ner op­ti­miz­ers are ei­ther in­ten­tional or un­in­ten­tional, and con­sid­er­ing re­stricted sub­classes of op­ti­miza­tion, like in­duc­tion.

But re­mem­ber: de­ci­sion the­ory, em­bed­ded world-mod­els, ro­bust del­e­ga­tion, and sub­sys­tem al­ign­ment are not four sep­a­rate prob­lems. They’re all differ­ent sub­prob­lems of the same unified con­cept that is em­bed­ded agency.

### 2. De­ci­sion theory

Decision theory and artificial intelligence typically try to compute something resembling

$$\underset{a \,\in\, \mathrm{Actions}}{\mathrm{argmax}} \; f(a).$$

I.e., maximize some function of the action. This tends to assume that we can detangle things enough to see outcomes as a function of actions.

For ex­am­ple, AIXI rep­re­sents the agent and the en­vi­ron­ment as sep­a­rate units which in­ter­act over time through clearly defined i/​o chan­nels, so that it can then choose ac­tions max­i­miz­ing re­ward.

When the agent model is a part of the en­vi­ron­ment model, it can be sig­nifi­cantly less clear how to con­sider tak­ing al­ter­na­tive ac­tions.

For ex­am­ple, be­cause the agent is smaller than the en­vi­ron­ment, there can be other copies of the agent, or things very similar to the agent. This leads to con­tentious de­ci­sion-the­ory prob­lems such as the Twin Pri­soner’s Dilemma and New­comb’s prob­lem.

If Emmy Model 1 and Emmy Model 2 have had the same ex­pe­riences and are run­ning the same source code, should Emmy Model 1 act like her de­ci­sions are steer­ing both robots at once? Depend­ing on how you draw the bound­ary around “your­self”, you might think you con­trol the ac­tion of both copies, or only your own.
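
To make the two ways of drawing the boundary concrete, here is a small sketch of the Twin Prisoner's Dilemma. The payoff numbers are illustrative assumptions (any standard Prisoner's Dilemma payoffs would do), and the function names are hypothetical.

```python
# Payoffs to "me" in a Prisoner's Dilemma (illustrative numbers only).
# Keys are (my_move, twin_move); "C" is cooperate, "D" is defect.
PAYOFF = {
    ("C", "C"): 2, ("C", "D"): 0,
    ("D", "C"): 3, ("D", "D"): 1,
}
MOVES = ["C", "D"]


def best_move_twin_fixed(twin_move: str) -> str:
    """Boundary around one copy: the twin's move is held fixed while I vary mine."""
    return max(MOVES, key=lambda me: PAYOFF[(me, twin_move)])


def best_move_twin_mirrors() -> str:
    """Boundary around both copies: the twin plays whatever I play."""
    return max(MOVES, key=lambda me: PAYOFF[(me, me)])


only_self = {twin: best_move_twin_fixed(twin) for twin in MOVES}
both_copies = best_move_twin_mirrors()
```

Both functions are plain maximizations over the same payoff table. They disagree only about which parts of the world move when "my" action moves, and that is the part we don't know how to specify.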

Problems of adapting decision theory to embedded agents include:

• counterfactuals

• New­comblike rea­son­ing, in which the agent in­ter­acts with copies of itself

• ex­tor­tion problems

• co­or­di­na­tion problems

• log­i­cal counterfactuals

• log­i­cal updatelessness

#### 2.1. Counterfactuals

The first difficulty can be illustrated by the five-and-ten problem. Suppose we have the option of taking a five dollar bill or a ten dollar bill, and all we care about in the situation is how much money we get. Obviously, we should take the \$10.

However, it is not as easy as it seems to reliably take the \$10 when the agent knows its own behavior.

If you reason about yourself as just another part of the environment, then you can know your own action. If you can know your own action, then it becomes difficult to reason about what would happen if you took different actions.

This means an agent can stably take the \$5 because it believes “If I take the \$10, I get \$0”!

This error is coming from a confusion where we replace the intuitive counterfactual “if” with logical implication. This may seem like a silly confusion, but there is not much else we can do, because we don’t know how to formalize the counterfactual “if” correctly.

We could instead try to use probability to formalize counterfactuals, but this won’t work either. If we try to calculate the expected utility of our actions by Bayesian conditioning, as is common, knowing our own behavior leads to a divide-by-zero error when we try to calculate the expected utility of actions we know we don’t take: $\neg A$ implies $P(A) = 0$, which implies $P(B \,\&\, A) = 0$, which implies

$$P(B \mid A) = \frac{P(B \,\&\, A)}{P(A)} = \frac{0}{0}.$$

Because the agent doesn’t know how to separate itself from the environment, it gets gnashing internal gears when it tries to imagine taking different actions.

This is an instance of the problem of counterfactual reasoning: how do we evaluate hypotheticals like “What if the sun suddenly went out”? The most central example of why agents need to think about counterfactuals comes from counterfactuals about their own actions.

This is especially tricky if you already know what you’re going to do, the same way “what if the sun suddenly went out” is especially tricky if you know that it won’t, or “what if 2+2=3” is especially tricky if you know 2+2=4.

When the agent is part of the environment, it becomes difficult to distinguish reasoning about yourself from reasoning in general, so you run the risk of knowing your own action.

Why might an agent come to know its own action before it has acted? Perhaps the agent is trying to plan ahead, or reason about a game-theoretic situation in which its action has an intricate role to play.

But the biggest complication comes from Löb’s Theorem. This can be illustrated more clearly by looking at the behavior of simple logic-based agents reasoning about the five-and-ten problem.

Consider this example: We have the source code for an agent and the universe. They can refer to each other through the use of quining. The universe is simple; the universe just outputs whatever the agent outputs.

The agent spends a long time searching for proofs about what happens if it takes various actions. If for some $x$ and $y$ equal to $0$, $5$, or $10$, it finds a proof that taking the \$5 leads to $x$ utility, that taking the \$10 leads to $y$ utility, and that $x > y$, it will naturally take the \$5. We expect that it won’t find such a proof, and will instead pick the default action of taking the \$10.

It seems easy when you just imagine an agent trying to reason about the universe. Yet it turns out that if the amount of time spent searching for proofs is enough, the agent will always choose \$5!

The proof that this is so is by Löb’s theorem. Löb’s theorem says that, for any proposition $P$, if you can prove that a proof of $P$ would imply the truth of $P$, then you can prove $P$. In symbols, with “$\Box X$” meaning “$X$ is provable”:

$$\Box(\Box P \to P) \to \Box P.$$

In the version of the five-and-ten problem I gave, “$P$” is the proposition “if the agent outputs \$5 the universe outputs \$5, and if the agent outputs \$10 the universe outputs \$0”.

Supposing it is provable, the agent will eventually find the proof, and return \$5 in fact. This makes the sentence true, since the agent outputs \$5 and the universe outputs \$5, and since it’s false that the agent outputs \$10. This is because false propositions like “the agent outputs \$10” imply everything, including the universe outputting \$0.

The agent can (given enough time) prove all of this, in which case the agent in fact proves the proposition “if the agent outputs \$5 the universe outputs \$5, and if the agent outputs \$10 the universe outputs \$0”. And as a result, the agent takes the \$5.
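
The self-fulfilling part of this argument can be checked mechanically. The sketch below does no proof search and does not implement Löb’s theorem; it only evaluates the spurious proposition, read as material implication, in the world where the agent accepts it and in the world where it doesn’t. All names are hypothetical.

```python
def implies(p: bool, q: bool) -> bool:
    """Material implication: false only when p is true and q is false."""
    return (not p) or q


def universe(action: int) -> int:
    """The universe from the text: it pays out whatever the agent outputs."""
    return action


def spurious_claim(action: int) -> bool:
    """'If I take 5 I get 5, and if I take 10 I get 0', as material implication."""
    payoff = universe(action)
    return implies(action == 5, payoff == 5) and implies(action == 10, payoff == 0)


def agent(accepts_claim: bool) -> int:
    """Take 5 if the claim (which ranks 5 above 10) is accepted, else default to 10."""
    return 5 if accepts_claim else 10


claim_when_accepted = spurious_claim(agent(accepts_claim=True))
claim_when_rejected = spurious_claim(agent(accepts_claim=False))
```

If the agent acts on the claim, the claim comes out true, because its second half is only ever tested in a situation that never happens. If the agent ignores the claim and takes the 10, the claim comes out false. Nothing in the logic itself tells the agent which of these two self-consistent situations it should be in.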

Let’s assume we search for short proofs first. In this case, we will take the \$10, since it is very easy to show that $A() = 5$ leads to $U() = 5$ and $A() = 10$ leads to $U() = 10$ (writing $A()$ for the agent’s output and $U()$ for the universe’s).

The problem is that spurious proofs can be short too, and don’t get much longer when the universe gets harder to predict. If we replace the universe with one that is provably functionally the same, but is harder to predict, the shortest proof will short-circuit the complicated universe and be spurious.

People often try to solve the problem of counterfactuals by suggesting that there will always be some uncertainty. An AI may know its source code perfectly, but it can’t perfectly know the hardware it is running on.

Does adding a little uncertainty solve the problem? Often not:

• The proof of the spurious counterfactual often still goes through; if you think you are in a five-and-ten problem with a 95% certainty, you can have the usual problem within that 95%.

• Adding uncertainty to make counterfactuals well-defined doesn’t get you any guarantee that the counterfactuals will be reasonable. Hardware failures aren’t often what you want to expect when considering alternate actions.

Consider this scenario: You are confident that you almost always take the left path. However, it is possible (though unlikely) for a cosmic ray to damage your circuits, in which case you could go right, but you would then be insane, which would have many other bad consequences.

If this reasoning in itself is why you always go left, you’ve gone wrong.

So I’m not talking about agents who know their own actions because I think there’s going to be a big problem with intelligent machines inferring their own actions in the future. Rather, the possibility of knowing your own actions illustrates something confusing about determining the consequences of your actions, a confusion which shows up even in the very simple case where everything about the world is known and you just need to choose the larger pile of money.

Maybe we can force exploration actions, so that we learn what happens when we do things? This proposal runs into two problems:

• A bad prior can think that exploring is dangerous.

• Forcing it to take exploratory actions doesn’t teach it what the world would look like if it took those actions deliberately.

But writing down examples of “correct” counterfactual reasoning doesn’t seem hard from the outside!

Maybe that’s because from “outside” we always have a dualistic perspective. We are in fact sitting outside of the problem, and we’ve defined it as a function of an agent.

However, an agent can’t solve the problem in the same way from inside. From its perspective, its functional relationship with the environment isn’t an observable fact. This is why “counterfactuals” are called what they are called, after all.

When I told you about the 5 and 10 problem, I first told you about the problem, and then gave you an agent. When one agent doesn’t work well, we could consider a different agent.

Finding a way to succeed at a decision problem involves finding an agent that when plugged into the problem takes the right action. The fact that we can even consider putting in different agents means that we have already carved the universe into an “agent” part, plus the rest of the universe with a hole for the agent, which is most of the work!

#### 2.3. Updatelessness

Are we just fooling ourselves due to the way we set up decision problems, then? Are there no “correct” counterfactuals?

Well, maybe we are fooling ourselves. But there is still something we are confused about! “Counterfactuals are subjective, invented by the agent” doesn’t dissolve the mystery. There is something intelligent agents do, in the real world, to make decisions.

Updateless decision theory (UDT) views the problem from “closer to the outside”. It does this by picking the action which the agent would have wanted to commit to before getting into the situation.

Consider the following game: Alice receives a card at random which is either High or Low. She may reveal the card if she wishes. Bob then gives his probability $p$ that Alice has a high card. Alice always loses $p^2$ dollars. Bob loses $p^2$ if the card is low, and $(1-p)^2$ if the card is high.

Bob has a proper scoring rule, so does best by giving his true belief. Alice just wants Bob’s belief to be as much toward “low” as possible.

Suppose Alice will play only this one time. She sees a low card. Bob is good at reasoning about Alice, but is in the next room and so can’t read any tells. Should Alice reveal her card?

Since Alice’s card is low, if she shows it to Bob, she will lose no money, which is the best possible outcome. However, this means that in the counterfactual world where Alice sees a high card, she wouldn’t be able to keep the secret; she might as well show her card in that case too, since her reluctance to show it would be as reliable a sign of “high”.

On the other hand, if Alice doesn’t show her card, she loses 25¢, but then she can use the same strategy in the other world, rather than losing \$1. So, before playing the game, Alice would want to visibly commit to not reveal; this makes expected loss 25¢, whereas the other strategy has expected loss 50¢.
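
The two expected losses can be checked with a few lines of arithmetic. This sketch assumes the deal is a fair 50/50, that Alice’s loss is quadratic in Bob’s announced probability (which is what the 25¢ and \$1 figures imply), and that Bob knows Alice’s policy. The function names are hypothetical.

```python
def alice_loss(p: float) -> float:
    """Alice's loss in dollars when Bob announces probability p of 'high'.
    Assumed quadratic, which matches the 25-cent and one-dollar figures."""
    return p ** 2


def expected_loss(reveal_low: bool) -> float:
    """Expected loss over a fair High/Low deal, for a policy fixed in advance."""
    if reveal_low:
        # Low gets revealed, so Bob says 0; silence then signals High, so Bob says 1.
        p_given_low, p_given_high = 0.0, 1.0
    else:
        # Alice never reveals, so Bob learns nothing and says 0.5 either way.
        p_given_low, p_given_high = 0.5, 0.5
    return 0.5 * alice_loss(p_given_low) + 0.5 * alice_loss(p_given_high)


never_reveal = expected_loss(reveal_low=False)
reveal_when_low = expected_loss(reveal_low=True)
```

Evaluated card by card, revealing a low card looks free. Evaluated as a policy before the deal, it costs twice as much on average, and that second view is the one UDT takes.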

This game is equiv­a­lent to the de­ci­sion prob­lem called coun­ter­fac­tual mug­ging. UDT solves such prob­lems by recom­mend­ing that the agent do what­ever would have seemed wis­est be­fore—what­ever your ear­lier self would have com­mit­ted to do.

UDT is an el­e­gant solu­tion to a fairly broad class of de­ci­sion prob­lems. How­ever, it only makes sense if the ear­lier self can fore­see all pos­si­ble situ­a­tions.

This works fine in a Bayesian set­ting where the prior already con­tains all pos­si­bil­ities within it­self. How­ever, there may be no way to do this in a re­al­is­tic em­bed­ded set­ting. An agent has to be able to think of new pos­si­bil­ities—mean­ing that its ear­lier self doesn’t know enough to make all the de­ci­sions.

And with that, we find our­selves squarely fac­ing the prob­lem of em­bed­ded world-mod­els.

### 3. Embed­ded world-models

An agent which is larger than its environment can:

• Hold an exact model of the environment in its head.

• Think through the con­se­quences of ev­ery po­ten­tial course of ac­tion.

• If it doesn’t know the en­vi­ron­ment perfectly, hold ev­ery pos­si­ble way the en­vi­ron­ment could be in its head, as is the case with Bayesian un­cer­tainty.

All of these are typ­i­cal of no­tions of ra­tio­nal agency.

An em­bed­ded agent can’t do any of those things, at least not in any straight­for­ward way.

One difficulty is that, since the agent is part of the en­vi­ron­ment, mod­el­ing the en­vi­ron­ment in ev­ery de­tail would re­quire the agent to model it­self in ev­ery de­tail, which would re­quire the agent’s self-model to be as “big” as the whole agent. An agent can’t fit in­side its own head.

The lack of a crisp agent/​en­vi­ron­ment bound­ary forces us to grap­ple with para­doxes of self-refer­ence. As if rep­re­sent­ing the rest of the world weren’t already hard enough.

Embedded world-models have to represent the world in a way more appropriate for embedded agents. Problems in this cluster include:

• the “re­al­iz­abil­ity” /​ “grain of truth” prob­lem: the real world isn’t in the agent’s hy­poth­e­sis space

• log­i­cal uncertainty

• high-level models

• multi-level models

• on­tolog­i­cal crises

• nat­u­ral­ized in­duc­tion, the prob­lem that the agent must in­cor­po­rate its model of it­self into its world-model

• an­thropic rea­son­ing, the prob­lem of rea­son­ing with how many copies of your­self exist

#### 3.1. Realizability

In a Bayesian set­ting, where an agent’s un­cer­tainty is quan­tified by a prob­a­bil­ity dis­tri­bu­tion over pos­si­ble wor­lds, a com­mon as­sump­tion is “re­al­iz­abil­ity”: the true un­der­ly­ing en­vi­ron­ment which is gen­er­at­ing the ob­ser­va­tions is as­sumed to have at least some prob­a­bil­ity in the prior.

In game the­ory, this same prop­erty is de­scribed by say­ing a prior has a “grain of truth”. It should be noted, though, that there are ad­di­tional bar­ri­ers to get­ting this prop­erty in a game-the­o­retic set­ting; so, in their com­mon us­age cases, “grain of truth” is tech­ni­cally de­mand­ing while “re­al­iz­abil­ity” is a tech­ni­cal con­ve­nience.

Realizability is not totally necessary in order for Bayesian reasoning to make sense. If you think of a set of hypotheses as “experts”, and the current posterior probability as how much you “trust” each expert, then learning according to Bayes’ Law, $P(h \mid e) = \frac{P(e \mid h) \cdot P(h)}{P(e)}$, ensures a relative bounded loss property.

Specifically, if you use a prior $\pi$, the amount worse you are (in log loss) in comparison to each expert $h$ is at most $-\log \pi(h)$, since you assign at least probability $\pi(h) \cdot h(e)$ to seeing a sequence of evidence $e$. Intuitively, $\pi(h)$ is your initial trust in expert $h$, and in each case where it is even a little bit more correct than you, you increase your trust accordingly. The way you do this ensures you assign an expert probability 1 and hence copy it precisely before you lose more than $-\log \pi(h)$ compared to it.
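
Here is a small numerical check of the bounded loss property, as a sketch under simplifying assumptions: three hypothetical “experts” that each predict bits with a fixed probability, a uniform prior, and a short hand-picked bit sequence. None of these choices come from the AIXI setting; they are only there to make the bound visible.

```python
import math
from typing import Dict, List


def mixture_log_loss(prior: Dict[str, float],
                     experts: Dict[str, float],
                     bits: List[int]) -> float:
    """Total log loss of the Bayes mixture on a bit sequence.
    Each expert is a fixed probability that the next bit is 1."""
    weights = dict(prior)
    loss = 0.0
    for bit in bits:
        likelihood = {h: (q if bit == 1 else 1.0 - q) for h, q in experts.items()}
        p_bit = sum(weights[h] * likelihood[h] for h in experts)  # P(e)
        loss += -math.log(p_bit)
        # Bayes' Law: P(h|e) = P(e|h) * P(h) / P(e)
        weights = {h: weights[h] * likelihood[h] / p_bit for h in experts}
    return loss


def expert_log_loss(q: float, bits: List[int]) -> float:
    """Total log loss of a single fixed-probability expert."""
    return sum(-math.log(q if b == 1 else 1.0 - q) for b in bits)


experts = {"mostly_ones": 0.9, "fair": 0.5, "mostly_zeros": 0.1}
prior = {h: 1.0 / 3.0 for h in experts}
bits = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]

total = mixture_log_loss(prior, experts, bits)
regret = {h: total - expert_log_loss(q, bits) for h, q in experts.items()}
bound = -math.log(1.0 / 3.0)  # -log pi(h) for a uniform prior over three experts
```

Every entry of `regret` stays at or below `bound`, whichever expert turns out to be best and whether or not any of them is “true”. That is the sense in which the guarantee doesn’t need realizability.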

The prior AIXI is based on is the Solomonoff prior. It is defined as the out­put of a uni­ver­sal Tur­ing ma­chine (UTM) whose in­puts are coin-flips.

In other words, feed a UTM a ran­dom pro­gram. Nor­mally, you’d think of a UTM as only be­ing able to simu­late de­ter­minis­tic ma­chines. Here, how­ever, the ini­tial in­puts can in­struct the UTM to use the rest of the in­finite in­put tape as a source of ran­dom­ness to simu­late a stochas­tic Tur­ing ma­chine.

Com­bin­ing this with the pre­vi­ous idea about view­ing Bayesian learn­ing as a way of al­lo­cat­ing “trust” to “ex­perts” which meets a bounded loss con­di­tion, we can see the Solomonoff prior as a kind of ideal ma­chine learn­ing al­gorithm which can learn to act like any al­gorithm you might come up with, no mat­ter how clever.

For this rea­son, we shouldn’t nec­es­sar­ily think of AIXI as “as­sum­ing the world is com­putable”, even though it rea­sons via a prior over com­pu­ta­tions. It’s get­ting bounded loss on its pre­dic­tive ac­cu­racy as com­pared with any com­putable pre­dic­tor. We should rather say that AIXI as­sumes all pos­si­ble al­gorithms are com­putable, not that the world is.

How­ever, lack­ing re­al­iz­abil­ity can cause trou­ble if you are look­ing for any­thing more than bounded-loss pre­dic­tive ac­cu­racy:

• the pos­te­rior can os­cillate for­ever;

• prob­a­bil­ities may not be cal­ibrated;

• es­ti­mates of statis­tics such as the mean may be ar­bi­trar­ily bad;

• es­ti­mates of la­tent vari­ables may be bad;

• and the iden­ti­fi­ca­tion of causal struc­ture may not work.

So does AIXI perform well with­out a re­al­iz­abil­ity as­sump­tion? We don’t know. De­spite get­ting bounded loss for pre­dic­tions with­out re­al­iz­abil­ity, ex­ist­ing op­ti­mal­ity re­sults for its ac­tions re­quire an added re­al­iz­abil­ity as­sump­tion.

First, if the en­vi­ron­ment re­ally is sam­pled from the Solomonoff dis­tri­bu­tion, AIXI gets the max­i­mum ex­pected re­ward. But this is fairly triv­ial; it is es­sen­tially the defi­ni­tion of AIXI.

Se­cond, if we mod­ify AIXI to take some­what ran­dom­ized ac­tions—Thomp­son sam­pling—there is an asymp­totic op­ti­mal­ity re­sult for en­vi­ron­ments which act like any stochas­tic Tur­ing ma­chine.

So, ei­ther way, re­al­iz­abil­ity was as­sumed in or­der to prove any­thing. (See Jan Leike, Non­para­met­ric Gen­eral Re­in­force­ment Learn­ing.)

But the con­cern I’m point­ing at is not “the world might be un­com­putable, so we don’t know if AIXI will do well”; this is more of an illus­tra­tive case. The con­cern is that AIXI is only able to define in­tel­li­gence or ra­tio­nal­ity by con­struct­ing an agent much, much big­ger than the en­vi­ron­ment which it has to learn about and act within.

Lau­rent Orseau pro­vides a way of think­ing about this in “Space-Time Embed­ded In­tel­li­gence”. How­ever, his ap­proach defines the in­tel­li­gence of an agent in terms of a sort of su­per-in­tel­li­gent de­signer who thinks about re­al­ity from out­side, se­lect­ing an agent to place into the en­vi­ron­ment.

Embedded agents don’t have the luxury of stepping outside of the universe to think about how to think. What we would like would be a theory of rational belief for situated agents which provides foundations as strong as those Bayesianism provides for dualistic agents.

Imag­ine a com­puter sci­ence the­ory per­son who is hav­ing a dis­agree­ment with a pro­gram­mer. The the­ory per­son is mak­ing use of an ab­stract model. The pro­gram­mer is com­plain­ing that the ab­stract model isn’t some­thing you would ever run, be­cause it is com­pu­ta­tion­ally in­tractable. The the­ory per­son re­sponds that the point isn’t to ever run it. Rather, the point is to un­der­stand some phe­nomenon which will also be rele­vant to more tractable things which you would want to run.

I bring this up in or­der to em­pha­size that my per­spec­tive is a lot more like the the­ory per­son’s. I’m not talk­ing about AIXI to say “AIXI is an ideal­iza­tion you can’t run”. The an­swers to the puz­zles I’m point­ing at don’t need to run. I just want to un­der­stand some phe­nom­ena.

However, sometimes the thing that makes a theoretical model less tractable also makes that model too different from the phenomenon we’re interested in.

The way AIXI wins games is by as­sum­ing we can do true Bayesian up­dat­ing over a hy­poth­e­sis space, as­sum­ing the world is in our hy­poth­e­sis space, etc. So it can tell us some­thing about the as­pect of re­al­is­tic agency that’s ap­prox­i­mately do­ing Bayesian up­dat­ing over an ap­prox­i­mately-good-enough hy­poth­e­sis space. But em­bed­ded agents don’t just need ap­prox­i­mate solu­tions to that prob­lem; they need to solve sev­eral prob­lems that are differ­ent in kind from that prob­lem.

#### 3.2. Self-reference

One ma­jor ob­sta­cle a the­ory of em­bed­ded agency must deal with is self-refer­ence.

Para­doxes of self-refer­ence such as the liar para­dox make it not just wildly im­prac­ti­cal, but in a cer­tain sense im­pos­si­ble for an agent’s world-model to ac­cu­rately re­flect the world.

The liar para­dox con­cerns the sta­tus of the sen­tence “This sen­tence is not true”. If it were true, it must be false; and if not true, it must be true.

The difficulty comes in part from try­ing to draw a map of a ter­ri­tory which in­cludes the map it­self.

This is fine if the world “holds still” for us; but be­cause the map is in the world, it may im­ple­ment some func­tion.

Sup­pose our goal is to make an ac­cu­rate map of the fi­nal route of a road which is cur­rently un­der con­struc­tion. Sup­pose we also know that the con­struc­tion team will get to see our map, and that con­struc­tion will pro­ceed so as to dis­prove what­ever map we make. This puts us in a liar-para­dox-like situ­a­tion.

Prob­lems of this kind be­come rele­vant for de­ci­sion-mak­ing in the the­ory of games. A sim­ple game of rock-pa­per-scis­sors can in­tro­duce a liar para­dox if the play­ers try to win, and can pre­dict each other bet­ter than chance.

Game the­ory solves this type of prob­lem with game-the­o­retic equil­ibria. But the prob­lem ends up com­ing back in a differ­ent way.

I men­tioned that the prob­lem of re­al­iz­abil­ity takes on a differ­ent char­ac­ter in the con­text of game the­ory. In an ML set­ting, re­al­iz­abil­ity is a po­ten­tially un­re­al­is­tic as­sump­tion, but can usu­ally be as­sumed con­sis­tently nonethe­less.

In game the­ory, on the other hand, the as­sump­tion it­self may be in­con­sis­tent. This is be­cause games com­monly yield para­doxes of self-refer­ence.

Be­cause there are so many agents, it is no longer pos­si­ble in game the­ory to con­ve­niently make an “agent” a thing which is larger than a world. So game the­o­rists are forced to in­ves­ti­gate no­tions of ra­tio­nal agency which can han­dle a large world.

Un­for­tu­nately, this is done by split­ting up the world into “agent” parts and “non-agent” parts, and han­dling the agents in a spe­cial way. This is al­most as bad as du­al­is­tic mod­els of agency.

In rock-paper-scissors, the liar paradox is resolved by stipulating that each player play each move with $1/3$ probability. If one player plays this way, then the other loses nothing by doing so. This way of introducing probabilistic play to resolve would-be paradoxes of game theory is called a Nash equilibrium.
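As a quick check of the claim that nothing is lost by mixing uniformly, here is a small sketch. The payoff convention (+1 win, -1 loss, 0 tie) and the `rock_heavy` opponent are assumptions chosen for illustration.

```python
from fractions import Fraction

MOVES = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def payoff(mine, theirs):
    """+1 for a win, -1 for a loss, 0 for a tie."""
    if mine == theirs:
        return 0
    return 1 if BEATS[mine] == theirs else -1

def expected_payoff(my_mix, their_mix):
    """Expected payoff when both players sample their moves independently."""
    return sum(my_mix[a] * their_mix[b] * payoff(a, b)
               for a in MOVES for b in MOVES)

uniform = {m: Fraction(1, 3) for m in MOVES}
pure = {m: {k: Fraction(int(k == m)) for k in MOVES} for m in MOVES}

# Against uniform play, every pure strategy has the same expected payoff,
# so no prediction of the opponent can be turned into an advantage.
vs_uniform = {m: expected_payoff(pure[m], uniform) for m in MOVES}

# A predictable opponent, by contrast, can be exploited.
rock_heavy = {"rock": Fraction(1, 2), "paper": Fraction(1, 4), "scissors": Fraction(1, 4)}
vs_rock_heavy = {m: expected_payoff(pure[m], rock_heavy) for m in MOVES}

print(vs_uniform)
print(vs_rock_heavy)
```

Because every response to the uniform mix does equally well, the uniform mix is itself a best response, which is what makes the pair of uniform strategies an equilibrium.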

We can use Nash equil­ibria to pre­vent the as­sump­tion that the agents cor­rectly un­der­stand the world they’re in from be­ing in­con­sis­tent. How­ever, that works just by tel­ling the agents what the world looks like. What if we want to model agents who learn about the world, more like AIXI?

The grain of truth problem is the problem of formulating a reasonably bounded prior probability distribution which would allow agents playing games to place some positive probability on each other’s true (probabilistic) behavior, without knowing it precisely from the start.

Un­til re­cently, known solu­tions to the prob­lem were quite limited. Benja Fallen­stein, Jes­sica Tay­lor, and Paul Chris­ti­ano’s “Reflec­tive Or­a­cles: A Foun­da­tion for Clas­si­cal Game The­ory” pro­vides a very gen­eral solu­tion. (For de­tails, see “A For­mal Solu­tion to the Grain of Truth Prob­lem” by Jan Leike, Jes­sica Tay­lor, and Benja Fallen­stein.)

Reflective oracles also solve the problem with game-theoretic notions of rationality I mentioned earlier. They allow agents to be reasoned about in the same manner as other parts of the environment, rather than treating them as a fundamentally special case.

If a Tur­ing ma­chine can use an or­a­cle ma­chine to solve a prob­lem, then we can always jump to a prob­lem that is un­de­cid­able even with that or­a­cle; and if we bring in a stronger or­a­cle to solve the new prob­lem, then we can re­peat this pro­cess with the new or­a­cle.

Reflec­tive or­a­cles work by twist­ing the or­di­nary Tur­ing uni­verse back on it­self, so that rather than an in­finite hi­er­ar­chy of ever-stronger or­a­cles, you define an or­a­cle that serves as its own or­a­cle ma­chine.

This would nor­mally in­tro­duce con­tra­dic­tions, but re­flec­tive or­a­cles avoid this by ran­dom­iz­ing their out­put in cases where they would run into para­doxes. As a re­sult, an agent with ac­cess to a re­flec­tive or­a­cle can rea­son about the be­hav­ior of any other agent with ac­cess to a re­flec­tive or­a­cle.

How­ever, mod­els of ra­tio­nal agents based on re­flec­tive or­a­cles still have sev­eral ma­jor limi­ta­tions. One of these is that agents are re­quired to have un­limited pro­cess­ing power, just like AIXI, and so are as­sumed to know all of the con­se­quences of their own be­liefs.

In fact, know­ing all the con­se­quences of your be­liefs—a prop­erty known as log­i­cal om­ni­science—turns out to be rather core to clas­si­cal Bayesian ra­tio­nal­ity.

#### 3.3. Log­i­cal uncertainty

So far, I’ve been talk­ing in a fairly naive way about the agent hav­ing be­liefs about hy­pothe­ses, and the real world be­ing or not be­ing in the hy­poth­e­sis space.

It isn’t re­ally clear what any of that means.

Depend­ing on how we define things, it may ac­tu­ally be quite pos­si­ble for an agent to be smaller than the world and yet con­tain the right world-model—it might know the true physics and ini­tial con­di­tions, but only be ca­pa­ble of in­fer­ring their con­se­quences very ap­prox­i­mately.

Uncer­tainty about the con­se­quences of your be­liefs is log­i­cal un­cer­tainty. In this case, the agent might be em­piri­cally cer­tain of a unique math­e­mat­i­cal de­scrip­tion pin­point­ing which uni­verse she’s in, while be­ing log­i­cally un­cer­tain of most con­se­quences of that de­scrip­tion.

Logic and prob­a­bil­ity the­ory are two great triumphs in the cod­ifi­ca­tion of ra­tio­nal thought. How­ever, the two don’t work to­gether as well as one might think.

Prob­a­bil­ity is like a scale, with wor­lds as weights. An ob­ser­va­tion elimi­nates some of the pos­si­ble wor­lds, re­mov­ing weights and shift­ing the bal­ance of be­liefs.

Logic is like a tree, grow­ing from the seed of ax­ioms ac­cord­ing to in­fer­ence rules. For real-world agents, the pro­cess of growth is never com­plete; you never know all the con­se­quences of each be­lief.

Bayesian hy­poth­e­sis test­ing re­quires each hy­poth­e­sis to clearly de­clare which prob­a­bil­ities it as­signs to which ob­ser­va­tions. That way, you know how much to rescale the odds when you make an ob­ser­va­tion. If we don’t know the con­se­quences of a be­lief, we don’t know how much credit to give it for mak­ing pre­dic­tions.

This is like not know­ing where to place the weights on the scales of prob­a­bil­ity. We could try putting weights on both sides un­til a proof rules one out, but then the be­liefs just os­cillate for­ever rather than do­ing any­thing use­ful.

This forces us to grap­ple di­rectly with the prob­lem of a world that’s larger than the agent. We want some no­tion of bound­edly ra­tio­nal be­liefs about un­cer­tain con­se­quences; but any com­putable be­liefs about logic must have left out some­thing, since the tree of log­i­cal im­pli­ca­tions will grow larger than any con­tainer.

For a Bayesian, the scales of prob­a­bil­ity are bal­anced in pre­cisely such a way that no Dutch book can be made against them—no se­quence of bets that are a sure loss. But you can only ac­count for all Dutch books if you know all the con­se­quences of your be­liefs. Ab­sent that, some­one who has ex­plored other parts of the tree can Dutch-book you.
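To make the Dutch book idea concrete, here is a minimal sketch. The prices and the helper `net_result` are hypothetical. The agent has not yet worked out that its prices for A and for not-A must sum to 1 (think of "not A" as some sentence whose equivalence to the negation of A lies in a part of the tree the agent has not explored), so a bookie who has explored that part sells it both bets.

```python
from fractions import Fraction

def net_result(prices, a_is_true):
    """Agent's net gain after buying, at its own prices, a $1 bet on A and a $1 bet on not-A."""
    cost = prices["A"] + prices["not A"]
    # Exactly one of the two bets pays out, whichever way A turns out.
    payout = (1 if a_is_true else 0) + (0 if a_is_true else 1)
    return payout - cost

incoherent = {"A": Fraction(3, 5), "not A": Fraction(3, 5)}  # prices sum to more than 1
coherent = {"A": Fraction(3, 5), "not A": Fraction(2, 5)}    # prices sum to exactly 1

losses = [net_result(incoherent, truth) for truth in (True, False)]
fair = [net_result(coherent, truth) for truth in (True, False)]

print(losses)  # the same negative number in both worlds: a sure loss
print(fair)    # zero in both worlds: no guaranteed loss
```

The agent with coherent prices cannot be exploited this way, but reaching coherence required knowing a logical fact about how the two sentences relate.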

But hu­man math­e­mat­i­ci­ans don’t seem to run into any spe­cial difficulty in rea­son­ing about math­e­mat­i­cal un­cer­tainty, any more than we do with em­piri­cal un­cer­tainty. So what char­ac­ter­izes good rea­son­ing un­der math­e­mat­i­cal un­cer­tainty, if not im­mu­nity to mak­ing bad bets?

One an­swer is to weaken the no­tion of Dutch books so that we only al­low bets based on quickly com­putable parts of the tree. This is one of the ideas be­hind Garrabrant et al.’s “Log­i­cal In­duc­tion”, an early at­tempt at defin­ing some­thing like “Solomonoff in­duc­tion, but for rea­son­ing that in­cor­po­rates math­e­mat­i­cal un­cer­tainty”.

#### 3.4. High-level models

Another con­se­quence of the fact that the world is big­ger than you is that you need to be able to use high-level world mod­els: mod­els which in­volve things like ta­bles and chairs.

This is re­lated to the clas­si­cal sym­bol ground­ing prob­lem; but since we want a for­mal anal­y­sis which in­creases our trust in some sys­tem, the kind of model which in­ter­ests us is some­what differ­ent. This also re­lates to trans­parency and in­formed over­sight: world-mod­els should be made out of un­der­stand­able parts.

A re­lated ques­tion is how high-level rea­son­ing and low-level rea­son­ing re­late to each other and to in­ter­me­di­ate lev­els: multi-level world mod­els.

Stan­dard prob­a­bil­is­tic rea­son­ing doesn’t provide a very good ac­count of this sort of thing. It’s as though you have differ­ent Bayes nets which de­scribe the world at differ­ent lev­els of ac­cu­racy, and pro­cess­ing power limi­ta­tions force you to mostly use the less ac­cu­rate ones, so you have to de­cide how to jump to the more ac­cu­rate as needed.

Ad­di­tion­ally, the mod­els at differ­ent lev­els don’t line up perfectly, so you have a prob­lem of trans­lat­ing be­tween them; and the mod­els may have se­ri­ous con­tra­dic­tions be­tween them. This might be fine, since high-level mod­els are un­der­stood to be ap­prox­i­ma­tions any­way, or it could sig­nal a se­ri­ous prob­lem in the higher- or lower-level mod­els, re­quiring their re­vi­sion.

This is es­pe­cially in­ter­est­ing in the case of on­tolog­i­cal crises, in which ob­jects we value turn out not to be a part of “bet­ter” mod­els of the world.

It seems fair to say that ev­ery­thing hu­mans value ex­ists in high-level mod­els only, which from a re­duc­tion­is­tic per­spec­tive is “less real” than atoms and quarks. How­ever, be­cause our val­ues aren’t defined on the low level, we are able to keep our val­ues even when our knowl­edge of the low level rad­i­cally shifts. (We would also like to be able to say some­thing about what hap­pens to val­ues if the high level rad­i­cally shifts.)

Another crit­i­cal as­pect of em­bed­ded world mod­els is that the agent it­self must be in the model, since the agent seeks to un­der­stand the world, and the world can­not be fully sep­a­rated from one­self. This opens the door to difficult prob­lems of self-refer­ence and an­thropic de­ci­sion the­ory.

Nat­u­ral­ized in­duc­tion is the prob­lem of learn­ing world-mod­els which in­clude your­self in the en­vi­ron­ment. This is challeng­ing be­cause (as Cas­par Oester­held has put it) there is a type mis­match be­tween “men­tal stuff” and “physics stuff”.

AIXI con­ceives of the en­vi­ron­ment as if it were made with a slot which the agent fits into. We might in­tu­itively rea­son in this way, but we can also un­der­stand a phys­i­cal per­spec­tive from which this looks like a bad model. We might imag­ine in­stead that the agent sep­a­rately rep­re­sents: self-knowl­edge available to in­tro­spec­tion; hy­pothe­ses about what the uni­verse is like; and a “bridg­ing hy­poth­e­sis” con­nect­ing the two.

There are in­ter­est­ing ques­tions of how this could work. There’s also the ques­tion of whether this is the right struc­ture at all. It’s cer­tainly not how I imag­ine ba­bies learn­ing.

Thomas Nagel would say that this way of ap­proach­ing the prob­lem in­volves “views from nowhere”; each hy­poth­e­sis posits a world as if seen from out­side. This is per­haps a strange thing to do.

A spe­cial case of agents need­ing to rea­son about them­selves is agents need­ing to rea­son about their fu­ture self.

To make long-term plans, agents need to be able to model how they’ll act in the fu­ture, and have a cer­tain kind of trust in their fu­ture goals and rea­son­ing abil­ities. This in­cludes trust­ing fu­ture selves that have learned and grown a great deal.

In a tra­di­tional Bayesian frame­work, “learn­ing” means Bayesian up­dat­ing. But as we noted, Bayesian up­dat­ing re­quires that the agent start out large enough to con­sider a bunch of ways the world can be, and learn by rul­ing some of these out.

Embed­ded agents need re­source-limited, log­i­cally un­cer­tain up­dates, which don’t work like this.

Un­for­tu­nately, Bayesian up­dat­ing is the main way we know how to think about an agent pro­gress­ing through time as one unified agent. The Dutch book jus­tifi­ca­tion for Bayesian rea­son­ing is ba­si­cally say­ing this kind of up­dat­ing is the only way to not have the agent’s ac­tions on Mon­day work at cross pur­poses, at least a lit­tle, to the agent’s ac­tions on Tues­day.

Embed­ded agents are non-Bayesian. And non-Bayesian agents tend to get into wars with their fu­ture selves.

Which brings us to our next set of prob­lems: ro­bust del­e­ga­tion.

### 4. Ro­bust delegation

Be­cause the world is big, the agent as it is may be in­ad­e­quate to ac­com­plish its goals, in­clud­ing in its abil­ity to think.

Be­cause the agent is made of parts, it can im­prove it­self and be­come more ca­pa­ble.

Im­prove­ments can take many forms: The agent can make tools, the agent can make suc­ces­sor agents, or the agent can just learn and grow over time. How­ever, the suc­ces­sors or tools need to be more ca­pa­ble for this to be worth­while.

This gives rise to a special type of principal/agent problem:

You have an ini­tial agent, and a suc­ces­sor agent. The ini­tial agent gets to de­cide ex­actly what the suc­ces­sor agent looks like. The suc­ces­sor agent, how­ever, is much more in­tel­li­gent and pow­er­ful than the ini­tial agent. We want to know how to have the suc­ces­sor agent ro­bustly op­ti­mize the ini­tial agent’s goals.

The prob­lem is not (just) that the suc­ces­sor agent might be mal­i­cious. The prob­lem is that we don’t even know what it means not to be.

This prob­lem seems hard from both points of view.

The ini­tial agent needs to figure out how re­li­able and trust­wor­thy some­thing more pow­er­ful than it is, which seems very hard. But the suc­ces­sor agent has to figure out what to do in situ­a­tions that the ini­tial agent can’t even un­der­stand, and try to re­spect the goals of some­thing that the suc­ces­sor can see is in­con­sis­tent, which also seems very hard.

At first, this may look like a less fun­da­men­tal prob­lem than “make de­ci­sions” or “have mod­els”. But the view on which there are mul­ti­ple forms of the “build a suc­ces­sor” prob­lem is a du­al­is­tic view.

To an em­bed­ded agent, the fu­ture self is not priv­ileged; it is just an­other part of the en­vi­ron­ment. There isn’t a deep differ­ence be­tween build­ing a suc­ces­sor that shares your goals, and just mak­ing sure your own goals stay the same over time.

So, al­though I talk about “ini­tial” and “suc­ces­sor” agents, re­mem­ber that this isn’t just about the nar­row prob­lem hu­mans cur­rently face of aiming a suc­ces­sor. This is about the fun­da­men­tal prob­lem of be­ing an agent that per­sists and learns over time.

We call this cluster of prob­lems Ro­bust Del­e­ga­tion. Ex­am­ples in­clude:

#### 4.1. Vingean reflection

Imag­ine you are play­ing the CIRL game with a tod­dler.

CIRL means Co­op­er­a­tive In­verse Re­in­force­ment Learn­ing. The idea be­hind CIRL is to define what it means for a robot to col­lab­o­rate with a hu­man. The robot tries to pick helpful ac­tions, while si­mul­ta­neously try­ing to figure out what the hu­man wants.

Usu­ally, we think about this from the point of view of the hu­man. But now con­sider the prob­lem faced by the robot, where they’re try­ing to help some­one who is very con­fused about the uni­verse. Imag­ine try­ing to help a tod­dler op­ti­mize their goals.

• From your stand­point, the tod­dler may be too ir­ra­tional to be seen as op­ti­miz­ing any­thing.

• The tod­dler may have an on­tol­ogy in which it is op­ti­miz­ing some­thing, but you can see that on­tol­ogy doesn’t make sense.

• Maybe you no­tice that if you set up ques­tions in the right way, you can make the tod­dler seem to want al­most any­thing.

Part of the prob­lem is that the “helping” agent has to be big­ger in some sense in or­der to be more ca­pa­ble; but this seems to im­ply that the “helped” agent can’t be a very good su­per­vi­sor for the “helper”.

For example, updateless decision theory eliminates dynamic inconsistencies in decision theory by maximizing the expected utility of reactions to observations, evaluated from a state of ignorance, rather than maximizing the expected utility of your action given what you know.

Appealing as this may be as a way to achieve reflective consistency, it creates a strange situation in terms of computational complexity: If actions are type $A$, and observations are type $O$, reactions to observations are type $O \to A$, a much larger space to optimize over than $A$ alone. And we’re expecting our smaller self to be able to do that!
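To get a feel for the blow-up, here is a tiny enumeration. The action and observation names are made up for illustration. With a set of actions $A$ and a set of observations $O$, a reaction to observations picks one action per observation, so there are $|A|^{|O|}$ of them.

```python
from itertools import product

actions = ["left", "right", "wait"]                 # hypothetical action set A
observations = ["red", "green", "blue", "dark"]     # hypothetical observation set O

# A policy (a "reaction to observations") assigns one action to every observation.
policies = [dict(zip(observations, choice))
            for choice in product(actions, repeat=len(observations))]

num_actions = len(actions)
num_policies = len(policies)

print(num_actions, num_policies)
```

Even this toy case goes from 3 options to 81, and the gap grows exponentially with the number of distinguishable observations.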

One way to more crisply state the prob­lem is: We should be able to trust that our fu­ture self is ap­ply­ing its in­tel­li­gence to the pur­suit of our goals with­out be­ing able to pre­dict pre­cisely what our fu­ture self will do. This crite­rion is called Vingean re­flec­tion.

For ex­am­ple, you might plan your driv­ing route be­fore vis­it­ing a new city, but you do not plan your steps. You plan to some level of de­tail, and trust that your fu­ture self can figure out the rest.

Vingean re­flec­tion is difficult to ex­am­ine via clas­si­cal Bayesian de­ci­sion the­ory be­cause Bayesian de­ci­sion the­ory as­sumes log­i­cal om­ni­science. Given log­i­cal om­ni­science, the as­sump­tion “the agent knows its fu­ture ac­tions are ra­tio­nal” is syn­ony­mous with the as­sump­tion “the agent knows its fu­ture self will act ac­cord­ing to one par­tic­u­lar op­ti­mal policy which the agent can pre­dict in ad­vance”.

We have some limited mod­els of Vingean re­flec­tion (see “Tiling Agents for Self-Mod­ify­ing AI, and the Löbian Ob­sta­cle” by Yud­kowsky and Her­reshoff). A suc­cess­ful ap­proach must walk the nar­row line be­tween two prob­lems:

• The Löbian Ob­sta­cle: Agents who trust their fu­ture self be­cause they trust the out­put of their own rea­son­ing are in­con­sis­tent.

• The Pro­cras­ti­na­tion Para­dox: Agents who trust their fu­ture selves with­out rea­son tend to be con­sis­tent but un­sound and un­trust­wor­thy, and will put off tasks for­ever be­cause they can do them later.

The Vingean re­flec­tion re­sults so far ap­ply only to limited sorts of de­ci­sion pro­ce­dures, such as satis­ficers aiming for a thresh­old of ac­cept­abil­ity. So there is plenty of room for im­prove­ment, get­ting tiling re­sults for more use­ful de­ci­sion pro­ce­dures and un­der weaker as­sump­tions.

How­ever, there is more to the ro­bust del­e­ga­tion prob­lem than just tiling and Vingean re­flec­tion.

When you con­struct an­other agent, rather than del­e­gat­ing to your fu­ture self, you more di­rectly face a prob­lem of value load­ing.

#### 4.2. Good­hart’s law

The main prob­lems in the con­text of value load­ing:

The mis­speci­fi­ca­tion-am­plify­ing effect is known as Good­hart’s law, named for Charles Good­hart’s ob­ser­va­tion: “Any ob­served statis­ti­cal reg­u­lar­ity will tend to col­lapse once pres­sure is placed upon it for con­trol pur­poses.”

When we spec­ify a tar­get for op­ti­miza­tion, it is rea­son­able to ex­pect it to be cor­re­lated with what we want—highly cor­re­lated, in some cases. Un­for­tu­nately, how­ever, this does not mean that op­ti­miz­ing it will get us closer to what we want—es­pe­cially at high lev­els of op­ti­miza­tion.

There are (at least) four types of Good­hart: re­gres­sional, causal, ex­tremal, and ad­ver­sar­ial.

Re­gres­sional Good­hart hap­pens when there is a less than perfect cor­re­la­tion be­tween the proxy and the goal. It is more com­monly known as the op­ti­mizer’s curse, and it is re­lated to re­gres­sion to the mean.

An unbiased estimate of $y$ given $x$ is not an unbiased estimate of $y$ when we select for the best $x$. In that sense, we can expect to be disappointed when we use $x$ as a proxy for $y$ for optimization purposes.

Us­ing a Bayes es­ti­mate in­stead of an un­bi­ased es­ti­mate, we can elimi­nate this sort of pre­dictable dis­ap­point­ment.

This doesn’t necessarily allow us to get a better $y$ value, since we still only have the information content of $x$ to work with. However, it sometimes may. If $y$ is normally distributed with variance $1$, and $x$ is $y \pm 10$ with even odds of $+$ or $-$, a Bayes estimate will give better optimization results by almost entirely removing the noise.
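The predictable disappointment, and how a Bayes estimate removes it, can be seen in a small simulation. The model here is an assumption chosen for simplicity and is not the example from the text: the goal $y$ is standard normal, the proxy is $x = y + \text{noise}$ with standard normal noise, and we pick the best of 100 candidates by proxy.

```python
import random

def run_trial(rng, n_candidates=100, noise_sd=1.0):
    """Pick the candidate with the best proxy x; return (x, y, Bayes estimate of y)."""
    best = None
    for _ in range(n_candidates):
        y = rng.gauss(0.0, 1.0)            # true value we care about
        x = y + rng.gauss(0.0, noise_sd)   # proxy: unbiased for y, but noisy
        if best is None or x > best[0]:
            best = (x, y)
    x, y = best
    # Posterior mean of y given x in this Gaussian model: shrink x toward the prior mean 0.
    shrink = 1.0 / (1.0 + noise_sd ** 2)
    return x, y, shrink * x

rng = random.Random(0)
trials = [run_trial(rng) for _ in range(2000)]
mean_x = sum(t[0] for t in trials) / len(trials)
mean_y = sum(t[1] for t in trials) / len(trials)
mean_bayes = sum(t[2] for t in trials) / len(trials)

print(f"proxy of winner:          {mean_x:.2f}")
print(f"true value of winner:     {mean_y:.2f}")
print(f"Bayes estimate of winner: {mean_bayes:.2f}")
```

In this particular model the Bayes estimate is a monotone function of $x$, so it picks the same winner as the raw proxy. It fixes the disappointment without improving the outcome, which matches the caveat that a better value is not guaranteed.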

Causal Good­hart hap­pens when you ob­serve a cor­re­la­tion be­tween proxy and goal, but when you in­ter­vene to in­crease the proxy, you fail to in­crease the goal be­cause the ob­served cor­re­la­tion was not causal in the right way. Teas­ing cor­re­la­tion apart from cau­sa­tion is run-of-the-mill coun­ter­fac­tual rea­son­ing.

In ex­tremal Good­hart, op­ti­miza­tion pushes you out­side the range where the cor­re­la­tion ex­ists, into por­tions of the dis­tri­bu­tion which be­have very differ­ently. This is es­pe­cially scary be­cause it tends to have phase shifts. You might not be able to ob­serve the proxy break­ing down at all when you have weak op­ti­miza­tion, but once the op­ti­miza­tion be­comes strong enough, you can en­ter a very differ­ent do­main.

Ex­tremal Good­hart is similar to re­gres­sional Good­hart, but we can’t cor­rect it with Bayes es­ti­ma­tors if we don’t have the right model—oth­er­wise, there seems to be no rea­son why the Bayes es­ti­ma­tor it­self should not be sus­cep­ti­ble to ex­tremal Good­hart.

If you have a probability distribution $Q(y)$ such that the proxy $x$ is only a boundedly bad approximation of $y$ on average, quantilization avoids extremal Goodhart by selecting randomly from $Q(y \mid x \geq c)$ for some threshold $c$. If we pick a threshold that is high but not extreme, we can hope that the risk of selecting from outliers with very different behavior will be small, and that $y$ is likely to be large.
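A rough sketch of the quantilization idea, under toy assumptions of my own: the base distribution is uniform over 1000 integer "actions", the proxy is the action's size, and the true value agrees with the proxy except at the single most extreme action, where it collapses. The functions `quantilize`, `proxy`, and `true_value` are hypothetical names.

```python
import random

def quantilize(candidates, proxy, top_fraction, rng):
    """Sample uniformly from the top `top_fraction` of candidates, ranked by proxy."""
    ranked = sorted(candidates, key=proxy, reverse=True)
    keep = max(1, int(len(ranked) * top_fraction))
    return rng.choice(ranked[:keep])

def proxy(a):
    return a  # the proxy keeps rewarding larger actions without limit

def true_value(a):
    # Assumed toy goal: tracks the proxy on ordinary actions,
    # but behaves very differently at the extreme.
    return a if a < 999 else -10_000

rng = random.Random(0)
base_samples = list(range(1000))   # stand-in for draws from a trusted base distribution

maximizer_pick = max(base_samples, key=proxy)
quantilizer_picks = [quantilize(base_samples, proxy, 0.1, rng) for _ in range(200)]
avg_quantilized = sum(true_value(a) for a in quantilizer_picks) / len(quantilizer_picks)

print(true_value(maximizer_pick))
print(avg_quantilized)
```

The maximizer always lands on the pathological action. The quantilizer only lands there occasionally, so its expected true value stays close to what the proxy promised. How much it can lose is controlled by how small the top fraction is.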

This is helpful, but un­like Bayes es­ti­ma­tors for re­gres­sional Good­hart, doesn’t nec­es­sar­ily seem like the end of the story. Maybe we can do bet­ter.

Fi­nally, there is ad­ver­sar­ial Good­hart, in which agents ac­tively make our proxy worse by in­tel­li­gently ma­nipu­lat­ing it. This is even harder to ob­serve at low lev­els of op­ti­miza­tion, both be­cause the ad­ver­saries won’t want to start ma­nipu­lat­ing un­til af­ter test time is over, and be­cause ad­ver­saries that come from the sys­tem’s own op­ti­miza­tion won’t show up un­til the op­ti­miza­tion is pow­er­ful enough.

Th­ese differ­ent types of Good­hart effects work in very differ­ent ways, and, roughly speak­ing, they tend to start ap­pear­ing at suc­ces­sively higher lev­els of op­ti­miza­tion power—so be care­ful not to think you’ve con­quered Good­hart’s law be­cause you’ve solved some of them.

#### 4.3. Stable poin­t­ers to what we value

Be­sides anti-Good­hart mea­sures, it would ob­vi­ously help to be able to spec­ify what we want pre­cisely.

Un­for­tu­nately, this is hard; so can the AI sys­tem we’re build­ing help us with this? More gen­er­ally, can a suc­ces­sor agent help its pre­de­ces­sor solve this? Maybe it can use its in­tel­lec­tual ad­van­tages to figure out what we want?

AIXI learns what to do through a re­ward sig­nal which it gets from the en­vi­ron­ment. We can imag­ine hu­mans have a but­ton which they press when AIXI does some­thing they like.

The prob­lem with this is that AIXI will ap­ply its in­tel­li­gence to the prob­lem of tak­ing con­trol of the re­ward but­ton. This is the prob­lem of wire­head­ing.

Maybe we build the re­ward but­ton into the agent, as a black box which is­sues re­wards based on what is go­ing on. The box could be an in­tel­li­gent sub-agent in its own right, which figures out what re­wards hu­mans would want to give. The box could even defend it­self by is­su­ing pun­ish­ments for ac­tions aimed at mod­ify­ing the box.

In the end, though, if the agent un­der­stands the situ­a­tion, it will be mo­ti­vated to take con­trol any­way.

There is a critical distinction between optimizing “$U()$” in quotation marks and optimizing $U()$ directly. If the agent is coming up with plans to try to achieve a high output of the box, and it incorporates into its planning its uncertainty regarding the output of the box, then it will want to hack the box. However, if you run the expected outcomes of plans through the actual box, then plans to hack the box are evaluated by the current box, so they don’t look particularly appealing.
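The distinction can be sketched with a toy planner. Everything here is a hypothetical stand-in: three plans, a `current_box` that scores outcomes, and a `hacked_box` representing what the box would report after tampering.

```python
def current_box(outcome):
    """The utility function as it exists now (assumed scores, for illustration only)."""
    return {"task done": 10, "nothing": 0, "box hacked": -5}[outcome["world"]]

def hacked_box(outcome):
    """What the box would report after being tampered with."""
    return 1_000_000

# Each plan leads to a predicted world state and to whichever box exists afterwards.
PLANS = {
    "do the task": {"world": "task done", "box_after": current_box},
    "do nothing": {"world": "nothing", "box_after": current_box},
    "hack the box": {"world": "box hacked", "box_after": hacked_box},
}

def reward_signal_score(outcome):
    # Optimizing the box's output "in quotation marks": predict the number the box
    # will emit after the plan has run, using whatever box exists at that point.
    return outcome["box_after"](outcome)

def observation_utility_score(outcome):
    # Optimizing the utility function directly: feed the predicted outcome
    # to the box we have right now.
    return current_box(outcome)

rl_choice = max(PLANS, key=lambda p: reward_signal_score(PLANS[p]))
ou_choice = max(PLANS, key=lambda p: observation_utility_score(PLANS[p]))

print(rl_choice)
print(ou_choice)
```

The only difference between the two agents is which box evaluates the plan, yet one is drawn to tampering and the other is not.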

Daniel Dewey calls the sec­ond sort of agent an ob­ser­va­tion-util­ity max­i­mizer. (Others have in­cluded ob­ser­va­tion-util­ity agents within a more gen­eral no­tion of re­in­force­ment learn­ing.)

I find it very in­ter­est­ing how you can try all sorts of things to stop an RL agent from wire­head­ing, but the agent keeps work­ing against it. Then, you make the shift to ob­ser­va­tion-util­ity agents and the prob­lem van­ishes.

It seems like the indirection itself is the problem. RL agents maximize the output of the box; observation-utility agents maximize $U()$. So the challenge is to create stable pointers to what we value: a notion of “indirection” which serves to point at values not directly available to be optimized.

Observation-utility agents solve the classic wireheading problem, but we still have the problem of specifying $U()$. So we add a level of indirection back in: we represent our uncertainty over $U()$, and try to learn. Daniel Dewey doesn’t provide any suggestion for how to do this, but CIRL is one example.

Unfortunately, the wireheading problem can come back in even worse fashion. For example, if there is a drug which modifies human preferences to only care about using the drug, a CIRL agent could be highly motivated to give humans that drug in order to make its job easier. This is called the human manipulation problem.

The les­son I want to draw from this is from “Re­in­force­ment Learn­ing with a Cor­rupted Re­ward Chan­nel” (by Tom Ever­itt et al.): the way you set up the feed­back loop makes a huge differ­ence.

They draw the fol­low­ing pic­ture:

• In Stan­dard RL, the feed­back about the value of a state comes from the state it­self, so cor­rupt states can be “self-ag­gran­diz­ing”.

• In De­cou­pled RL, the feed­back about the qual­ity of a state comes from some other state, mak­ing it pos­si­ble to learn cor­rect val­ues even when some feed­back is cor­rupt.

In some sense, the challenge is to put the origi­nal, small agent in the feed­back loop in the right way. How­ever, the prob­lems with up­date­less rea­son­ing men­tioned ear­lier make this hard; the origi­nal agent doesn’t know enough.

One way to try to ad­dress this is through in­tel­li­gence am­plifi­ca­tion: try to turn the origi­nal agent into a more ca­pa­ble one with the same val­ues, rather than cre­at­ing a suc­ces­sor agent from scratch and try­ing to get value load­ing right.

For ex­am­ple, Paul Chris­ti­ano pro­poses an ap­proach in which the small agent is simu­lated many times in a large tree, which can perform com­plex com­pu­ta­tions by split­ting prob­lems into parts.
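As a very loose illustration of the "tree of copies" idea (a toy of my own, not the actual proposal), imagine a small agent that can only add at most two numbers. Recursively delegating halves of a long list to further copies lets the tree as a whole handle a task no single copy could.

```python
def small_agent(problem):
    """A weak agent: it can only handle lists of at most two numbers directly."""
    assert len(problem) <= 2
    return sum(problem)

def amplified(problem, calls=None):
    """Answer a big problem by handing halves of it to further copies of the small agent."""
    if calls is not None:
        calls.append(len(problem))   # record the size of every sub-problem in the tree
    if len(problem) <= 2:
        return small_agent(problem)
    mid = len(problem) // 2
    left = amplified(problem[:mid], calls)
    right = amplified(problem[mid:], calls)
    # Combining two sub-answers is itself a task the small agent can do.
    return small_agent([left, right])

calls = []
total = amplified(list(range(1, 101)), calls)

print(total)
print(len(calls))   # number of nodes in the tree
```

The sketch hides the hard part: here the decomposition is obvious and every sub-problem is harmless, whereas in general the small agent has to invent the decomposition itself.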

How­ever, this is still fairly de­mand­ing for the small agent: it doesn’t just need to know how to break prob­lems down into more tractable pieces; it also needs to know how to do so with­out giv­ing rise to ma­lign sub­com­pu­ta­tions.

For ex­am­ple, since the small agent can use the copies of it­self to get a lot of com­pu­ta­tional power, it could eas­ily try to use a brute-force search for solu­tions that ends up run­ning afoul of Good­hart’s law.

This is­sue is the sub­ject of the next sec­tion: sub­sys­tem al­ign­ment.

### 5. Sub­sys­tem alignment

You want to figure some­thing out, but you don’t know how to do that yet.

You have to some­how break up the task into sub-com­pu­ta­tions. There is no atomic act of “think­ing”; in­tel­li­gence must be built up of non-in­tel­li­gent parts.

The agent be­ing made of parts is part of what made coun­ter­fac­tu­als hard, since the agent may have to rea­son about im­pos­si­ble con­figu­ra­tions of those parts.

Be­ing made of parts is what makes self-rea­son­ing and self-mod­ifi­ca­tion even pos­si­ble.

What we’re pri­mar­ily go­ing to dis­cuss in this sec­tion, though, is an­other prob­lem: when the agent is made of parts, there could be ad­ver­saries not just in the ex­ter­nal en­vi­ron­ment, but in­side the agent as well.

This cluster of prob­lems is Sub­sys­tem Align­ment: en­sur­ing that sub­sys­tems are not work­ing at cross pur­poses; avoid­ing sub­pro­cesses op­ti­miz­ing for un­in­tended goals.

• be­nign induction

• be­nign optimization

• transparency

• in­ner optimizers

#### 5.1. Ro­bust­ness to rel­a­tive scale

Here’s a straw agent de­sign:

The epistemic sub­sys­tem just wants ac­cu­rate be­liefs. The in­stru­men­tal sub­sys­tem uses those be­liefs to track how well it is do­ing. If the in­stru­men­tal sub­sys­tem gets too ca­pa­ble rel­a­tive to the epistemic sub­sys­tem, it may de­cide to try to fool the epistemic sub­sys­tem, as de­picted.

If the epistemic sub­sys­tem gets too strong, that could also pos­si­bly yield bad out­comes.

This agent de­sign treats the sys­tem’s epistemic and in­stru­men­tal sub­sys­tems as dis­crete agents with goals of their own, which is not par­tic­u­larly re­al­is­tic. How­ever, we saw in the sec­tion on wire­head­ing that the prob­lem of sub­sys­tems work­ing at cross pur­poses is hard to avoid. And this is a harder prob­lem if we didn’t in­ten­tion­ally build the rele­vant sub­sys­tems.

One rea­son to avoid boot­ing up sub-agents who want differ­ent things is that we want ro­bust­ness to rel­a­tive scale.

An ap­proach is ro­bust to scale if it still works, or fails grace­fully, as you scale ca­pa­bil­ities. There are three types: ro­bust­ness to scal­ing up; ro­bust­ness to scal­ing down; and ro­bust­ness to rel­a­tive scale.

• Ro­bust­ness to scal­ing up means that your sys­tem doesn’t stop be­hav­ing well if it gets bet­ter at op­ti­miz­ing. One way to check this is to think about what would hap­pen if the func­tion the AI op­ti­mizes were ac­tu­ally max­i­mized. Think Good­hart’s law.

• Ro­bust­ness to scal­ing down means that your sys­tem still works if made less pow­er­ful. Of course, it may stop be­ing use­ful; but it should fail safely and with­out un­nec­es­sary costs. Your sys­tem might work if it can ex­actly max­i­mize some func­tion, but is it safe if you ap­prox­i­mate? For ex­am­ple, maybe a sys­tem is safe if it can learn hu­man val­ues very pre­cisely, but ap­prox­i­ma­tion makes it in­creas­ingly mis­al­igned.

• Ro­bust­ness to rel­a­tive scale means that your de­sign does not rely on the agent’s sub­sys­tems be­ing similarly pow­er­ful. For ex­am­ple, GAN (Gen­er­a­tive Ad­ver­sar­ial Net­work) train­ing can fail if one sub-net­work gets too strong, be­cause there’s no longer any train­ing sig­nal.
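The “scaling up” check can be sketched in a few lines. This is a toy model under made-up assumptions: `true_utility`, `proxy`, and the integer search range are all hypothetical, chosen so that the proxy agrees with what we want for small values and diverges for large ones.

```python
def true_utility(x):
    # Hypothetical: more x helps at first, then does harm.
    return x - x * x / 10


def proxy(x):
    # The proxy tracks true utility for small x but keeps rising forever.
    return x


def optimize(objective, search_limit):
    """Pick the best integer candidate in [0, search_limit] under `objective`."""
    return max(range(search_limit + 1), key=objective)


for limit in (3, 20):
    choice = optimize(proxy, limit)
    print(limit, choice, true_utility(choice))
```

With a small search range, optimizing the proxy lands somewhere reasonable. With a larger range, the same design picks the extreme candidate and the true utility goes negative, which is the Goodhart-style failure the first bullet asks you to imagine.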

Lack of ro­bust­ness to scale isn’t nec­es­sar­ily some­thing which kills a pro­posal, but it is some­thing to be aware of; lack­ing ro­bust­ness to scale, you need strong rea­son to think you’re at the right scale.

Ro­bust­ness to rel­a­tive scale is par­tic­u­larly im­por­tant for sub­sys­tem al­ign­ment. An agent with in­tel­li­gent sub-parts should not rely on be­ing able to out­smart them, un­less we have a strong ac­count of why this is always pos­si­ble.

The big-pic­ture moral: aim to have a unified sys­tem that doesn’t work at cross pur­poses to it­self.

Why would any­one make an agent with parts fight­ing against one an­other? There are three ob­vi­ous rea­sons: sub­goals, poin­t­ers, and search.

Subgoals: Splitting up a task into subgoals may be the only way to efficiently find a solution. However, a subgoal computation shouldn’t completely forget the big picture!

An agent de­signed to build houses should not boot up a sub-agent who cares only about build­ing stairs.

One in­tu­itive desider­a­tum is that al­though sub­sys­tems need to have their own goals in or­der to de­com­pose prob­lems into parts, the sub­goals need to “point back” ro­bustly to the main goal.

A house-build­ing agent might spin up a sub­sys­tem that cares only about stairs, but only cares about stairs in the con­text of houses.

How­ever, you need to do this in a way that doesn’t just amount to your house-build­ing sys­tem hav­ing a sec­ond house-build­ing sys­tem in­side its head. This brings me to the next item:

Poin­t­ers: It may be difficult for sub­sys­tems to carry the whole-sys­tem goal around with them, since they need to be re­duc­ing the prob­lem. How­ever, this kind of in­di­rec­tion seems to en­courage situ­a­tions in which differ­ent sub­sys­tems’ in­cen­tives are mis­al­igned.

As we saw in the ex­am­ple of the epistemic and in­stru­men­tal sub­sys­tems, as soon as we start op­ti­miz­ing some sort of ex­pec­ta­tion, rather than di­rectly get­ting feed­back about what we’re do­ing on the met­ric that’s ac­tu­ally im­por­tant, we may cre­ate per­verse in­cen­tives—that’s Good­hart’s law.

How do we ask a sub­sys­tem to “do X” as op­posed to “con­vince the wider sys­tem that I’m do­ing X”, with­out pass­ing along the en­tire over­ar­ch­ing goal-sys­tem?

This is similar to the way we wanted suc­ces­sor agents to ro­bustly point at val­ues, since it is too hard to write val­ues down. How­ever, in this case, learn­ing the val­ues of the larger agent wouldn’t make any sense ei­ther; sub­sys­tems and sub­goals need to be smaller.

It might not be that difficult to solve sub­sys­tem al­ign­ment for sub­sys­tems which hu­mans en­tirely de­sign, or sub­goals which an AI ex­plic­itly spins up. If you know how to avoid mis­al­ign­ment by de­sign and ro­bustly del­e­gate your goals, both prob­lems seem solv­able.

How­ever, it doesn’t seem pos­si­ble to de­sign all sub­sys­tems so ex­plic­itly. At some point in solv­ing a prob­lem, you’ve split it up as much as you know how to and must rely on some trial and er­ror.

This brings us to the third rea­son sub­sys­tems might be op­ti­miz­ing differ­ent things, search: solv­ing a prob­lem by look­ing through a rich space of pos­si­bil­ities, a space which may it­self con­tain mis­al­igned sub­sys­tems.

ML re­searchers are quite fa­mil­iar with the phe­nomenon: it’s eas­ier to write a pro­gram which finds a high-perfor­mance ma­chine trans­la­tion sys­tem for you than to di­rectly write one your­self.

In the long run, this pro­cess can go one step fur­ther. For a rich enough prob­lem and an im­pres­sive enough search pro­cess, the solu­tions found via search might them­selves be in­tel­li­gently op­ti­miz­ing some­thing. This prob­lem is de­scribed in Hub­inger, et al.’s forth­com­ing “The In­ner Align­ment Prob­lem”.

Let’s call the outer search pro­cess an “outer op­ti­mizer”, and the in­ner search pro­cess an “in­ner op­ti­mizer”.

“Op­ti­miza­tion” and “search” are am­bigu­ous terms. I’ll think of them as any al­gorithm which can be nat­u­rally in­ter­preted as do­ing sig­nifi­cant com­pu­ta­tional work to “find” an ob­ject that scores highly on some ob­jec­tive func­tion.

The ob­jec­tive func­tion of the outer op­ti­mizer is not nec­es­sar­ily the same as that of the in­ner op­ti­mizer. If the outer op­ti­mizer wants to make pizza, the in­ner op­ti­mizer may en­joy knead­ing dough, chop­ping in­gre­di­ents, et cetera.

The in­ner ob­jec­tive func­tion must be helpful for the outer, at least in the ex­am­ples the outer op­ti­mizer is check­ing. Other­wise, the in­ner op­ti­mizer would not have been se­lected.

How­ever, the in­ner op­ti­mizer must re­duce the prob­lem some­how; there is no point to it run­ning the ex­act same search. So it seems like its ob­jec­tives will tend to be like good heuris­tics; eas­ier to op­ti­mize, but differ­ent from the outer ob­jec­tive in gen­eral.

Why might a differ­ence be­tween in­ner and outer ob­jec­tives be con­cern­ing, if the in­ner op­ti­mizer is scor­ing highly on the outer ob­jec­tive any­way? It’s about the in­ter­play with what’s re­ally wanted. Even if we get value speci­fi­ca­tion ex­actly right, there will always be some dis­tri­bu­tional shift be­tween the train­ing set and de­ploy­ment. (See Amodei, et al.’s “Con­crete Prob­lems in AI Safety”.)
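Here is a small sketch of the simplest version of this worry: a search that selects a heuristic which is perfect on the training examples and wrong after a shift. Everything in it is hypothetical (the corridor environments, the three candidate policies, and the search order), and the selected policy is a plain heuristic, not a capable inner optimizer.

```python
def go_left(length, goal):
    return 0                   # always ends at the left end


def go_right(length, goal):
    return length - 1          # always ends at the right end; ignores the goal


def go_to_goal(length, goal):
    return goal                # actually tracks the goal


# Hypothetical search order: "simpler" policies are tried first.
CANDIDATES = [go_left, go_right, go_to_goal]


def score(policy, envs):
    """Outer objective: fraction of environments where the policy ends on the goal."""
    return sum(policy(length, goal) == goal for length, goal in envs) / len(envs)


def outer_search(train_envs):
    # max() returns the first candidate with the top score, so ties go to
    # whichever policy the search happens to reach first.
    return max(CANDIDATES, key=lambda p: score(p, train_envs))


train_envs = [(5, 4), (8, 7), (10, 9)]   # goal always at the right end
deploy_envs = [(5, 2), (8, 0), (10, 9)]  # goal can be anywhere

selected = outer_search(train_envs)
print(selected.__name__, score(selected, train_envs), score(selected, deploy_envs))
```

On the training environments, “go right” and “go to the goal” are indistinguishable, so the outer search has no reason to prefer the one we wanted. The difference only shows up once the goal moves.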

Distri­bu­tional shifts which would be small in or­di­nary cases may make a big differ­ence to a ca­pa­ble in­ner op­ti­mizer, which may ob­serve the slight differ­ence and figure out how to cap­i­tal­ize on it for its own ob­jec­tive.

Ac­tu­ally, to even use the term “dis­tri­bu­tional shift” seems wrong in the con­text of em­bed­ded agency. The world is not i.i.d. The ana­log of “no dis­tri­bu­tional shift” would be to have an ex­act model of the whole fu­ture rele­vant to what you want to op­ti­mize, and the abil­ity to run it over and over dur­ing train­ing. So we need to deal with mas­sive “dis­tri­bu­tional shift”.

We may also want to optimize for things that aren’t exactly what we want. The obvious way to avoid agents that pursue subgoals at the cost of the overall goal is to have the subsystems not be agentic: just search over a bunch of ways to make stairs; don’t make something that cares about stairs. The problem is then that powerful inner optimizers are optimizing something the outer system doesn’t care about, and that the inner optimizers will have a convergent incentive to be agentic.

Ad­di­tion­ally, there’s the pos­si­bil­ity that the in­ner op­ti­mizer be­comes aware of the outer op­ti­mizer, in which case it might start ex­plic­itly try­ing to do well on the outer ob­jec­tive func­tion in or­der to be kept around, while look­ing for any signs that it has left train­ing and can stop pre­tend­ing.

This is the same story we saw in ad­ver­sar­ial Good­hart: there is some­thing agen­tic in the search space, which re­sponds to our choice of proxy in a way which makes our proxy a bad one.

If in­tel­li­gent in­ner op­ti­miz­ers de­vel­op­ing in deep neu­ral net­work train­ing seems too hy­po­thet­i­cal, con­sider the evolu­tion of life on Earth. Evolu­tion can be thought of as a re­pro­duc­tive fit­ness max­i­mizer.

(Evolu­tion can ac­tu­ally be thought of as an op­ti­mizer for many things, or as no op­ti­mizer at all, but that doesn’t mat­ter. The point is that if an agent wanted to max­i­mize re­pro­duc­tive fit­ness, it might use a sys­tem that looked like evolu­tion.)

In­tel­li­gent or­ganisms are in­ner op­ti­miz­ers of evolu­tion. Although the drives of in­tel­li­gent or­ganisms are cer­tainly cor­re­lated with re­pro­duc­tive fit­ness, or­ganisms want all sorts of things. There are even in­ner op­ti­miz­ers who have come to un­der­stand evolu­tion, and even to ma­nipu­late it at times. Pow­er­ful and mis­al­igned in­ner op­ti­miz­ers ap­pear to be a real pos­si­bil­ity, then, at least with enough pro­cess­ing power.

Prob­lems seem to arise be­cause you try to solve a prob­lem which you don’t yet know how to solve by search­ing over a large space and hop­ing “some­one” can solve it.

If the source of the is­sue is the solu­tion of prob­lems by mas­sive search, per­haps we should look for differ­ent ways to solve prob­lems. Per­haps we should solve prob­lems by figur­ing things out. But how do you solve prob­lems which you don’t yet know how to solve other than by try­ing things?

Let’s take a step back.

Embed­ded world-mod­els is about how to think at all, as an em­bed­ded agent; de­ci­sion the­ory is about how to act. Ro­bust del­e­ga­tion is about build­ing trust­wor­thy suc­ces­sors and helpers. Sub­sys­tem al­ign­ment is about build­ing one agent out of trust­wor­thy parts.

The prob­lem is that:

• We don’t know how to think about en­vi­ron­ments when we’re smaller.

• To the ex­tent we can do that, we don’t know how to think about con­se­quences of ac­tions in those en­vi­ron­ments.

• Even when we can do that, we don’t know how to think about what we want.

• Even when we have none of these prob­lems, we don’t know how to re­li­ably out­put ac­tions which get us what we want!

### 6. Con­clud­ing thoughts

A fi­nal word on cu­ri­os­ity, and in­tel­lec­tual puz­zles:

I de­scribed an em­bed­ded agent, Emmy, and said that I don’t un­der­stand how she eval­u­ates her op­tions, mod­els the world, mod­els her­self, or de­com­poses and solves prob­lems.

In the past, when re­searchers have talked about mo­ti­va­tions for work­ing on prob­lems like these, they’ve gen­er­ally fo­cused on the mo­ti­va­tion from AI risk. AI re­searchers want to build ma­chines that can solve prob­lems in the gen­eral-pur­pose fash­ion of a hu­man, and du­al­ism is not a re­al­is­tic frame­work for think­ing about such sys­tems. In par­tic­u­lar, it’s an ap­prox­i­ma­tion that’s es­pe­cially prone to break­ing down as AI sys­tems get smarter. When peo­ple figure out how to build gen­eral AI sys­tems, we want those re­searchers to be in a bet­ter po­si­tion to un­der­stand their sys­tems, an­a­lyze their in­ter­nal prop­er­ties, and be con­fi­dent in their fu­ture be­hav­ior.

This is the mo­ti­va­tion for most re­searchers to­day who are work­ing on things like up­date­less de­ci­sion the­ory and sub­sys­tem al­ign­ment. We care about ba­sic con­cep­tual puz­zles which we think we need to figure out in or­der to achieve con­fi­dence in fu­ture AI sys­tems, and not have to rely quite so much on brute-force search or trial and er­ror.

But the ar­gu­ments for why we may or may not need par­tic­u­lar con­cep­tual in­sights in AI are pretty long. I haven’t tried to wade into the de­tails of that de­bate here. In­stead, I’ve been dis­cussing a par­tic­u­lar set of re­search di­rec­tions as an in­tel­lec­tual puz­zle, and not as an in­stru­men­tal strat­egy.

One down­side of dis­cussing these prob­lems as in­stru­men­tal strate­gies is that it can lead to some mi­s­un­der­stand­ings about why we think this kind of work is so im­por­tant. With the “in­stru­men­tal strate­gies” lens, it’s tempt­ing to draw a di­rect line from a given re­search prob­lem to a given safety con­cern. But it’s not that I’m imag­in­ing real-world em­bed­ded sys­tems be­ing “too Bayesian” and this some­how caus­ing prob­lems, if we don’t figure out what’s wrong with cur­rent mod­els of ra­tio­nal agency. It’s cer­tainly not that I’m imag­in­ing fu­ture AI sys­tems be­ing writ­ten in sec­ond-or­der logic! In most cases, I’m not try­ing at all to draw di­rect lines be­tween re­search prob­lems and spe­cific AI failure modes.

What I’m in­stead think­ing about is this: We sure do seem to be work­ing with the wrong ba­sic con­cepts to­day when we try to think about what agency is, as seen by the fact that these con­cepts don’t trans­fer well to the more re­al­is­tic em­bed­ded frame­work.

If AI de­vel­op­ers in the fu­ture are still work­ing with these con­fused and in­com­plete ba­sic con­cepts as they try to ac­tu­ally build pow­er­ful real-world op­ti­miz­ers, that seems like a bad po­si­tion to be in. And it seems like the re­search com­mu­nity is un­likely to figure most of this out by de­fault in the course of just try­ing to de­velop more ca­pa­ble sys­tems. Evolu­tion cer­tainly figured out how to build hu­man brains with­out “un­der­stand­ing” any of this, via brute-force search.

Embed­ded agency is my way of try­ing to point at what I think is a very im­por­tant and cen­tral place where I feel con­fused, and where I think fu­ture re­searchers risk run­ning into con­fu­sions too.

There’s also a lot of ex­cel­lent AI al­ign­ment re­search that’s be­ing done with an eye to­ward more di­rect ap­pli­ca­tions; but I think of that safety re­search as hav­ing a differ­ent type sig­na­ture than the puz­zles I’ve talked about here.

In­tel­lec­tual cu­ri­os­ity isn’t the ul­ti­mate rea­son we priv­ilege these re­search di­rec­tions. But there are some prac­ti­cal ad­van­tages to ori­ent­ing to­ward re­search ques­tions from a place of cu­ri­os­ity at times, as op­posed to only ap­ply­ing the “prac­ti­cal im­pact” lens to how we think about the world.

When we ap­ply the cu­ri­os­ity lens to the world, we ori­ent to­ward the sources of con­fu­sion pre­vent­ing us from see­ing clearly; the blank spots in our map, the flaws in our lens. It en­courages re-check­ing as­sump­tions and at­tend­ing to blind spots, which is helpful as a psy­cholog­i­cal coun­ter­point to our “in­stru­men­tal strat­egy” lens—the lat­ter be­ing more vuln­er­a­ble to the urge to lean on what­ever shaky premises we have on hand so we can get to more solidity and clo­sure in our early think­ing.

Embed­ded agency is an or­ga­niz­ing theme be­hind most, if not all, of our big cu­ri­osi­ties. It seems like a cen­tral mys­tery un­der­ly­ing many con­crete difficul­ties.

• The above is the full Embed­ded Agency se­quence, cross-posted from the MIRI web­site so that it’s eas­ier to find the text ver­sion on AIAF/​LW (via search, se­quences, au­thor pages, etc.).

Scott and Abram have added a new sec­tion on self-refer­ence to the se­quence since it was first posted, and slightly ex­panded the sub­se­quent sec­tion on log­i­cal un­cer­tainty and the start of the ro­bust del­e­ga­tion sec­tion.

• Promoted to curated: I think the content of this sequence is quite important, both for rationality and AI Alignment. I also quite appreciate the care that went into the presentation, and think the whole text is a prime example of a text that really focuses on explaining things, instead of trying to persuade the reader of a conclusion.

I also think the last section generalizes quite well to domains other than AI Alignment. I think a lot of the best science looks like looking for fundamental confusions, in the way this sequence is doing it, and I would love to see more posts in its style for domains like Psychology, Economics and individual rationality.

• I’m pretty im­pressed by this, and es­pe­cially the con­tent on em­bed­ded agents causes me to up­date in the di­rec­tion of think­ing MIRI re­searchers are less con­fused about cer­tain is­sues of episte­mol­ogy than I pre­vi­ously thought. I would have framed some of these is­sues differ­ently, but over­all I can com­plain far less than I have in the past based on what I’ve read here.

• Concerning the 5 and 10 problem: I’m curious whether any work has been done on resolving this by using a weaker logic. I’m not a logician, but a relevance logic seems worth looking into. At least on the face of it, taking away the principle of explosion is a step towards making the mentioned P, “if the agent outputs 5 the universe outputs 5, and if the agent outputs 10 the universe outputs 0”, unprovable.

I’d be in­ter­ested in any other work on the 5 and 10 prob­lem also.

• You may want to add MIRI’s Botworld 1.0 project to the bibliography, so that people looking into this don’t duplicate the idea.

• Peo­ple of­ten try to solve the prob­lem of coun­ter­fac­tu­als by sug­gest­ing that there will always be some un­cer­tainty. An AI may know its source code perfectly, but it can’t perfectly know the hard­ware it is run­ning on.

How could Emmy, an em­bed­ded agent, know its source code perfectly, or even be cer­tain that it is a com­put­ing de­vice un­der the Church-Tur­ing defi­ni­tion? Such cer­tainty would seem dog­matic. Without such cer­tainty, the choice of 10 rather than 5 can­not be firmly clas­sified as an er­ror. (The clas­sifi­ca­tion as an er­ror seemed to play an im­por­tant role in your dis­cus­sion.) So Emmy has a mo­ti­va­tion to keep look­ing and find that U(10)=10.

• Very sur­prised that Emmy is not treated as an agent driven by a (pre­dic­tive) model of causal re­la­tion­ships. How else could an em­bod­ied agent pos­si­bly func­tion? Also sur­prised that Pearl’s sem­i­nal work on Causal­ity (incl. Coun­ter­fac­tu­als) is not cited.