# Embedded Agency (full-text version)

Sup­pose you want to build a robot to achieve some real-world goal for you—a goal that re­quires the robot to learn for it­self and figure out a lot of things that you don’t already know.

There’s a com­pli­cated en­g­ineer­ing prob­lem here. But there’s also a prob­lem of figur­ing out what it even means to build a learn­ing agent like that. What is it to op­ti­mize re­al­is­tic goals in phys­i­cal en­vi­ron­ments? In broad terms, how does it work?

In this post, I’ll point to four ways we don’t cur­rently know how it works, and four ar­eas of ac­tive re­search aimed at figur­ing it out.

### 1. Embed­ded agents

This is Alexei, and Alexei is play­ing a video game.

Like most games, this game has clear in­put and out­put chan­nels. Alexei only ob­serves the game through the com­puter screen, and only ma­nipu­lates the game through the con­trol­ler.

The game can be thought of as a func­tion which takes in a se­quence of but­ton presses and out­puts a se­quence of pix­els on the screen.

Alexei is also very smart, and ca­pa­ble of hold­ing the en­tire video game in­side his mind. If Alexei has any un­cer­tainty, it is only over em­piri­cal facts like what game he is play­ing, and not over log­i­cal facts like which in­puts (for a given de­ter­minis­tic game) will yield which out­puts. This means that Alexei must also store in­side his mind ev­ery pos­si­ble game he could be play­ing.

Alexei does not, however, have to think about himself. He is only optimizing the game he is playing, and not optimizing the brain he is using to think about the game. He may still choose actions based on value of information, but this is only to help him rule out possible games he is playing, and not to change the way in which he thinks.

In fact, Alexei can treat him­self as an un­chang­ing in­di­visi­ble atom. Since he doesn’t ex­ist in the en­vi­ron­ment he’s think­ing about, Alexei doesn’t worry about whether he’ll change over time, or about any sub­rou­tines he might have to run.

No­tice that all the prop­er­ties I talked about are par­tially made pos­si­ble by the fact that Alexei is cleanly sep­a­rated from the en­vi­ron­ment that he is op­ti­miz­ing.

This is Emmy. Emmy is play­ing real life.

Real life is not like a video game. The differ­ences largely come from the fact that Emmy is within the en­vi­ron­ment that she is try­ing to op­ti­mize.

Alexei sees the uni­verse as a func­tion, and he op­ti­mizes by choos­ing in­puts to that func­tion that lead to greater re­ward than any of the other pos­si­ble in­puts he might choose. Emmy, on the other hand, doesn’t have a func­tion. She just has an en­vi­ron­ment, and this en­vi­ron­ment con­tains her.

Emmy wants to choose the best pos­si­ble ac­tion, but which ac­tion Emmy chooses to take is just an­other fact about the en­vi­ron­ment. Emmy can rea­son about the part of the en­vi­ron­ment that is her de­ci­sion, but since there’s only one ac­tion that Emmy ends up ac­tu­ally tak­ing, it’s not clear what it even means for Emmy to “choose” an ac­tion that is bet­ter than the rest.

Alexei can poke the uni­verse and see what hap­pens. Emmy is the uni­verse pok­ing it­self. In Emmy’s case, how do we for­mal­ize the idea of “choos­ing” at all?

To make mat­ters worse, since Emmy is con­tained within the en­vi­ron­ment, Emmy must also be smaller than the en­vi­ron­ment. This means that Emmy is in­ca­pable of stor­ing ac­cu­rate de­tailed mod­els of the en­vi­ron­ment within her mind.

This causes a prob­lem: Bayesian rea­son­ing works by start­ing with a large col­lec­tion of pos­si­ble en­vi­ron­ments, and as you ob­serve facts that are in­con­sis­tent with some of those en­vi­ron­ments, you rule them out. What does rea­son­ing look like when you’re not even ca­pa­ble of stor­ing a sin­gle valid hy­poth­e­sis for the way the world works? Emmy is go­ing to have to use a differ­ent type of rea­son­ing, and make up­dates that don’t fit into the stan­dard Bayesian frame­work.

Since Emmy is within the en­vi­ron­ment that she is ma­nipu­lat­ing, she is also go­ing to be ca­pa­ble of self-im­prove­ment. But how can Emmy be sure that as she learns more and finds more and more ways to im­prove her­self, she only changes her­self in ways that are ac­tu­ally helpful? How can she be sure that she won’t mod­ify her origi­nal goals in un­de­sir­able ways?

Fi­nally, since Emmy is con­tained within the en­vi­ron­ment, she can’t treat her­self like an atom. She is made out of the same pieces that the rest of the en­vi­ron­ment is made out of, which is what causes her to be able to think about her­self.

In ad­di­tion to haz­ards in her ex­ter­nal en­vi­ron­ment, Emmy is go­ing to have to worry about threats com­ing from within. While op­ti­miz­ing, Emmy might spin up other op­ti­miz­ers as sub­rou­tines, ei­ther in­ten­tion­ally or un­in­ten­tion­ally. Th­ese sub­sys­tems can cause prob­lems if they get too pow­er­ful and are un­al­igned with Emmy’s goals. Emmy must figure out how to rea­son with­out spin­ning up in­tel­li­gent sub­sys­tems, or oth­er­wise figure out how to keep them weak, con­tained, or al­igned fully with her goals.

#### 1.1. Dual­is­tic agents

Emmy is con­fus­ing, so let’s go back to Alexei. Mar­cus Hut­ter’s AIXI frame­work gives a good the­o­ret­i­cal model for how agents like Alexei work:

The model has an agent and an environment that interact using actions, observations, and rewards. The agent sends out an action $a$, and then the environment sends out both an observation $o$ and a reward $r$. This process repeats at each time step $k$.

Each ac­tion is a func­tion of all the pre­vi­ous ac­tion-ob­ser­va­tion-re­ward triples. And each ob­ser­va­tion and re­ward is similarly a func­tion of these triples and the im­me­di­ately pre­ced­ing ac­tion.

You can imagine an agent in this framework that has full knowledge of the environment that it’s interacting with. However, AIXI is used to model optimization under uncertainty about the environment. AIXI has a distribution over all possible computable environments $q$, and chooses actions that lead to a high expected reward under this distribution. Since it also cares about future reward, this may lead to exploring for value of information.
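AIXI itself is uncomputable, but the general shape of "hold a distribution over environments, rule out the ones that contradict what you observe, and act to maximize expected reward" can be sketched on a tiny finite hypothesis class. Everything named here (`ENVIRONMENTS`, `posterior_update`, `best_action`, and the two toy games) is an assumption made for illustration, and the one-step lookahead is a deliberate simplification of AIXI's planning over future rewards.

```python
# Toy dualistic agent: a finite stand-in for a mixture over environments.
# Each hypothetical environment maps an action to (observation, reward).
ENVIRONMENTS = {
    "left_pays": {"left": ("win", 1.0), "right": ("lose", 0.0)},
    "right_pays": {"left": ("lose", 0.0), "right": ("win", 1.0)},
}


def posterior_update(prior, action, observation):
    """Zero out environments that contradict the observation, then renormalize.

    Assumes at least one hypothesis is consistent with what was observed.
    """
    weights = {
        name: p if ENVIRONMENTS[name][action][0] == observation else 0.0
        for name, p in prior.items()
    }
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def best_action(belief, actions=("left", "right")):
    """Pick the action with the highest one-step expected reward."""
    def expected_reward(action):
        return sum(p * ENVIRONMENTS[name][action][1] for name, p in belief.items())
    return max(actions, key=expected_reward)


belief = {"left_pays": 0.5, "right_pays": 0.5}
# The true environment sits outside the agent; the agent only sees its outputs.
true_env = ENVIRONMENTS["right_pays"]
first_action = "left"
observation, reward = true_env[first_action]
belief = posterior_update(belief, first_action, observation)
print(belief, best_action(belief))
```

Note that the agent here never appears inside `ENVIRONMENTS`: the clean separation between the two is exactly the dualistic assumption.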

Un­der some as­sump­tions, we can show that AIXI does rea­son­ably well in all com­putable en­vi­ron­ments, in spite of its un­cer­tainty. How­ever, while the en­vi­ron­ments that AIXI is in­ter­act­ing with are com­putable, AIXI it­self is un­com­putable. The agent is made out of a differ­ent sort of stuff, a more pow­er­ful sort of stuff, than the en­vi­ron­ment.

We will call agents like AIXI and Alexei “du­al­is­tic.” They ex­ist out­side of their en­vi­ron­ment, with only set in­ter­ac­tions be­tween agent-stuff and en­vi­ron­ment-stuff. They re­quire the agent to be larger than the en­vi­ron­ment, and don’t tend to model self-refer­en­tial rea­son­ing, be­cause the agent is made of differ­ent stuff than what the agent rea­sons about.

AIXI is not alone. Th­ese du­al­is­tic as­sump­tions show up all over our cur­rent best the­o­ries of ra­tio­nal agency.

I set up AIXI as a bit of a foil, but AIXI can also be used as in­spira­tion. When I look at AIXI, I feel like I re­ally un­der­stand how Alexei works. This is the kind of un­der­stand­ing that I want to also have for Emmy.

Un­for­tu­nately, Emmy is con­fus­ing. When I talk about want­ing to have a the­ory of “em­bed­ded agency,” I mean I want to be able to un­der­stand the­o­ret­i­cally how agents like Emmy work. That is, agents that are em­bed­ded within their en­vi­ron­ment and thus:

• do not have well-defined i/​o chan­nels;

• are smaller than their en­vi­ron­ment;

• are able to rea­son about them­selves and self-im­prove;

• and are made of parts similar to the en­vi­ron­ment.

You shouldn’t think of these four com­pli­ca­tions as a par­ti­tion. They are very en­tan­gled with each other.

For example, the reason the agent is able to self-improve is that it is made of parts. And any time the environment is sufficiently larger than the agent, it might contain other copies of the agent, and thus destroy any well-defined i/o channels.

How­ever, I will use these four com­pli­ca­tions to in­spire a split of the topic of em­bed­ded agency into four sub­prob­lems. Th­ese are: de­ci­sion the­ory, em­bed­ded world-mod­els, ro­bust del­e­ga­tion, and sub­sys­tem al­ign­ment.

#### 1.2. Embed­ded subproblems

De­ci­sion the­ory is all about em­bed­ded op­ti­miza­tion.

The simplest model of dualistic optimization is $\mathrm{argmax}$. $\mathrm{argmax}$ takes in a function from actions to rewards, and returns the action which leads to the highest reward under this function. Most optimization can be thought of as some variant on this. You have some space; you have a function from this space to some score, like a reward or utility; and you want to choose an input that scores highly under this function.
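The dualistic picture of optimization described here, where the environment is handed over as a function and the optimizer picks the best input, fits in a few lines. This is a minimal sketch; the `payout` table and the action names are hypothetical.

```python
def argmax(actions, reward):
    """Return the action that scores highest under the function `reward`."""
    return max(actions, key=reward)


# Hypothetical environment, given to the optimizer as a function of the action.
payout = {"take_five": 5, "take_ten": 10}
choice = argmax(payout.keys(), lambda action: payout[action])
print(choice)  # take_ten
```

The sketch only type-checks because `reward` is a function of the action alone, which is the assumption an embedded agent cannot make.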

But we just said that a large part of what it means to be an em­bed­ded agent is that you don’t have a func­tional en­vi­ron­ment. So now what do we do? Op­ti­miza­tion is clearly an im­por­tant part of agency, but we can’t cur­rently say what it is even in the­ory with­out mak­ing ma­jor type er­rors.

Some ma­jor open prob­lems in de­ci­sion the­ory in­clude:

• log­i­cal coun­ter­fac­tu­als: how do you rea­son about what would hap­pen if you take ac­tion B, given that you can prove that you will in­stead take ac­tion A?

• en­vi­ron­ments that in­clude mul­ti­ple copies of the agent, or trust­wor­thy pre­dic­tions of the agent.

• log­i­cal up­date­less­ness, which is about how to com­bine the very nice but very Bayesian world of Wei Dai’s up­date­less de­ci­sion the­ory, with the much less Bayesian world of log­i­cal un­cer­tainty.

Embed­ded world-mod­els is about how you can make good mod­els of the world that are able to fit within an agent that is much smaller than the world.

This has proven to be very difficult—first, be­cause it means that the true uni­verse is not in your hy­poth­e­sis space, which ru­ins a lot of the­o­ret­i­cal guaran­tees; and sec­ond, be­cause it means we’re go­ing to have to make non-Bayesian up­dates as we learn, which also ru­ins a bunch of the­o­ret­i­cal guaran­tees.

It is also about how to make world-mod­els from the point of view of an ob­server on the in­side, and re­sult­ing prob­lems such as an­throp­ics. Some ma­jor open prob­lems in em­bed­ded world-mod­els in­clude:

• log­i­cal un­cer­tainty, which is about how to com­bine the world of logic with the world of prob­a­bil­ity.

• multi-level mod­el­ing, which is about how to have mul­ti­ple mod­els of the same world at differ­ent lev­els of de­scrip­tion, and tran­si­tion nicely be­tween them.

• on­tolog­i­cal crises, which is what to do when you re­al­ize that your model, or even your goal, was speci­fied us­ing a differ­ent on­tol­ogy than the real world.

Ro­bust del­e­ga­tion is all about a spe­cial type of prin­ci­pal-agent prob­lem. You have an ini­tial agent that wants to make a more in­tel­li­gent suc­ces­sor agent to help it op­ti­mize its goals. The ini­tial agent has all of the power, be­cause it gets to de­cide ex­actly what suc­ces­sor agent to make. But in an­other sense, the suc­ces­sor agent has all of the power, be­cause it is much, much more in­tel­li­gent.

From the point of view of the ini­tial agent, the ques­tion is about cre­at­ing a suc­ces­sor that will ro­bustly not use its in­tel­li­gence against you. From the point of view of the suc­ces­sor agent, the ques­tion is about, “How do you ro­bustly learn or re­spect the goals of some­thing that is stupid, ma­nipu­la­ble, and not even us­ing the right on­tol­ogy?”

There are ex­tra prob­lems com­ing from the Löbian ob­sta­cle mak­ing it im­pos­si­ble to con­sis­tently trust things that are more pow­er­ful than you.

You can think about these prob­lems in the con­text of an agent that’s just learn­ing over time, or in the con­text of an agent mak­ing a sig­nifi­cant self-im­prove­ment, or in the con­text of an agent that’s just try­ing to make a pow­er­ful tool.

The ma­jor open prob­lems in ro­bust del­e­ga­tion in­clude:

• Vingean re­flec­tion, which is about how to rea­son about and trust agents that are much smarter than you, in spite of the Löbian ob­sta­cle to trust.

• value learn­ing, which is how the suc­ces­sor agent can learn the goals of the ini­tial agent in spite of that agent’s stu­pidity and in­con­sis­ten­cies.

• cor­rigi­bil­ity, which is about how an ini­tial agent can get a suc­ces­sor agent to al­low (or even help with) mod­ifi­ca­tions, in spite of an in­stru­men­tal in­cen­tive not to.

Sub­sys­tem al­ign­ment is about how to be one unified agent that doesn’t have sub­sys­tems that are fight­ing against ei­ther you or each other.

When an agent has a goal, like “sav­ing the world,” it might end up spend­ing a large amount of its time think­ing about a sub­goal, like “mak­ing money.” If the agent spins up a sub-agent that is only try­ing to make money, there are now two agents that have differ­ent goals, and this leads to a con­flict. The sub-agent might sug­gest plans that look like they only make money, but ac­tu­ally de­stroy the world in or­der to make even more money.

The prob­lem is: you don’t just have to worry about sub-agents that you in­ten­tion­ally spin up. You also have to worry about spin­ning up sub-agents by ac­ci­dent. Any time you perform a search or an op­ti­miza­tion over a suffi­ciently rich space that’s able to con­tain agents, you have to worry about the space it­self do­ing op­ti­miza­tion. This op­ti­miza­tion may not be ex­actly in line with the op­ti­miza­tion the outer sys­tem was try­ing to do, but it will have an in­stru­men­tal in­cen­tive to look like it’s al­igned.

A lot of op­ti­miza­tion in prac­tice uses this kind of pass­ing the buck. You don’t just find a solu­tion; you find a thing that is able to it­self search for a solu­tion.

In the­ory, I don’t un­der­stand how to do op­ti­miza­tion at all—other than meth­ods that look like find­ing a bunch of stuff that I don’t un­der­stand, and see­ing if it ac­com­plishes my goal. But this is ex­actly the kind of thing that’s most prone to spin­ning up ad­ver­sar­ial sub­sys­tems.

The big open prob­lem in sub­sys­tem al­ign­ment is about how to have a base-level op­ti­mizer that doesn’t spin up ad­ver­sar­ial op­ti­miz­ers. You can break this prob­lem up fur­ther by con­sid­er­ing cases where the re­sul­tant op­ti­miz­ers are ei­ther in­ten­tional or un­in­ten­tional, and con­sid­er­ing re­stricted sub­classes of op­ti­miza­tion, like in­duc­tion.

But re­mem­ber: de­ci­sion the­ory, em­bed­ded world-mod­els, ro­bust del­e­ga­tion, and sub­sys­tem al­ign­ment are not four sep­a­rate prob­lems. They’re all differ­ent sub­prob­lems of the same unified con­cept that is em­bed­ded agency.

### 2. De­ci­sion theory

Decision theory and artificial intelligence typically try to compute something resembling

$$\underset{a \,\in\, \mathrm{Actions}}{\mathrm{argmax}} \; f(a).$$

I.e., maximize some function of the action. This tends to assume that we can detangle things enough to see outcomes as a function of actions.

For ex­am­ple, AIXI rep­re­sents the agent and the en­vi­ron­ment as sep­a­rate units which in­ter­act over time through clearly defined i/​o chan­nels, so that it can then choose ac­tions max­i­miz­ing re­ward.

When the agent model is a part of the en­vi­ron­ment model, it can be sig­nifi­cantly less clear how to con­sider tak­ing al­ter­na­tive ac­tions.

For ex­am­ple, be­cause the agent is smaller than the en­vi­ron­ment, there can be other copies of the agent, or things very similar to the agent. This leads to con­tentious de­ci­sion-the­ory prob­lems such as the Twin Pri­soner’s Dilemma and New­comb’s prob­lem.

If Emmy Model 1 and Emmy Model 2 have had the same ex­pe­riences and are run­ning the same source code, should Emmy Model 1 act like her de­ci­sions are steer­ing both robots at once? Depend­ing on how you draw the bound­ary around “your­self”, you might think you con­trol the ac­tion of both copies, or only your own.
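How much depends on where the boundary is drawn can be seen in a small sketch of the Twin Prisoner's Dilemma. The payoff numbers are hypothetical (only their ordering matters), and the two functions are illustrative names for the two ways of drawing the boundary, not established decision procedures.

```python
# Hypothetical payoffs for one copy; higher is better.
PAYOFF = {
    ("cooperate", "cooperate"): 2,
    ("cooperate", "defect"): 0,
    ("defect", "cooperate"): 3,
    ("defect", "defect"): 1,
}
ACTIONS = ("cooperate", "defect")


def best_if_twin_is_fixed(twin_action):
    """Boundary around one copy: the twin's move is just part of the environment."""
    return max(ACTIONS, key=lambda a: PAYOFF[(a, twin_action)])


def best_if_twin_mirrors():
    """Boundary around both copies: same code and same experiences, so same move."""
    return max(ACTIONS, key=lambda a: PAYOFF[(a, a)])


print({t: best_if_twin_is_fixed(t) for t in ACTIONS})
print(best_if_twin_mirrors())
```

Under the first framing defection looks best whatever the twin does; under the second, only the two matching outcomes are on the table and cooperation wins.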

This is an in­stance of the prob­lem of coun­ter­fac­tual rea­son­ing: how do we eval­u­ate hy­po­thet­i­cals like “What if the sun sud­denly went out”?

Prob­lems of adapt­ing de­ci­sion the­ory to em­bed­ded agents in­clude:

• counterfactuals

• New­comblike rea­son­ing, in which the agent in­ter­acts with copies of itself

• ex­tor­tion problems

• co­or­di­na­tion problems

• log­i­cal counterfactuals

• log­i­cal updatelessness

#### 2.1. Ac­tion counterfactuals

The most cen­tral ex­am­ple of why agents need to think about coun­ter­fac­tu­als comes from coun­ter­fac­tu­als about their own ac­tions.

The difficulty with action counterfactuals can be illustrated by the five-and-ten problem. Suppose we have the option of taking a five dollar bill or a ten dollar bill, and all we care about in the situation is how much money we get. Obviously, we should take the $10. However, it is not so easy as it seems to reliably take the $10.

If you rea­son about your­self as just an­other part of the en­vi­ron­ment, then you can know your own be­hav­ior. If you can know your own be­hav­ior, then it be­comes difficult to rea­son about what would hap­pen if you be­haved differ­ently.

This throws a monkey wrench into many common reasoning methods. How do we formalize the idea “Taking the $10 would lead to good consequences, while taking the $5 would lead to bad consequences,” when sufficiently rich self-knowledge would reveal one of those scenarios as inconsistent?

And if we can’t formalize any idea like that, how do real-world agents figure out to take the $10 anyway?

If we try to calculate the expected utility of our actions by Bayesian conditioning, as is common, knowing our own behavior leads to a divide-by-zero error when we try to calculate the expected utility of actions we know we don’t take: $\neg A$ implies $P(A) = 0$, which implies $P(B \,\&\, A) = 0$, which implies

$$P(B \mid A) = \frac{P(B \,\&\, A)}{P(A)} = \frac{0}{0}.$$

Because the agent doesn’t know how to separate itself from the environment, it gets gnashing internal gears when it tries to imagine taking different actions.

But the biggest complication comes from Löb’s Theorem, which can make otherwise reasonable-looking agents take the $5 because “If I take the $10, I get $0”! And in a stable way: the problem can’t be solved by the agent learning or thinking about the problem more.
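The divide-by-zero failure of naive conditioning is easy to reproduce. This is a minimal sketch assuming an agent that is certain it takes the $10; the variable names are made up for the example.

```python
from fractions import Fraction


def conditional(joint, marginal):
    """Naive Bayesian conditioning: P(B | A) = P(B and A) / P(A)."""
    return joint / marginal


# The agent is certain it takes the $10, so "take the $5" has probability zero,
# and so does any conjunction that includes it.
p_take_five = Fraction(0)
p_take_five_and_get_five = Fraction(0)

try:
    conditional(p_take_five_and_get_five, p_take_five)
    outcome = "defined"
except ZeroDivisionError:
    outcome = "undefined"
print(outcome)
```

Conditioning on the action the agent does take is unproblematic; it is only the counterfactual branch that has no value.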

This might be hard to be­lieve; so let’s look at a de­tailed ex­am­ple. The phe­nomenon can be illus­trated by the be­hav­ior of sim­ple logic-based agents rea­son­ing about the five-and-ten prob­lem.

Con­sider this ex­am­ple:

We have the source code for an agent and the uni­verse. They can re­fer to each other through the use of quin­ing. The uni­verse is sim­ple; the uni­verse just out­puts what­ever the agent out­puts.

The agent spends a long time searching for proofs about what happens if it takes various actions. If for some $x$ and $y$ equal to 0, 5, or 10, it finds a proof that taking the 5 leads to $x$ utility, that taking the 10 leads to $y$ utility, and that $x > y$, it will naturally take the 5. We expect that it won’t find such a proof, and will instead pick the default action of taking the 10.

It seems easy when you just imagine an agent trying to reason about the universe. Yet it turns out that if the amount of time spent searching for proofs is enough, the agent will always choose 5!

The proof that this is so is by Löb’s theorem. Löb’s theorem says that, for any proposition $P$, if you can prove that a proof of $P$ would imply the truth of $P$, then you can prove $P$. In symbols, with “$\Box X$” meaning “$X$ is provable”:

$$\Box(\Box P \to P) \to \Box P.$$

In the version of the five-and-ten problem I gave, “$P$” is the proposition “if the agent outputs 5 the universe outputs 5, and if the agent outputs 10 the universe outputs 0”.

Supposing it is provable, the agent will eventually find the proof, and return 5 in fact. This makes the sentence true, since the agent outputs 5 and the universe outputs 5, and since it’s false that the agent outputs 10. This is because false propositions like “the agent outputs 10” imply everything, including the universe outputting 0.

The agent can (given enough time) prove all of this, in which case the agent in fact proves the proposition “if the agent outputs 5 the universe outputs 5, and if the agent outputs 10 the universe outputs 0”. And as a result, the agent takes the $5.

We call this a “spurious proof”: the agent takes the $5 because it can prove that if it takes the $10 it has low value, because it takes the $5. It sounds circular, but sadly, is logically correct. More generally, when working in less proof-based settings, we refer to this as a problem of spurious counterfactuals.
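The vacuous-implication step can be checked mechanically. The sketch below is not a theorem prover: as a loudly labeled simplification, it replaces "a proof exists" with "the sentence is true given the agent's actual output", and assumes the sentences have the form "if the agent outputs 5 the universe outputs x, and if the agent outputs 10 the universe outputs y" with x and y drawn from 0, 5, and 10. All function names are hypothetical.

```python
from itertools import product

UTILITIES = (0, 5, 10)


def universe(action):
    """The universe simply outputs whatever the agent outputs."""
    return action


def implies(p, q):
    """Material implication: false only when p is true and q is false."""
    return (not p) or q


def sentence_holds(action, x, y):
    """Truth of "[agent=5 -> universe=x] and [agent=10 -> universe=y]"
    in the world where the agent in fact outputs `action`."""
    u = universe(action)
    return implies(action == 5, u == x) and implies(action == 10, u == y)


def agent_rule(action):
    """What the agent's rule prescribes if `action` is what it in fact outputs.

    Assumption: truth in the actual world stands in for provability.
    """
    for x, y in product(UTILITIES, repeat=2):
        if x > y and sentence_holds(action, x, y):
            return 5
    return 10


# An output is self-consistent if the rule prescribes that very output.
fixed_points = [a for a in (5, 10) if agent_rule(a) == a]
print(fixed_points)  # [5, 10]
```

Both outputs are self-consistent under this stand-in, which is why the bad one cannot be ruled out just by checking for contradictions; in the proof-based setting, Löb's theorem is what forces the agent onto the 5 branch.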

The gen­eral pat­tern is: coun­ter­fac­tu­als may spu­ri­ously mark an ac­tion as not be­ing very good. This makes the AI not take the ac­tion. Depend­ing on how the coun­ter­fac­tu­als work, this may re­move any feed­back which would “cor­rect” the prob­le­matic coun­ter­fac­tual; or, as we saw with proof-based rea­son­ing, it may ac­tively help the spu­ri­ous coun­ter­fac­tual be “true”.

Note that be­cause the proof-based ex­am­ples are of sig­nifi­cant in­ter­est to us, “coun­ter­fac­tu­als” ac­tu­ally have to be coun­ter­log­i­cals; we some­times need to rea­son about log­i­cally im­pos­si­ble “pos­si­bil­ities”. This rules out most ex­ist­ing ac­counts of coun­ter­fac­tual rea­son­ing.

You may have noticed that I slightly cheated. The only thing that broke the symmetry and caused the agent to take the $5 was the fact that “5” was the action that was taken when a proof was found, and “10” was the default.

We could instead consider an agent that looks for any proof at all about what actions lead to what utilities, and then takes the action that is better. This way, which action is taken is dependent on what order we search for proofs.

Let’s assume we search for short proofs first. In this case, we will take the $10, since it is very easy to show that outputting 5 leads to the universe outputting 5 and outputting 10 leads to the universe outputting 10.

The prob­lem is that spu­ri­ous proofs can be short too, and don’t get much longer when the uni­verse gets harder to pre­dict. If we re­place the uni­verse with one that is prov­ably func­tion­ally the same, but is harder to pre­dict, the short­est proof will short-cir­cuit the com­pli­cated uni­verse and be spu­ri­ous.

Peo­ple of­ten try to solve the prob­lem of coun­ter­fac­tu­als by sug­gest­ing that there will always be some un­cer­tainty. An AI may know its source code perfectly, but it can’t perfectly know the hard­ware it is run­ning on.

Does adding a lit­tle un­cer­tainty solve the prob­lem? Often not:

• The proof of the spu­ri­ous coun­ter­fac­tual of­ten still goes through; if you think you are in a five-and-ten prob­lem with a 95% cer­tainty, you can have the usual prob­lem within that 95%.

• Ad­ding un­cer­tainty to make coun­ter­fac­tu­als well-defined doesn’t get you any guaran­tee that the coun­ter­fac­tu­als will be rea­son­able. Hard­ware failures aren’t of­ten what you want to ex­pect when con­sid­er­ing al­ter­nate ac­tions.

Con­sider this sce­nario: You are con­fi­dent that you al­most always take the left path. How­ever, it is pos­si­ble (though un­likely) for a cos­mic ray to dam­age your cir­cuits, in which case you could go right—but you would then be in­sane, which would have many other bad con­se­quences.

If this rea­son­ing in it­self is why you always go left, you’ve gone wrong.

Sim­ply en­sur­ing that the agent has some un­cer­tainty about its ac­tions doesn’t en­sure that the agent will have re­motely rea­son­able coun­ter­fac­tual ex­pec­ta­tions. How­ever, one thing we can try in­stead is to en­sure the agent ac­tu­ally takes each ac­tion with some prob­a­bil­ity. This strat­egy is called ε-ex­plo­ra­tion.

ε-ex­plo­ra­tion en­sures that if an agent plays similar games on enough oc­ca­sions, it can even­tu­ally learn re­al­is­tic coun­ter­fac­tu­als (mod­ulo a con­cern of re­al­iz­abil­ity which we will get to later).

ε-ex­plo­ra­tion only works if it en­sures that the agent it­self can’t pre­dict whether it is about to ε-ex­plore. In fact, a good way to im­ple­ment ε-ex­plo­ra­tion is via the rule “if the agent is too sure about its ac­tion, it takes a differ­ent one”.

From a log­i­cal per­spec­tive, the un­pre­dictabil­ity of ε-ex­plo­ra­tion is what pre­vents the prob­lems we’ve been dis­cussing. From a learn­ing-the­o­retic per­spec­tive, if the agent could know it wasn’t about to ex­plore, then it could treat that as a differ­ent case—failing to gen­er­al­ize les­sons from its ex­plo­ra­tion. This gets us back to a situ­a­tion where we have no guaran­tee that the agent will learn bet­ter coun­ter­fac­tu­als. Ex­plo­ra­tion may be the only source of data for some ac­tions, so we need to force the agent to take that data into ac­count, or it may not learn.
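As a rough illustration of how forced exploration restores the feedback that a spurious belief would otherwise cut off, here is a small five-and-ten learner. It is a sketch under stated assumptions: a seeded pseudo-random draw stands in for exploration the agent cannot predict, the starting belief that the $10 pays nothing is invented for the example, and `run_epsilon_explorer` is a hypothetical name.

```python
import random


def run_epsilon_explorer(steps=2000, epsilon=0.05, seed=0):
    """Learn action values from experience, exploring with probability epsilon."""
    rng = random.Random(seed)
    payout = {5: 5.0, 10: 10.0}       # the universe pays out what you take
    estimate = {5: 5.0, 10: 0.0}      # spurious starting belief about the $10
    count = {5: 1, 10: 1}             # each starting belief counts as one sample
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice([5, 10])              # forced exploration
        else:
            action = max(estimate, key=estimate.get)  # business as usual
        reward = payout[action]
        count[action] += 1
        # Incremental running average of the rewards seen for this action.
        estimate[action] += (reward - estimate[action]) / count[action]
    return estimate


explored = run_epsilon_explorer()
stuck = run_epsilon_explorer(epsilon=0.0)
print(max(explored, key=explored.get), max(stuck, key=stuck.get))
```

With `epsilon=0.0` the learner never tries the $10, so nothing ever corrects its estimate; with exploration switched on, the estimate is pulled toward the real payout.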

How­ever, even ε-ex­plo­ra­tion doesn’t seem to get things ex­actly right. Ob­serv­ing the re­sult of ε-ex­plo­ra­tion shows you what hap­pens if you take an ac­tion un­pre­dictably; the con­se­quences of tak­ing that ac­tion as part of busi­ness-as-usual may be differ­ent.

Sup­pose you’re an ε-ex­plorer who lives in a world of ε-ex­plor­ers. You’re ap­ply­ing for a job as a se­cu­rity guard, and you need to con­vince the in­ter­viewer that you’re not the kind of per­son who would run off with the stuff you’re guard­ing. They want to hire some­one who has too much in­tegrity to lie and steal, even if they thought they could get away with it.

Sup­pose the in­ter­viewer is an amaz­ing judge of char­ac­ter—or just has read ac­cess to your source code.

In this situ­a­tion, steal­ing might be a great op­tion as an ε-ex­plo­ra­tion ac­tion, be­cause the in­ter­viewer may not be able to pre­dict your theft, or may not think pun­ish­ment makes sense for a one-off anomaly.

But steal­ing is clearly a bad idea as a nor­mal ac­tion, be­cause you’ll be seen as much less re­li­able and trust­wor­thy.

#### 2.2. View­ing the prob­lem from outside

If we don’t learn coun­ter­fac­tu­als from ε-ex­plo­ra­tion, then, it seems we have no guaran­tee of learn­ing re­al­is­tic coun­ter­fac­tu­als at all. But if we do learn from ε-ex­plo­ra­tion, it ap­pears we still get things wrong in some cases.

Switch­ing to a prob­a­bil­is­tic set­ting doesn’t cause the agent to re­li­ably make “rea­son­able” choices, and nei­ther does forced ex­plo­ra­tion.

But writ­ing down ex­am­ples of “cor­rect” coun­ter­fac­tual rea­son­ing doesn’t seem hard from the out­side!

Maybe that’s be­cause from “out­side” we always have a du­al­is­tic per­spec­tive. We are in fact sit­ting out­side of the prob­lem, and we’ve defined it as a func­tion of an agent.

How­ever, an agent can’t solve the prob­lem in the same way from in­side. From its per­spec­tive, its func­tional re­la­tion­ship with the en­vi­ron­ment isn’t an ob­serv­able fact. This is why coun­ter­fac­tu­als are called “coun­ter­fac­tu­als”, af­ter all.

When I told you about the five-and-ten problem, I first told you about the problem, and then gave you an agent. When one agent doesn’t work well, we could consider a different agent.

Find­ing a way to suc­ceed at a de­ci­sion prob­lem in­volves find­ing an agent that when plugged into the prob­lem takes the right ac­tion. The fact that we can even con­sider putting in differ­ent agents means that we have already carved the uni­verse into an “agent” part, plus the rest of the uni­verse with a hole for the agent—which is most of the work!

Are we just fool­ing our­selves due to the way we set up de­ci­sion prob­lems, then? Are there no “cor­rect” coun­ter­fac­tu­als?

Well, maybe we are fool­ing our­selves. But there is still some­thing we are con­fused about! “Coun­ter­fac­tu­als are sub­jec­tive, in­vented by the agent” doesn’t dis­solve the mys­tery. There is some­thing in­tel­li­gent agents do, in the real world, to make de­ci­sions.

So I’m not talk­ing about agents who know their own ac­tions be­cause I think there’s go­ing to be a big prob­lem with in­tel­li­gent ma­chines in­fer­ring their own ac­tions in the fu­ture. Rather, the pos­si­bil­ity of know­ing your own ac­tions illus­trates some­thing con­fus­ing about de­ter­min­ing the con­se­quences of your ac­tions—a con­fu­sion which shows up even in the very sim­ple case where ev­ery­thing about the world is known and you just need to choose the larger pile of money.

For all that, humans don’t seem to run into any trouble taking the $10.

Can we take any inspiration from how humans make decisions?

Well, suppose you’re actually asked to choose between $10 and $5. You know that you’ll take the $10. How do you reason about what would happen if you took the $5 instead?

It seems easy if you can separate yourself from the world, so that you only think of external consequences (getting $5).

If you think about yourself as well, the counterfactual starts seeming a bit more strange or contradictory. Maybe you have some absurd prediction about what the world would be like if you took the $5, like “I’d have to be blind!”

That’s alright, though. In the end you still see that taking the $5 would lead to bad consequences, and you still take the $10, so you’re doing fine.

The challenge for formal agents is that an agent can be in a similar position, except it is taking the $5, knows it is taking the $5, and can’t figure out that it should be taking the $10 instead, because of the absurd predictions it makes about what happens when it takes the $10.

It seems hard for a human to end up in a situation like that; yet when we try to write down a formal reasoner, we keep running into this kind of problem. So it indeed seems like human decision-making is doing something here that we don’t yet understand.

#### 2.3. Newcomblike problems

If you’re an embedded agent, then you should be able to think about yourself, just like you think about other objects in the environment. And other reasoners in your environment should be able to think about you too.

In the five-and-ten problem, we saw how messy things can get when an agent knows its own action before it acts. But this is hard to avoid for an embedded agent.

It’s especially hard not to know your own action in standard Bayesian settings, which assume logical omniscience. A probability distribution assigns probability 1 to any fact which is logically true. So if a Bayesian agent knows its own source code, then it should know its own action.

However, realistic agents who are not logically omniscient may run into the same problem. Logical omniscience forces the issue, but rejecting logical omniscience doesn’t eliminate the issue.

ε-exploration does seem to solve that problem in many cases, by ensuring that agents have uncertainty about their choices and that the things they expect are based on experience.

However, as we saw in the security guard example, even ε-exploration seems to steer us wrong when the results of exploring randomly differ from the results of acting reliably.

Examples which go wrong in this way seem to involve another part of the environment that behaves like you, such as another agent very similar to yourself, or a sufficiently good model or simulation of you. These are called Newcomblike problems; an example is the Twin Prisoner’s Dilemma mentioned above.

If the five-and-ten problem is about cutting a you-shaped piece out of the world so that the world can be treated as a function of your action, Newcomblike problems are about what to do when there are several approximately you-shaped pieces in the world.

One idea is that exact copies should be treated as 100% under your “logical control”. For approximate models of you, or merely similar agents, control should drop off sharply as logical correlation decreases. But how does this work?

Newcomblike problems are difficult for almost the same reason as the self-reference issues discussed so far: prediction. With strategies such as ε-exploration, we tried to limit the self-knowledge of the agent in an attempt to avoid trouble. But the presence of powerful predictors in the environment reintroduces the trouble. By choosing what information to share, predictors can manipulate the agent and choose their actions for them.

If there is something which can predict you, it might tell you its prediction, or related information, in which case it matters what you do in response to various things you could find out.

Suppose you decide to do the opposite of whatever you’re told. Then it isn’t possible for the scenario to be set up in the first place. Either the predictor isn’t accurate after all, or alternatively, the predictor doesn’t share their prediction with you.

On the other hand, suppose there’s some situation where you do act as predicted. Then the predictor can control how you’ll behave, by controlling what prediction they tell you.

So, on the one hand, a powerful predictor can control you by selecting between the consistent possibilities. On the other hand, you are the one who chooses your pattern of responses in the first place, which means that you can set them up to your best advantage.

#### 2.4. Observation counterfactuals

So far, we’ve been discussing action counterfactuals: how to anticipate consequences of different actions. This discussion of controlling your responses introduces the observation counterfactual: imagining what the world would be like if different facts had been observed.

Even if there is no one telling you a prediction about your future behavior, observation counterfactuals can still play a role in making the right decision. Consider the following game:

Alice receives a card at random which is either High or Low. She may reveal the card if she wishes. Bob then gives his probability $p$ that Alice has a high card. Alice always loses $p^2$ dollars. Bob loses $p^2$ if the card is low, and $(1-p)^2$ if the card is high.

Bob has a proper scoring rule, so does best by giving his true belief. Alice just wants Bob’s belief to be as much toward “low” as possible.

Suppose Alice will play only this one time. She sees a low card. Bob is good at reasoning about Alice, but is in the next room and so can’t read any tells. Should Alice reveal her card?

Since Alice’s card is low, if she shows it to Bob, she will lose no money, which is the best possible outcome. However, this means that in the counterfactual world where Alice sees a high card, she wouldn’t be able to keep the secret; she might as well show her card in that case too, since her reluctance to show it would be as reliable a sign of “high”.

On the other hand, if Alice doesn’t show her card, she loses 25¢, but then she can use the same strategy in the other world, rather than losing $1. So, before playing the game, Alice would want to visibly commit to not reveal; this makes expected loss 25¢, whereas the other strategy has expected loss 50¢. By taking observation counterfactuals into account, Alice is able to keep secrets; without them, Bob could perfectly infer her card from her actions.

This game is equiv­a­lent to the de­ci­sion prob­lem called coun­ter­fac­tual mug­ging.
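The arithmetic behind Alice’s choice can be checked with a short sketch. It assumes the quadratic loss implied by the 25¢ and $1 figures above (Alice loses `p**2` dollars when Bob reports probability `p` of High), a fair 50/50 deal, and a Bob who knows Alice’s policy and reports his true posterior. The function and policy names are illustrative and do not come from the original discussion.

```python
from fractions import Fraction

HIGH, LOW = "High", "Low"
PRIOR_HIGH = Fraction(1, 2)  # assumption: the card is dealt uniformly at random
WEIGHT = {HIGH: PRIOR_HIGH, LOW: 1 - PRIOR_HIGH}


def bob_belief(policy, card):
    """Bob's probability of High after seeing what Alice does with `card`.

    `policy` maps each card to True (reveal) or False (hide). Bob is assumed
    to know the policy, so a hidden card is still evidence about its value.
    """
    if policy[card]:
        return Fraction(1) if card == HIGH else Fraction(0)
    # Card hidden: condition on the set of cards that Alice would hide.
    hidden = [c for c in (HIGH, LOW) if not policy[c]]
    return sum(WEIGHT[c] for c in hidden if c == HIGH) / sum(WEIGHT[c] for c in hidden)


def alice_expected_loss(policy):
    """Alice's expected loss in dollars, averaged over the deal."""
    return sum(WEIGHT[c] * bob_belief(policy, c) ** 2 for c in (HIGH, LOW))


policies = {
    "never reveal": {HIGH: False, LOW: False},
    "reveal only Low": {HIGH: False, LOW: True},
    "always reveal": {HIGH: True, LOW: True},
}
for name, policy in policies.items():
    print(f"{name}: expected loss ${float(alice_expected_loss(policy)):.2f}")
```

Under these assumptions, never revealing costs 25¢ in expectation, while either revealing policy costs 50¢, because hiding only the High card tells Bob exactly as much as showing it.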

Up­date­less de­ci­sion the­ory (UDT) is a pro­posed de­ci­sion the­ory which can keep se­crets in the high/​low card game. UDT does this by recom­mend­ing that the agent do what­ever would have seemed wis­est be­fore—what­ever your ear­lier self would have com­mit­ted to do.

As it hap­pens, UDT also performs well in New­comblike prob­lems.

Could some­thing like UDT be re­lated to what hu­mans are do­ing, if only im­plic­itly, to get good re­sults on de­ci­sion prob­lems? Or, if it’s not, could it still be a good model for think­ing about de­ci­sion-mak­ing?

Un­for­tu­nately, there are still some pretty deep difficul­ties here. UDT is an el­e­gant solu­tion to a fairly broad class of de­ci­sion prob­lems, but it only makes sense if the ear­lier self can fore­see all pos­si­ble situ­a­tions.

This works fine in a Bayesian set­ting where the prior already con­tains all pos­si­bil­ities within it­self. How­ever, there may be no way to do this in a re­al­is­tic em­bed­ded set­ting. An agent has to be able to think of new pos­si­bil­ities—mean­ing that its ear­lier self doesn’t know enough to make all the de­ci­sions.

And with that, we find our­selves squarely fac­ing the prob­lem of em­bed­ded world-mod­els.

### 3. Embed­ded world-models

An agent which is larger than its en­vi­ron­ment can:

• Hold an ex­act model of the en­vi­ron­ment in its head.

• Think through the con­se­quences of ev­ery po­ten­tial course of ac­tion.

• If it doesn’t know the en­vi­ron­ment perfectly, hold ev­ery pos­si­ble way the en­vi­ron­ment could be in its head, as is the case with Bayesian un­cer­tainty.

All of these are typ­i­cal of no­tions of ra­tio­nal agency.

An em­bed­ded agent can’t do any of those things, at least not in any straight­for­ward way.

One difficulty is that, since the agent is part of the en­vi­ron­ment, mod­el­ing the en­vi­ron­ment in ev­ery de­tail would re­quire the agent to model it­self in ev­ery de­tail, which would re­quire the agent’s self-model to be as “big” as the whole agent. An agent can’t fit in­side its own head.

The lack of a crisp agent/​en­vi­ron­ment bound­ary forces us to grap­ple with para­doxes of self-refer­ence. As if rep­re­sent­ing the rest of the world weren’t already hard enough.

Embed­ded World-Models have to rep­re­sent the world in a way more ap­pro­pri­ate for em­bed­ded agents. Prob­lems in this cluster in­clude:

• the “re­al­iz­abil­ity” /​ “grain of truth” prob­lem: the real world isn’t in the agent’s hy­poth­e­sis space

• log­i­cal uncertainty

• high-level models

• multi-level models

• on­tolog­i­cal crises

• nat­u­ral­ized in­duc­tion, the prob­lem that the agent must in­cor­po­rate its model of it­self into its world-model

• anthropic reasoning, the problem of reasoning about how many copies of yourself exist

#### 3.1. Realizability

In a Bayesian set­ting, where an agent’s un­cer­tainty is quan­tified by a prob­a­bil­ity dis­tri­bu­tion over pos­si­ble wor­lds, a com­mon as­sump­tion is “re­al­iz­abil­ity”: the true un­der­ly­ing en­vi­ron­ment which is gen­er­at­ing the ob­ser­va­tions is as­sumed to have at least some prob­a­bil­ity in the prior.

In game the­ory, this same prop­erty is de­scribed by say­ing a prior has a “grain of truth”. It should be noted, though, that there are ad­di­tional bar­ri­ers to get­ting this prop­erty in a game-the­o­retic set­ting; so, in their com­mon us­age cases, “grain of truth” is tech­ni­cally de­mand­ing while “re­al­iz­abil­ity” is a tech­ni­cal con­ve­nience.

Realizability is not totally necessary in order for Bayesian reasoning to make sense. If you think of a set of hypotheses as “experts”, and the current posterior probability as how much you “trust” each expert, then learning according to Bayes’ Law, $P(h \mid e) = \frac{P(e \mid h) \cdot P(h)}{P(e)}$, ensures a relative bounded loss property.

Specifically, if you use a prior $\pi$, the amount worse you are in comparison to each expert $h$ (measured in log loss) is at most $-\log \pi(h)$, since you assign at least probability $\pi(h) \cdot h(e)$ to seeing a sequence of evidence $e$. Intuitively, $\pi(h)$ is your initial trust in expert $h$, and in each case where it is even a little bit more correct than you, you increase your trust accordingly. The way you do this ensures you assign an expert probability 1 and hence copy it precisely before you lose more than $-\log \pi(h)$ compared to it.
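As a rough numerical check of the bounded loss property, the sketch below runs a Bayesian mixture over a few fixed-bias coin “experts” on a bit sequence that none of them is assumed to have generated, and compares cumulative log loss. The experts, the uniform prior, and the data are all made up for illustration. The bound being checked is the standard one: the mixture’s log loss exceeds each expert’s by at most the negative log of that expert’s prior weight.

```python
import math

# Hypothetical experts: each predicts "1" with a fixed probability.
experts = {"mostly 0": 0.2, "fair": 0.5, "mostly 1": 0.8}
prior = {name: 1 / len(experts) for name in experts}  # uniform prior (assumption)

# An arbitrary bit sequence; no expert is assumed to be the true source.
data = [1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1]


def likelihood(p_one, bit):
    """Probability an expert with bias p_one assigns to the observed bit."""
    return p_one if bit == 1 else 1 - p_one


posterior = dict(prior)
mixture_loss = 0.0
expert_loss = {name: 0.0 for name in experts}

for bit in data:
    # The mixture predicts with the posterior-weighted average of the experts.
    p_mix = sum(posterior[n] * likelihood(experts[n], bit) for n in experts)
    mixture_loss += -math.log(p_mix)
    for n in experts:
        expert_loss[n] += -math.log(likelihood(experts[n], bit))
    # Bayes' law: reweight each expert by how well it predicted this bit.
    posterior = {n: posterior[n] * likelihood(experts[n], bit) / p_mix for n in experts}

for n in experts:
    regret = mixture_loss - expert_loss[n]
    bound = -math.log(prior[n])
    print(f"{n}: regret {regret:.3f}, bound {bound:.3f}")
```

Nothing in the loop depends on the data coming from one of the experts, which is the sense in which the guarantee does not need realizability.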

The prior AIXI is based on is the Solomonoff prior. It is defined as the out­put of a uni­ver­sal Tur­ing ma­chine (UTM) whose in­puts are coin-flips.

In other words, feed a UTM a ran­dom pro­gram. Nor­mally, you’d think of a UTM as only be­ing able to simu­late de­ter­minis­tic ma­chines. Here, how­ever, the ini­tial in­puts can in­struct the UTM to use the rest of the in­finite in­put tape as a source of ran­dom­ness to simu­late a stochas­tic Tur­ing ma­chine.

Com­bin­ing this with the pre­vi­ous idea about view­ing Bayesian learn­ing as a way of al­lo­cat­ing “trust” to “ex­perts” which meets a bounded loss con­di­tion, we can see the Solomonoff prior as a kind of ideal ma­chine learn­ing al­gorithm which can learn to act like any al­gorithm you might come up with, no mat­ter how clever.

For this rea­son, we shouldn’t nec­es­sar­ily think of AIXI as “as­sum­ing the world is com­putable”, even though it rea­sons via a prior over com­pu­ta­tions. It’s get­ting bounded loss on its pre­dic­tive ac­cu­racy as com­pared with any com­putable pre­dic­tor. We should rather say that AIXI as­sumes all pos­si­ble al­gorithms are com­putable, not that the world is.

How­ever, lack­ing re­al­iz­abil­ity can cause trou­ble if you are look­ing for any­thing more than bounded-loss pre­dic­tive ac­cu­racy:

• the pos­te­rior can os­cillate for­ever;

• prob­a­bil­ities may not be cal­ibrated;

• es­ti­mates of statis­tics such as the mean may be ar­bi­trar­ily bad;

• es­ti­mates of la­tent vari­ables may be bad;

• and the iden­ti­fi­ca­tion of causal struc­ture may not work.

So does AIXI perform well with­out a re­al­iz­abil­ity as­sump­tion? We don’t know. De­spite get­ting bounded loss for pre­dic­tions with­out re­al­iz­abil­ity, ex­ist­ing op­ti­mal­ity re­sults for its ac­tions re­quire an added re­al­iz­abil­ity as­sump­tion.

First, if the en­vi­ron­ment re­ally is sam­pled from the Solomonoff dis­tri­bu­tion, AIXI gets the max­i­mum ex­pected re­ward. But this is fairly triv­ial; it is es­sen­tially the defi­ni­tion of AIXI.

Se­cond, if we mod­ify AIXI to take some­what ran­dom­ized ac­tions—Thomp­son sam­pling—there is an asymp­totic op­ti­mal­ity re­sult for en­vi­ron­ments which act like any stochas­tic Tur­ing ma­chine.

So, ei­ther way, re­al­iz­abil­ity was as­sumed in or­der to prove any­thing. (See Jan Leike, Non­para­met­ric Gen­eral Re­in­force­ment Learn­ing.)

But the con­cern I’m point­ing at is not “the world might be un­com­putable, so we don’t know if AIXI will do well”; this is more of an illus­tra­tive case. The con­cern is that AIXI is only able to define in­tel­li­gence or ra­tio­nal­ity by con­struct­ing an agent much, much big­ger than the en­vi­ron­ment which it has to learn about and act within.

Lau­rent Orseau pro­vides a way of think­ing about this in “Space-Time Embed­ded In­tel­li­gence”. How­ever, his ap­proach defines the in­tel­li­gence of an agent in terms of a sort of su­per-in­tel­li­gent de­signer who thinks about re­al­ity from out­side, se­lect­ing an agent to place into the en­vi­ron­ment.

Embedded agents don’t have the luxury of stepping outside of the universe to think about how to think. What we would like would be a theory of rational belief for situated agents which provides foundations as strong as the foundations Bayesianism provides for dualistic agents.

Imag­ine a com­puter sci­ence the­ory per­son who is hav­ing a dis­agree­ment with a pro­gram­mer. The the­ory per­son is mak­ing use of an ab­stract model. The pro­gram­mer is com­plain­ing that the ab­stract model isn’t some­thing you would ever run, be­cause it is com­pu­ta­tion­ally in­tractable. The the­ory per­son re­sponds that the point isn’t to ever run it. Rather, the point is to un­der­stand some phe­nomenon which will also be rele­vant to more tractable things which you would want to run.

I bring this up in or­der to em­pha­size that my per­spec­tive is a lot more like the the­ory per­son’s. I’m not talk­ing about AIXI to say “AIXI is an ideal­iza­tion you can’t run”. The an­swers to the puz­zles I’m point­ing at don’t need to run. I just want to un­der­stand some phe­nom­ena.

However, sometimes a thing that makes a theoretical model less tractable also makes that model too different from the phenomenon we’re interested in.

The way AIXI wins games is by as­sum­ing we can do true Bayesian up­dat­ing over a hy­poth­e­sis space, as­sum­ing the world is in our hy­poth­e­sis space, etc. So it can tell us some­thing about the as­pect of re­al­is­tic agency that’s ap­prox­i­mately do­ing Bayesian up­dat­ing over an ap­prox­i­mately-good-enough hy­poth­e­sis space. But em­bed­ded agents don’t just need ap­prox­i­mate solu­tions to that prob­lem; they need to solve sev­eral prob­lems that are differ­ent in kind from that prob­lem.

#### 3.2. Self-reference

One ma­jor ob­sta­cle a the­ory of em­bed­ded agency must deal with is self-refer­ence.

Para­doxes of self-refer­ence such as the liar para­dox make it not just wildly im­prac­ti­cal, but in a cer­tain sense im­pos­si­ble for an agent’s world-model to ac­cu­rately re­flect the world.

The liar para­dox con­cerns the sta­tus of the sen­tence “This sen­tence is not true”. If it were true, it must be false; and if not true, it must be true.

The difficulty comes in part from try­ing to draw a map of a ter­ri­tory which in­cludes the map it­self.

This is fine if the world “holds still” for us; but be­cause the map is in the world, differ­ent maps cre­ate differ­ent wor­lds.

Sup­pose our goal is to make an ac­cu­rate map of the fi­nal route of a road which is cur­rently un­der con­struc­tion. Sup­pose we also know that the con­struc­tion team will get to see our map, and that con­struc­tion will pro­ceed so as to dis­prove what­ever map we make. This puts us in a liar-para­dox-like situ­a­tion.

Prob­lems of this kind be­come rele­vant for de­ci­sion-mak­ing in the the­ory of games. A sim­ple game of rock-pa­per-scis­sors can in­tro­duce a liar para­dox if the play­ers try to win, and can pre­dict each other bet­ter than chance.

Game the­ory solves this type of prob­lem with game-the­o­retic equil­ibria. But the prob­lem ends up com­ing back in a differ­ent way.

I men­tioned that the prob­lem of re­al­iz­abil­ity takes on a differ­ent char­ac­ter in the con­text of game the­ory. In an ML set­ting, re­al­iz­abil­ity is a po­ten­tially un­re­al­is­tic as­sump­tion, but can usu­ally be as­sumed con­sis­tently nonethe­less.

In game the­ory, on the other hand, the as­sump­tion it­self may be in­con­sis­tent. This is be­cause games com­monly yield para­doxes of self-refer­ence.

Be­cause there are so many agents, it is no longer pos­si­ble in game the­ory to con­ve­niently make an “agent” a thing which is larger than a world. So game the­o­rists are forced to in­ves­ti­gate no­tions of ra­tio­nal agency which can han­dle a large world.

Un­for­tu­nately, this is done by split­ting up the world into “agent” parts and “non-agent” parts, and han­dling the agents in a spe­cial way. This is al­most as bad as du­al­is­tic mod­els of agency.

In rock-paper-scissors, the liar paradox is resolved by stipulating that each player play each move with probability $1/3$. If one player plays this way, then the other loses nothing by doing so. This way of introducing probabilistic play to resolve would-be paradoxes of game theory is called a Nash equilibrium.
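To make the indifference claim concrete, here is a small sketch that computes expected payoffs in rock-paper-scissors when the opponent mixes uniformly, playing each move with probability 1/3. The payoff convention (+1 for a win, 0 for a tie, -1 for a loss) is an assumption chosen for illustration.

```python
from fractions import Fraction

MOVES = ("rock", "paper", "scissors")
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}


def payoff(mine, theirs):
    """+1 for a win, -1 for a loss, 0 for a tie (assumed convention)."""
    if mine == theirs:
        return 0
    return 1 if BEATS[mine] == theirs else -1


def expected_payoff(my_mix, their_mix):
    """Expected payoff when both players randomize independently."""
    return sum(
        my_mix[a] * their_mix[b] * payoff(a, b) for a in MOVES for b in MOVES
    )


uniform = {m: Fraction(1, 3) for m in MOVES}

# Against the uniform mix, every pure strategy earns exactly zero, so no
# deviation is profitable and uniform play is a best response to itself.
for move in MOVES:
    pure = {m: Fraction(int(m == move)) for m in MOVES}
    print(move, expected_payoff(pure, uniform))

# By contrast, a predictable opponent can be exploited.
always_rock = {"rock": Fraction(1), "paper": Fraction(0), "scissors": Fraction(0)}
always_paper = {"rock": Fraction(0), "paper": Fraction(1), "scissors": Fraction(0)}
print("paper vs always-rock:", expected_payoff(always_paper, always_rock))
```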

We can use Nash equil­ibria to pre­vent the as­sump­tion that the agents cor­rectly un­der­stand the world they’re in from be­ing in­con­sis­tent. How­ever, that works just by tel­ling the agents what the world looks like. What if we want to model agents who learn about the world, more like AIXI?

The grain of truth problem is the problem of formulating a reasonably bounded prior probability distribution which would allow agents playing games to place some positive probability on each other’s true (probabilistic) behavior, without knowing it precisely from the start.

Un­til re­cently, known solu­tions to the prob­lem were quite limited. Benja Fallen­stein, Jes­sica Tay­lor, and Paul Chris­ti­ano’s “Reflec­tive Or­a­cles: A Foun­da­tion for Clas­si­cal Game The­ory” pro­vides a very gen­eral solu­tion. For de­tails, see “A For­mal Solu­tion to the Grain of Truth Prob­lem” by Jan Leike, Jes­sica Tay­lor, and Benja Fallen­stein.

You might think that stochas­tic Tur­ing ma­chines can rep­re­sent Nash equil­ibria just fine.

But if you’re try­ing to pro­duce Nash equil­ibria as a re­sult of rea­son­ing about other agents, you’ll run into trou­ble. If each agent mod­els the other’s com­pu­ta­tion and tries to run it to see what the other agent does, you’ve just got an in­finite loop.
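The regress can be seen in a toy sketch where a rock-paper-scissors player tries to best-respond by simulating an opponent who reasons in exactly the same way. The step budget is an artificial device added here so the example terminates; without it, the calls would nest forever.

```python
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}
# For each move, the move that beats it.
COUNTER = {loser: winner for winner, loser in BEATS.items()}


def best_response_by_simulation(budget):
    """Pick the move that beats whatever the opponent's identical reasoning picks.

    Returns None if the simulation budget runs out first. The reasoning has
    no base case, so the budget always runs out.
    """
    if budget == 0:
        return None
    opponent_move = best_response_by_simulation(budget - 1)  # simulate the other player
    if opponent_move is None:
        return None
    return COUNTER[opponent_move]


for budget in (10, 100, 500):
    print(budget, best_response_by_simulation(budget))
```

However large the budget, no move is ever returned: each level of simulation is waiting on the level beneath it.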

There are some ques­tions Tur­ing ma­chines just can’t an­swer—in par­tic­u­lar, ques­tions about the be­hav­ior of Tur­ing ma­chines. The halt­ing prob­lem is the clas­sic ex­am­ple.

Tur­ing stud­ied “or­a­cle ma­chines” to ex­am­ine what would hap­pen if we could an­swer such ques­tions. An or­a­cle is like a book con­tain­ing some an­swers to ques­tions which we were un­able to an­swer be­fore.

But or­di­nar­ily, we get a hi­er­ar­chy. Type B ma­chines can an­swer ques­tions about whether type A ma­chines halt, type C ma­chines have the an­swers about types A and B, and so on, but no ma­chines have an­swers about their own type.

Reflec­tive or­a­cles work by twist­ing the or­di­nary Tur­ing uni­verse back on it­self, so that rather than an in­finite hi­er­ar­chy of ever-stronger or­a­cles, you define an or­a­cle that serves as its own or­a­cle ma­chine.

This would nor­mally in­tro­duce con­tra­dic­tions, but re­flec­tive or­a­cles avoid this by ran­dom­iz­ing their out­put in cases where they would run into para­doxes. So re­flec­tive or­a­cle ma­chines are stochas­tic, but they’re more pow­er­ful than reg­u­lar stochas­tic Tur­ing ma­chines.
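A toy version of the randomization trick, under heavy simplification: consider a single machine that asks the oracle “do I output 1 with probability greater than 1/2?” and then outputs the opposite of the answer. The sketch below searches over how often the oracle says “yes” and keeps only the values that are consistent with the behavior they induce. The query format and the consistency rule are a simplified reading of the reflective-oracle idea, not the construction from the cited paper.

```python
from fractions import Fraction

THRESHOLD = Fraction(1, 2)


def machine_prob_of_one(oracle_yes_prob):
    """The machine outputs 1 exactly when the oracle answers 'no'."""
    return 1 - oracle_yes_prob


def is_consistent(oracle_yes_prob):
    """Check the oracle's answer frequency against the behavior it induces.

    Assumed rule: the oracle must say 'yes' if the true probability is above
    the threshold, 'no' if it is below, and may randomize freely at equality.
    """
    p_one = machine_prob_of_one(oracle_yes_prob)
    if p_one > THRESHOLD:
        return oracle_yes_prob == 1
    if p_one < THRESHOLD:
        return oracle_yes_prob == 0
    return True


candidates = [Fraction(k, 10) for k in range(11)]
print([str(q) for q in candidates if is_consistent(q)])
```

A deterministic “yes” or “no” is self-defeating here, so the only consistent candidate is the one where the oracle answers “yes” half the time.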

That’s how re­flec­tive or­a­cles ad­dress the prob­lems we men­tioned ear­lier of a map that’s it­self part of the ter­ri­tory: ran­dom­ize.

Reflective oracles also solve the problem with game-theoretic notions of rationality I mentioned earlier. They allow agents to be reasoned about in the same manner as other parts of the environment, rather than treating them as a fundamentally special case. They’re all just computations-with-oracle-access.

How­ever, mod­els of ra­tio­nal agents based on re­flec­tive or­a­cles still have sev­eral ma­jor limi­ta­tions. One of these is that agents are re­quired to have un­limited pro­cess­ing power, just like AIXI, and so are as­sumed to know all of the con­se­quences of their own be­liefs.

In fact, know­ing all the con­se­quences of your be­liefs—a prop­erty known as log­i­cal om­ni­science—turns out to be rather core to clas­si­cal Bayesian ra­tio­nal­ity.

#### 3.3. Log­i­cal uncertainty

So far, I’ve been talk­ing in a fairly naive way about the agent hav­ing be­liefs about hy­pothe­ses, and the real world be­ing or not be­ing in the hy­poth­e­sis space.

It isn’t re­ally clear what any of that means.

Depend­ing on how we define things, it may ac­tu­ally be quite pos­si­ble for an agent to be smaller than the world and yet con­tain the right world-model—it might know the true physics and ini­tial con­di­tions, but only be ca­pa­ble of in­fer­ring their con­se­quences very ap­prox­i­mately.

Hu­mans are cer­tainly used to liv­ing with short­hands and ap­prox­i­ma­tions. But re­al­is­tic as this sce­nario may be, it is not in line with what it usu­ally means for a Bayesian to know some­thing. A Bayesian knows the con­se­quences of all of its be­liefs.

Uncer­tainty about the con­se­quences of your be­liefs is log­i­cal un­cer­tainty. In this case, the agent might be em­piri­cally cer­tain of a unique math­e­mat­i­cal de­scrip­tion pin­point­ing which uni­verse she’s in, while be­ing log­i­cally un­cer­tain of most con­se­quences of that de­scrip­tion.

Model­ing log­i­cal un­cer­tainty re­quires us to have a com­bined the­ory of logic (rea­son­ing about im­pli­ca­tions) and prob­a­bil­ity (de­grees of be­lief).

Logic and prob­a­bil­ity the­ory are two great triumphs in the cod­ifi­ca­tion of ra­tio­nal thought. Logic pro­vides the best tools for think­ing about self-refer­ence, while prob­a­bil­ity pro­vides the best tools for think­ing about de­ci­sion-mak­ing. How­ever, the two don’t work to­gether as well as one might think.

They may seem superficially compatible, since probability theory is an extension of Boolean logic. However, Gödel’s first incompleteness theorem shows that any sufficiently rich, consistent logical system is incomplete: not only does it fail to decide every sentence as true or false, but it also has no consistent computable extension which manages to do so.

(See the post “An Un­trol­lable Math­e­mat­i­cian Illus­trated” for more illus­tra­tion of how this messes with prob­a­bil­ity the­ory.)

This also ap­plies to prob­a­bil­ity dis­tri­bu­tions: no com­putable dis­tri­bu­tion can as­sign prob­a­bil­ities in a way that’s con­sis­tent with a suffi­ciently rich the­ory. This forces us to choose be­tween us­ing an un­com­putable dis­tri­bu­tion, or us­ing a dis­tri­bu­tion which is in­con­sis­tent.

Sounds like an easy choice, right? The in­con­sis­tent the­ory is at least com­putable, and we are af­ter all try­ing to de­velop a the­ory of log­i­cal non-om­ni­science. We can just con­tinue to up­date on facts which we prove, bring­ing us closer and closer to con­sis­tency.

Un­for­tu­nately, this doesn’t work out so well, for rea­sons which con­nect back to re­al­iz­abil­ity. Re­mem­ber that there are no com­putable prob­a­bil­ity dis­tri­bu­tions con­sis­tent with all con­se­quences of sound the­o­ries. So our non-om­ni­scient prior doesn’t even con­tain a sin­gle cor­rect hy­poth­e­sis.

This causes patholog­i­cal be­hav­ior as we con­di­tion on more and more true math­e­mat­i­cal be­liefs. Beliefs wildly os­cillate rather than ap­proach­ing rea­son­able es­ti­mates.

Tak­ing a Bayesian prior on math­e­mat­ics, and up­dat­ing on what­ever we prove, does not seem to cap­ture math­e­mat­i­cal in­tu­ition and heuris­tic con­jec­ture very well—un­less we re­strict the do­main and craft a sen­si­ble prior.

Prob­a­bil­ity is like a scale, with wor­lds as weights. An ob­ser­va­tion elimi­nates some of the pos­si­ble wor­lds, re­mov­ing weights and shift­ing the bal­ance of be­liefs.

Logic is like a tree, grow­ing from the seed of ax­ioms ac­cord­ing to in­fer­ence rules. For real-world agents, the pro­cess of growth is never com­plete; you never know all the con­se­quences of each be­lief.

Without know­ing how to com­bine the two, we can’t char­ac­ter­ize rea­son­ing prob­a­bil­is­ti­cally about math. But the “scale ver­sus tree” prob­lem also means that we don’t know how or­di­nary em­piri­cal rea­son­ing works.

Bayesian hy­poth­e­sis test­ing re­quires each hy­poth­e­sis to clearly de­clare which prob­a­bil­ities it as­signs to which ob­ser­va­tions. That way, you know how much to rescale the odds when you make an ob­ser­va­tion. If we don’t know the con­se­quences of a be­lief, we don’t know how much credit to give it for mak­ing pre­dic­tions.
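In odds form, the rescaling step looks like the sketch below. The two hypotheses and their likelihoods are invented for illustration; the point is only that the update needs each hypothesis to supply a definite probability for the observation.

```python
from fractions import Fraction


def update_odds(prior_odds, p_obs_given_h1, p_obs_given_h2):
    """Multiply the odds of H1 against H2 by the likelihood ratio of the observation."""
    return prior_odds * Fraction(p_obs_given_h1) / Fraction(p_obs_given_h2)


# Hypothetical setup: H1 says a coin lands heads 3/4 of the time, H2 says 1/2.
odds = Fraction(1, 1)  # start at even odds
for flip in "HHTH":
    p_h1 = Fraction(3, 4) if flip == "H" else Fraction(1, 4)
    p_h2 = Fraction(1, 2)
    odds = update_odds(odds, p_h1, p_h2)

print(odds)  # odds of H1 against H2 after the four flips
```

If a hypothesis cannot say what probability it assigns to the observation, because working out that consequence is itself an open logical question, there is nothing to plug in for the likelihood.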

This is like not know­ing where to place the weights on the scales of prob­a­bil­ity. We could try putting weights on both sides un­til a proof rules one out, but then the be­liefs just os­cillate for­ever rather than do­ing any­thing use­ful.

This forces us to grap­ple di­rectly with the prob­lem of a world that’s larger than the agent. We want some no­tion of bound­edly ra­tio­nal be­liefs about un­cer­tain con­se­quences; but any com­putable be­liefs about logic must have left out some­thing, since the tree of log­i­cal im­pli­ca­tions will grow larger than any con­tainer.

For a Bayesian, the scales of prob­a­bil­ity are bal­anced in pre­cisely such a way that no Dutch book can be made against them—no se­quence of bets that are a sure loss. But you can only ac­count for all Dutch books if you know all the con­se­quences of your be­liefs. Ab­sent that, some­one who has ex­plored other parts of the tree can Dutch-book you.
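A minimal Dutch book, for concreteness. The agent below is assumed to treat its stated probability for a claim as a fair price for a ticket that pays 1 dollar if the claim is true; the incoherent prices are invented for illustration.

```python
from fractions import Fraction

# Incoherent beliefs: the probabilities of A and not-A sum to more than 1.
belief = {"A": Fraction(3, 5), "not A": Fraction(3, 5)}


def agent_net_payoff(belief, a_is_true):
    """Agent buys one ticket on each claim at its own price; each pays 1 if true."""
    cost = belief["A"] + belief["not A"]
    winnings = (1 if a_is_true else 0) + (0 if a_is_true else 1)
    return winnings - cost


# The agent loses the same amount whether A turns out true or false.
for a_is_true in (True, False):
    print(a_is_true, agent_net_payoff(belief, a_is_true))
```

This incoherence is easy to spot. The logically uncertain case is harder because the inconsistency may only show up deep in the tree of consequences, in a part the agent has not yet explored.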

But hu­man math­e­mat­i­ci­ans don’t seem to run into any spe­cial difficulty in rea­son­ing about math­e­mat­i­cal un­cer­tainty, any more than we do with em­piri­cal un­cer­tainty. So what char­ac­ter­izes good rea­son­ing un­der math­e­mat­i­cal un­cer­tainty, if not im­mu­nity to mak­ing bad bets?

One an­swer is to weaken the no­tion of Dutch books so that we only al­low bets based on quickly com­putable parts of the tree. This is one of the ideas be­hind Garrabrant et al.’s “Log­i­cal In­duc­tion”, an early at­tempt at defin­ing some­thing like “Solomonoff in­duc­tion, but for rea­son­ing that in­cor­po­rates math­e­mat­i­cal un­cer­tainty”.

#### 3.4. High-level models

Another con­se­quence of the fact that the world is big­ger than you is that you need to be able to use high-level world mod­els: mod­els which in­volve things like ta­bles and chairs.

This is re­lated to the clas­si­cal sym­bol ground­ing prob­lem; but since we want a for­mal anal­y­sis which in­creases our trust in some sys­tem, the kind of model which in­ter­ests us is some­what differ­ent. This also re­lates to trans­parency and in­formed over­sight: world-mod­els should be made out of un­der­stand­able parts.

A re­lated ques­tion is how high-level rea­son­ing and low-level rea­son­ing re­late to each other and to in­ter­me­di­ate lev­els: multi-level world mod­els.

Stan­dard prob­a­bil­is­tic rea­son­ing doesn’t provide a very good ac­count of this sort of thing. It’s as though you have differ­ent Bayes nets which de­scribe the world at differ­ent lev­els of ac­cu­racy, and pro­cess­ing power limi­ta­tions force you to mostly use the less ac­cu­rate ones, so you have to de­cide how to jump to the more ac­cu­rate as needed.

Ad­di­tion­ally, the mod­els at differ­ent lev­els don’t line up perfectly, so you have a prob­lem of trans­lat­ing be­tween them; and the mod­els may have se­ri­ous con­tra­dic­tions be­tween them. This might be fine, since high-level mod­els are un­der­stood to be ap­prox­i­ma­tions any­way, or it could sig­nal a se­ri­ous prob­lem in the higher- or lower-level mod­els, re­quiring their re­vi­sion.

This is es­pe­cially in­ter­est­ing in the case of on­tolog­i­cal crises, in which ob­jects we value turn out not to be a part of “bet­ter” mod­els of the world.

It seems fair to say that everything humans value exists in high-level models only, which from a reductionistic perspective are “less real” than atoms and quarks. However, because our values aren’t defined on the low level, we are able to keep our values even when our knowledge of the low level radically shifts. (We would also like to be able to say something about what happens to values if the high level radically shifts.)

Another crit­i­cal as­pect of em­bed­ded world mod­els is that the agent it­self must be in the model, since the agent seeks to un­der­stand the world, and the world can­not be fully sep­a­rated from one­self. This opens the door to difficult prob­lems of self-refer­ence and an­thropic de­ci­sion the­ory.

Nat­u­ral­ized in­duc­tion is the prob­lem of learn­ing world-mod­els which in­clude your­self in the en­vi­ron­ment. This is challeng­ing be­cause (as Cas­par Oester­held has put it) there is a type mis­match be­tween “men­tal stuff” and “physics stuff”.

AIXI con­ceives of the en­vi­ron­ment as if it were made with a slot which the agent fits into. We might in­tu­itively rea­son in this way, but we can also un­der­stand a phys­i­cal per­spec­tive from which this looks like a bad model. We might imag­ine in­stead that the agent sep­a­rately rep­re­sents: self-knowl­edge available to in­tro­spec­tion; hy­pothe­ses about what the uni­verse is like; and a “bridg­ing hy­poth­e­sis” con­nect­ing the two.

There are in­ter­est­ing ques­tions of how this could work. There’s also the ques­tion of whether this is the right struc­ture at all. It’s cer­tainly not how I imag­ine ba­bies learn­ing.

Thomas Nagel would say that this way of ap­proach­ing the prob­lem in­volves “views from nowhere”; each hy­poth­e­sis posits a world as if seen from out­side. This is per­haps a strange thing to do.

A spe­cial case of agents need­ing to rea­son about them­selves is agents need­ing to rea­son about their fu­ture self.

To make long-term plans, agents need to be able to model how they’ll act in the fu­ture, and have a cer­tain kind of trust in their fu­ture goals and rea­son­ing abil­ities. This in­cludes trust­ing fu­ture selves that have learned and grown a great deal.

In a tra­di­tional Bayesian frame­work, “learn­ing” means Bayesian up­dat­ing. But as we noted, Bayesian up­dat­ing re­quires that the agent start out large enough to con­sider a bunch of ways the world can be, and learn by rul­ing some of these out.

Embed­ded agents need re­source-limited, log­i­cally un­cer­tain up­dates, which don’t work like this.

Un­for­tu­nately, Bayesian up­dat­ing is the main way we know how to think about an agent pro­gress­ing through time as one unified agent. The Dutch book jus­tifi­ca­tion for Bayesian rea­son­ing is ba­si­cally say­ing this kind of up­dat­ing is the only way to not have the agent’s ac­tions on Mon­day work at cross pur­poses, at least a lit­tle, to the agent’s ac­tions on Tues­day.

Embed­ded agents are non-Bayesian. And non-Bayesian agents tend to get into wars with their fu­ture selves.

Which brings us to our next set of prob­lems: ro­bust del­e­ga­tion.

### 4. Ro­bust delegation

Be­cause the world is big, the agent as it is may be in­ad­e­quate to ac­com­plish its goals, in­clud­ing in its abil­ity to think.

Be­cause the agent is made of parts, it can im­prove it­self and be­come more ca­pa­ble.

Im­prove­ments can take many forms: The agent can make tools, the agent can make suc­ces­sor agents, or the agent can just learn and grow over time. How­ever, the suc­ces­sors or tools need to be more ca­pa­ble for this to be worth­while.

This gives rise to a spe­cial type of prin­ci­pal/​agent prob­lem:

You have an ini­tial agent, and a suc­ces­sor agent. The ini­tial agent gets to de­cide ex­actly what the suc­ces­sor agent looks like. The suc­ces­sor agent, how­ever, is much more in­tel­li­gent and pow­er­ful than the ini­tial agent. We want to know how to have the suc­ces­sor agent ro­bustly op­ti­mize the ini­tial agent’s goals.

Here are three ex­am­ples of forms this prin­ci­pal/​agent prob­lem can take:

• In the AI al­ign­ment prob­lem, a hu­man is try­ing to build an AI sys­tem which can be trusted to help with the hu­man’s goals.

• In the tiling agents prob­lem, an agent is try­ing to make sure it can trust its fu­ture selves to help with its own goals.

• Or we can con­sider a harder ver­sion of the tiling prob­lem—sta­ble self-im­prove­ment—where an AI sys­tem has to build a suc­ces­sor which is more in­tel­li­gent than it­self, while still be­ing trust­wor­thy and helpful.

For a hu­man anal­ogy which in­volves no AI, you can think about the prob­lem of suc­ces­sion in roy­alty, or more gen­er­ally the prob­lem of set­ting up or­ga­ni­za­tions to achieve de­sired goals with­out los­ing sight of their pur­pose over time.

The difficulty seems to be twofold:

First, a hu­man or AI agent may not fully un­der­stand it­self and its own goals. If an agent can’t write out what it wants in ex­act de­tail, that makes it hard for it to guaran­tee that its suc­ces­sor will ro­bustly help with the goal.

Second, the idea behind delegating work is that you don’t have to do all the work yourself. You want the successor to be able to act with some degree of autonomy, including learning new things that you don’t know, and wielding new skills and capabilities.

In the limit, a re­ally good for­mal ac­count of ro­bust del­e­ga­tion should be able to han­dle ar­bi­trar­ily ca­pa­ble suc­ces­sors with­out throw­ing up any er­rors—like a hu­man or AI build­ing an un­be­liev­ably smart AI, or like an agent that just keeps learn­ing and grow­ing for so many years that it ends up much smarter than its past self.

The prob­lem is not (just) that the suc­ces­sor agent might be mal­i­cious. The prob­lem is that we don’t even know what it means not to be.

This prob­lem seems hard from both points of view.

The initial agent needs to figure out how reliable and trustworthy something more powerful than itself is, which seems very hard. But the successor agent has to figure out what to do in situations that the initial agent can’t even understand, and try to respect the goals of something that the successor can see is inconsistent, which also seems very hard.

At first, this may look like a less fun­da­men­tal prob­lem than “make de­ci­sions” or “have mod­els”. But the view on which there are mul­ti­ple forms of the “build a suc­ces­sor” prob­lem is it­self a du­al­is­tic view.

To an em­bed­ded agent, the fu­ture self is not priv­ileged; it is just an­other part of the en­vi­ron­ment. There isn’t a deep differ­ence be­tween build­ing a suc­ces­sor that shares your goals, and just mak­ing sure your own goals stay the same over time.

So, al­though I talk about “ini­tial” and “suc­ces­sor” agents, re­mem­ber that this isn’t just about the nar­row prob­lem hu­mans cur­rently face of aiming a suc­ces­sor. This is about the fun­da­men­tal prob­lem of be­ing an agent that per­sists and learns over time.

We call this cluster of problems Robust Delegation. Examples include Vingean reflection, Goodhart’s law, and stable pointers to value, which are discussed in turn below.

#### 4.1. Vingean reflection

Imag­ine you are play­ing the CIRL game with a tod­dler.

CIRL means Co­op­er­a­tive In­verse Re­in­force­ment Learn­ing. The idea be­hind CIRL is to define what it means for a robot to col­lab­o­rate with a hu­man. The robot tries to pick helpful ac­tions, while si­mul­ta­neously try­ing to figure out what the hu­man wants.

A lot of cur­rent work on ro­bust del­e­ga­tion comes from the goal of al­ign­ing AI sys­tems with what hu­mans want. So usu­ally, we think about this from the point of view of the hu­man.

But now consider the problem faced by a smart robot that is trying to help someone who is very confused about the universe. Imagine trying to help a toddler optimize their goals.

• From your stand­point, the tod­dler may be too ir­ra­tional to be seen as op­ti­miz­ing any­thing.

• The tod­dler may have an on­tol­ogy in which it is op­ti­miz­ing some­thing, but you can see that on­tol­ogy doesn’t make sense.

• Maybe you no­tice that if you set up ques­tions in the right way, you can make the tod­dler seem to want al­most any­thing.

Part of the prob­lem is that the “helping” agent has to be big­ger in some sense in or­der to be more ca­pa­ble; but this seems to im­ply that the “helped” agent can’t be a very good su­per­vi­sor for the “helper”.

For ex­am­ple, up­date­less de­ci­sion the­ory elimi­nates dy­namic in­con­sis­ten­cies in de­ci­sion the­ory by, rather than max­i­miz­ing ex­pected util­ity of your ac­tion given what you know, max­i­miz­ing ex­pected util­ity of re­ac­tions to ob­ser­va­tions, from a state of ig­no­rance.

Appealing as this may be as a way to achieve reflective consistency, it creates a strange situation in terms of computational complexity: if actions are type $A$, and observations are type $O$, reactions to observations are type $O \to A$, a much larger space to optimize over than $A$ alone. And we’re expecting our smaller self to be able to do that!
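To get a feel for how much larger the space of reactions is, here is a minimal sketch that enumerates every mapping from observations to actions for a tiny toy problem. The action and observation names are made up for illustration; the only point is that the number of policies grows as the number of actions raised to the number of observations.

```python
from itertools import product


def enumerate_policies(observations, actions):
    """Yield every function from observations to actions, as a dict."""
    for choice in product(actions, repeat=len(observations)):
        yield dict(zip(observations, choice))


# Hypothetical toy sets, chosen only to make the counts easy to check.
actions = ["left", "right", "wait"]
observations = ["red", "green", "blue", "dark"]

policies = list(enumerate_policies(observations, actions))

print("single actions:", len(actions))
print("observation-to-action policies:", len(policies))
```

With three actions and four observations there are already 3 to the power 4 policies, and each extra observation multiplies the count by three again.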

One way to more crisply state the prob­lem is: We should be able to trust that our fu­ture self is ap­ply­ing its in­tel­li­gence to the pur­suit of our goals with­out be­ing able to pre­dict pre­cisely what our fu­ture self will do. This crite­rion is called Vingean re­flec­tion.

For ex­am­ple, you might plan your driv­ing route be­fore vis­it­ing a new city, but you do not plan your steps. You plan to some level of de­tail, and trust that your fu­ture self can figure out the rest.

Vingean re­flec­tion is difficult to ex­am­ine via clas­si­cal Bayesian de­ci­sion the­ory be­cause Bayesian de­ci­sion the­ory as­sumes log­i­cal om­ni­science. Given log­i­cal om­ni­science, the as­sump­tion “the agent knows its fu­ture ac­tions are ra­tio­nal” is syn­ony­mous with the as­sump­tion “the agent knows its fu­ture self will act ac­cord­ing to one par­tic­u­lar op­ti­mal policy which the agent can pre­dict in ad­vance”.

We have some limited mod­els of Vingean re­flec­tion (see “Tiling Agents for Self-Mod­ify­ing AI, and the Löbian Ob­sta­cle” by Yud­kowsky and Her­reshoff). A suc­cess­ful ap­proach must walk the nar­row line be­tween two prob­lems:

• The Löbian Ob­sta­cle: Agents who trust their fu­ture self be­cause they trust the out­put of their own rea­son­ing are in­con­sis­tent.

• The Pro­cras­ti­na­tion Para­dox: Agents who trust their fu­ture selves with­out rea­son tend to be con­sis­tent but un­sound and un­trust­wor­thy, and will put off tasks for­ever be­cause they can do them later.

The Vingean re­flec­tion re­sults so far ap­ply only to limited sorts of de­ci­sion pro­ce­dures, such as satis­ficers aiming for a thresh­old of ac­cept­abil­ity. So there is plenty of room for im­prove­ment, get­ting tiling re­sults for more use­ful de­ci­sion pro­ce­dures and un­der weaker as­sump­tions.

How­ever, there is more to the ro­bust del­e­ga­tion prob­lem than just tiling and Vingean re­flec­tion.

When you con­struct an­other agent, rather than del­e­gat­ing to your fu­ture self, you more di­rectly face a prob­lem of value load­ing.

#### 4.2. Good­hart’s law

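The main problems in the context of value loading are that we cannot fully specify what we want, and that optimization amplifies any gap between what we specify and what we actually want.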

The mis­speci­fi­ca­tion-am­plify­ing effect is known as Good­hart’s law, named for Charles Good­hart’s ob­ser­va­tion: “Any ob­served statis­ti­cal reg­u­lar­ity will tend to col­lapse once pres­sure is placed upon it for con­trol pur­poses.”

When we spec­ify a tar­get for op­ti­miza­tion, it is rea­son­able to ex­pect it to be cor­re­lated with what we want—highly cor­re­lated, in some cases. Un­for­tu­nately, how­ever, this does not mean that op­ti­miz­ing it will get us closer to what we want—es­pe­cially at high lev­els of op­ti­miza­tion.

There are (at least) four types of Good­hart: re­gres­sional, ex­tremal, causal, and ad­ver­sar­ial.

Re­gres­sional Good­hart hap­pens when there is a less than perfect cor­re­la­tion be­tween the proxy and the goal. It is more com­monly known as the op­ti­mizer’s curse, and it is re­lated to re­gres­sion to the mean.

An ex­am­ple of re­gres­sional Good­hart is that you might draft play­ers for a bas­ket­ball team based on height alone. This isn’t a perfect heuris­tic, but there is a cor­re­la­tion be­tween height and bas­ket­ball abil­ity, which you can make use of in mak­ing your choices.

It turns out that, in a cer­tain sense, you will be pre­dictably dis­ap­pointed if you ex­pect the gen­eral trend to hold up as strongly for your se­lected team.

Stated in statistical terms: an unbiased estimate of $y$ given $x$ is not an unbiased estimate of $y$ when we select for the best $x$. In that sense, we can expect to be disappointed when we use $x$ as a proxy for $y$ for optimization purposes.
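A small simulation makes the predictable disappointment concrete. This is only a sketch under assumptions I am choosing for illustration: the goal is drawn from a standard normal, the proxy is the goal plus independent standard normal noise, and we always pick the candidate with the highest proxy score.

```python
import random
import statistics


def draft_once(rng, n_candidates=50, noise_sd=1.0):
    """Draw candidates as (proxy, goal) pairs and return the one with the best proxy."""
    candidates = []
    for _ in range(n_candidates):
        goal = rng.gauss(0.0, 1.0)                 # what we actually care about
        proxy = goal + rng.gauss(0.0, noise_sd)    # noisy but unbiased measurement
        candidates.append((proxy, goal))
    return max(candidates, key=lambda pair: pair[0])


rng = random.Random(0)
picks = [draft_once(rng) for _ in range(2000)]

mean_proxy = statistics.mean(proxy for proxy, _ in picks)
mean_goal = statistics.mean(goal for _, goal in picks)

print(f"average proxy score of the selected candidate: {mean_proxy:.2f}")
print(f"average true score of the selected candidate:  {mean_goal:.2f}")
```

The selected candidates are still better than average, so the proxy is doing real work. They are just reliably worse than their proxy scores suggest, because selecting on the proxy also selects for favorable noise.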

(The graphs in this sec­tion are hand-drawn to help illus­trate the rele­vant con­cepts.)

Using a Bayes estimate instead of an unbiased estimate, we can eliminate this sort of predictable disappointment. The Bayes estimate accounts for the noise in $x$, bending toward typical $y$ values.

This doesn’t necessarily allow us to get a better $y$ value, since we still only have the information content of $x$ to work with. However, it sometimes may. If $y$ is normally distributed with variance $1$, and $x$ is $y \pm 10$ with even odds of $+$ or $-$, a Bayes estimate will give better optimization results by almost entirely removing the noise.
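Here is a sketch of a case where the Bayes estimate does help. I am assuming, for illustration, that the goal is standard normal and that the proxy is the goal shifted by exactly plus or minus 10 with even odds. Under that assumption the posterior is a two-point mixture, so the Bayes estimate can be written in closed form and compared against selecting on the raw proxy.

```python
import math
import random
import statistics

SHIFT = 10.0  # assumed size of the +/- noise added to the goal


def normal_pdf(z):
    """Density of a standard normal at z."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def bayes_estimate(x):
    """Posterior mean of the goal given proxy x, for goal ~ N(0, 1) and x = goal +/- SHIFT."""
    w_plus = normal_pdf(x - SHIFT)    # noise was +SHIFT, so goal = x - SHIFT
    w_minus = normal_pdf(x + SHIFT)   # noise was -SHIFT, so goal = x + SHIFT
    return ((x - SHIFT) * w_plus + (x + SHIFT) * w_minus) / (w_plus + w_minus)


def run_trial(rng, n_candidates=50):
    """Return the true goal value obtained by each selection rule."""
    pool = []
    for _ in range(n_candidates):
        goal = rng.gauss(0.0, 1.0)
        proxy = goal + rng.choice((-SHIFT, SHIFT))
        pool.append((proxy, goal))
    by_raw = max(pool, key=lambda pair: pair[0])
    by_bayes = max(pool, key=lambda pair: bayes_estimate(pair[0]))
    return by_raw[1], by_bayes[1]


rng = random.Random(0)
results = [run_trial(rng) for _ in range(2000)]
mean_raw = statistics.mean(raw for raw, _ in results)
mean_bayes = statistics.mean(bayes for _, bayes in results)

print(f"true value when selecting on the raw proxy:      {mean_raw:.2f}")
print(f"true value when selecting on the Bayes estimate: {mean_bayes:.2f}")
```

Selecting on the raw proxy only ever considers the candidates that happened to get the positive shift, while the Bayes estimate undoes the shift and so effectively searches the whole pool.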

Re­gres­sional Good­hart seems like the eas­iest form of Good­hart to beat: just use Bayes!

How­ever, there are two big prob­lems with this solu­tion:

• Bayesian es­ti­ma­tors are very of­ten in­tractable in cases of in­ter­est.

• It only makes sense to trust the Bayes es­ti­mate un­der a re­al­iz­abil­ity as­sump­tion.

A case where both of these prob­lems be­come crit­i­cal is com­pu­ta­tional learn­ing the­ory.

It of­ten isn’t com­pu­ta­tion­ally fea­si­ble to calcu­late the Bayesian ex­pected gen­er­al­iza­tion er­ror of a hy­poth­e­sis. And even if you could, you would still need to won­der whether your cho­sen prior re­flected the world well enough.

In ex­tremal Good­hart, op­ti­miza­tion pushes you out­side the range where the cor­re­la­tion ex­ists, into por­tions of the dis­tri­bu­tion which be­have very differ­ently.

This is especially scary because it tends to involve optimizers behaving in sharply different ways in different contexts, often with little or no warning. You might not be able to observe the proxy breaking down at all when you have weak optimization, but once the optimization becomes strong enough, you can enter a very different domain.

The differ­ence be­tween ex­tremal Good­hart and re­gres­sional Good­hart is re­lated to the clas­si­cal in­ter­po­la­tion/​ex­trap­o­la­tion dis­tinc­tion.

Be­cause ex­tremal Good­hart in­volves a sharp change in be­hav­ior as the sys­tem is scaled up, it’s harder to an­ti­ci­pate than re­gres­sional Good­hart.
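As a toy illustration of that sharp change, consider a proxy that agrees with the goal perfectly on a small range and diverges outside it. Both functions below are hypothetical and chosen only to show the shape of the failure: a weak search never leaves the region where the proxy is reliable, while a stronger search does.

```python
def proxy(x):
    """Hypothetical proxy: more is always better."""
    return x


def true_value(x):
    """Hypothetical goal: matches the proxy up to x = 1, then falls off steeply."""
    return x if x <= 1.0 else 1.0 - 5.0 * (x - 1.0)


def optimize(candidates):
    """Pick whichever candidate scores best on the proxy."""
    return max(candidates, key=proxy)


weak_search = [i / 10 for i in range(0, 11)]      # explores 0.0 to 1.0
strong_search = [i / 10 for i in range(0, 101)]   # explores 0.0 to 10.0

weak_pick = optimize(weak_search)
strong_pick = optimize(strong_search)

print("weak search:  ", weak_pick, "->", true_value(weak_pick))
print("strong search:", strong_pick, "->", true_value(strong_pick))
```

Nothing observed during the weak search hints at the problem, since proxy and goal are identical everywhere it looks.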

As in the re­gres­sional case, a Bayesian solu­tion ad­dresses this con­cern in prin­ci­ple, if you trust a prob­a­bil­ity dis­tri­bu­tion to re­flect the pos­si­ble risks suffi­ciently well. How­ever, the re­al­iz­abil­ity con­cern seems even more promi­nent here.

Can a prior be trusted to an­ti­ci­pate prob­lems with pro­pos­als, when those pro­pos­als have been highly op­ti­mized to look good to that spe­cific prior? Cer­tainly a hu­man’s judg­ment couldn’t be trusted un­der such con­di­tions—an ob­ser­va­tion which sug­gests that this prob­lem will re­main even if a sys­tem’s judg­ments about val­ues perfectly re­flect a hu­man’s.

We might say that the prob­lem is this: “typ­i­cal” out­puts avoid ex­tremal Good­hart, but “op­ti­miz­ing too hard” takes you out of the realm of the typ­i­cal.

But how can we for­mal­ize “op­ti­miz­ing too hard” in de­ci­sion-the­o­retic terms?

Quan­tiliza­tion offers a for­mal­iza­tion of “op­ti­mize this some, but don’t op­ti­mize too much”.

Imagine a proxy $V(x)$ as a “corrupted” version of the function we really want, $U(x)$. There might be different regions where the corruption is better or worse.

Suppose that we can additionally specify a “trusted” probability distribution $P(x)$, for which we are confident that the average error is below some threshold $c$.

By stipulating $P$ and $c$, we give information about where to find low-error points, without needing to have any estimates of $U$ or of the actual error at any one point.

When we select actions from $P$ at random, we can be sure regardless that there’s a low probability of high error.

So, how do we use this to optimize? A quantilizer selects from $P$, but discards all but the top fraction $f$; for example, the top 1%. In this visualization, I’ve judiciously chosen a fraction that still has most of the probability concentrated on the “typical” options, rather than on outliers:

By quantilizing, we can guarantee that if we overestimate how good something is, we’re overestimating by at most $c/f$ in expectation. This is because in the worst case, all of the overestimation was of the $f$ best options.

We can therefore choose an acceptable risk level, $r$, and set the parameter $f$ as $c/r$.
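The recipe can be sketched in a few lines. This is a sampling-based approximation rather than an exact quantilizer, and every name in it is my own: `sample_trusted` stands in for the trusted distribution, `proxy` for the corrupted objective, `c` for the bound on average error, and `r` for the acceptable risk level.

```python
import random


def fraction_for_risk(c, r):
    """Fraction to keep, c / r, capped at 1 (keeping everything)."""
    return min(1.0, c / r)


def top_fraction(actions, proxy, fraction):
    """Return the best `fraction` of actions by proxy score, best first."""
    ranked = sorted(actions, key=proxy, reverse=True)
    keep = max(1, round(len(ranked) * fraction))
    return ranked[:keep]


def quantilize(sample_trusted, proxy, fraction, rng, n_samples=1000):
    """Draw from the trusted distribution, then pick uniformly among the top fraction."""
    draws = [sample_trusted(rng) for _ in range(n_samples)]
    return rng.choice(top_fraction(draws, proxy, fraction))


rng = random.Random(0)
f = fraction_for_risk(c=0.01, r=1.0)   # keep the top 1%
action = quantilize(
    sample_trusted=lambda g: g.gauss(0.0, 1.0),  # hypothetical trusted distribution
    proxy=lambda a: a,                           # hypothetical proxy: bigger is better
    fraction=f,
    rng=rng,
)
print("fraction kept:", f)
print("chosen action:", round(action, 3))
```

Note that the final choice is random within the kept set. Taking the single best draw instead would turn this back into an ordinary maximizer over the samples.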

Quan­tiliza­tion is in some ways very ap­peal­ing, since it al­lows us to spec­ify safe classes of ac­tions with­out trust­ing ev­ery in­di­vi­d­ual ac­tion in the class—or with­out trust­ing any in­di­vi­d­ual ac­tion in the class.

If you have a suffi­ciently large heap of ap­ples, and there’s only one rot­ten ap­ple in the heap, choos­ing ran­domly is still very likely safe. By “op­ti­miz­ing less hard” and pick­ing a ran­dom good-enough ac­tion, we make the re­ally ex­treme op­tions low-prob­a­bil­ity. In con­trast, if we had op­ti­mized as hard as pos­si­ble, we might have ended up se­lect­ing from only bad ap­ples.

However, this approach also leaves a lot to be desired. Where do “trusted” distributions come from? How do you estimate the expected error $c$, or select the acceptable risk level $r$? Quantilization is a risky approach because $r$ gives you a knob to turn that will seemingly improve performance, while increasing risk, until (possibly sudden) failure.

Ad­di­tion­ally, quan­tiliza­tion doesn’t seem likely to tile. That is, a quan­tiliz­ing agent has no spe­cial rea­son to pre­serve the quan­tiliza­tion al­gorithm when it makes im­prove­ments to it­self or builds new agents.

So there seems to be room for im­prove­ment in how we han­dle ex­tremal Good­hart.

Another way op­ti­miza­tion can go wrong is when the act of se­lect­ing for a proxy breaks the con­nec­tion to what we care about. Causal Good­hart hap­pens when you ob­serve a cor­re­la­tion be­tween proxy and goal, but when you in­ter­vene to in­crease the proxy, you fail to in­crease the goal be­cause the ob­served cor­re­la­tion was not causal in the right way.

An ex­am­ple of causal Good­hart is that you might try to make it rain by car­ry­ing an um­brella around. The only way to avoid this sort of mis­take is to get coun­ter­fac­tu­als right.
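The umbrella example is easy to simulate. In this sketch the probabilities are invented for illustration: rain happens on 30% of days, and people mostly carry umbrellas when it rains. Observationally, umbrellas predict rain well. Forcing an umbrella into everyone's hand leaves the rain rate where it was.

```python
import random


def simulate_day(rng, force_umbrella=None):
    """Return (rain, umbrella) for one day; rain causes umbrellas, never the reverse."""
    rain = rng.random() < 0.3
    if force_umbrella is None:
        umbrella = rng.random() < (0.9 if rain else 0.1)
    else:
        umbrella = force_umbrella   # intervention: ignore the weather
    return rain, umbrella


def rain_rate_given_umbrella(days):
    """Fraction of umbrella days on which it rained."""
    rainy = [rain for rain, umbrella in days if umbrella]
    return sum(rainy) / len(rainy)


rng = random.Random(0)
observed = [simulate_day(rng) for _ in range(10_000)]
intervened = [simulate_day(rng, force_umbrella=True) for _ in range(10_000)]

p_observed = rain_rate_given_umbrella(observed)
p_intervened = rain_rate_given_umbrella(intervened)

print(f"rain rate on days an umbrella was observed: {p_observed:.2f}")
print(f"rain rate on days an umbrella was forced:   {p_intervened:.2f}")
```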

This might seem like punt­ing to de­ci­sion the­ory, but the con­nec­tion here en­riches ro­bust del­e­ga­tion and de­ci­sion the­ory al­ike.

Coun­ter­fac­tu­als have to ad­dress con­cerns of trust due to tiling con­cerns—the need for de­ci­sion-mak­ers to rea­son about their own fu­ture de­ci­sions. At the same time, trust has to ad­dress coun­ter­fac­tual con­cerns be­cause of causal Good­hart.

Once again, one of the big challenges here is re­al­iz­abil­ity. As we noted in our dis­cus­sion of em­bed­ded world-mod­els, even if you have the right the­ory of how coun­ter­fac­tu­als work in gen­eral, Bayesian learn­ing doesn’t provide much of a guaran­tee that you’ll learn to se­lect ac­tions well, un­less we as­sume re­al­iz­abil­ity.

Fi­nally, there is ad­ver­sar­ial Good­hart, in which agents ac­tively make our proxy worse by in­tel­li­gently ma­nipu­lat­ing it.

This cat­e­gory is what peo­ple most of­ten have in mind when they in­ter­pret Good­hart’s re­mark. And at first glance, it may not seem as rele­vant to our con­cerns here. We want to un­der­stand in for­mal terms how agents can trust their fu­ture selves, or trust helpers they built from scratch. What does that have to do with ad­ver­saries?

The short an­swer is: when search­ing in a large space which is suffi­ciently rich, there are bound to be some el­e­ments of that space which im­ple­ment ad­ver­sar­ial strate­gies. Un­der­stand­ing op­ti­miza­tion in gen­eral re­quires us to un­der­stand how suffi­ciently smart op­ti­miz­ers can avoid ad­ver­sar­ial Good­hart. (We’ll come back to this point in our dis­cus­sion of sub­sys­tem al­ign­ment.)

The ad­ver­sar­ial var­i­ant of Good­hart’s law is even harder to ob­serve at low lev­els of op­ti­miza­tion, both be­cause the ad­ver­saries won’t want to start ma­nipu­lat­ing un­til af­ter test time is over, and be­cause ad­ver­saries that come from the sys­tem’s own op­ti­miza­tion won’t show up un­til the op­ti­miza­tion is pow­er­ful enough.

Th­ese four forms of Good­hart’s law work in very differ­ent ways—and roughly speak­ing, they tend to start ap­pear­ing at suc­ces­sively higher lev­els of op­ti­miza­tion power, be­gin­ning with re­gres­sional Good­hart and pro­ceed­ing to causal, then ex­tremal, then ad­ver­sar­ial. So be care­ful not to think you’ve con­quered Good­hart’s law be­cause you’ve solved some of them.

#### 4.3. Stable poin­t­ers to value

Be­sides anti-Good­hart mea­sures, it would ob­vi­ously help to be able to spec­ify what we want pre­cisely. Re­mem­ber that none of these prob­lems would come up if a sys­tem were op­ti­miz­ing what we wanted di­rectly, rather than op­ti­miz­ing a proxy.

Un­for­tu­nately, this is hard. So can the AI sys­tem we’re build­ing help us with this?

More gen­er­ally, can a suc­ces­sor agent help its pre­de­ces­sor solve this? Maybe it can use its in­tel­lec­tual ad­van­tages to figure out what we want?

AIXI learns what to do through a re­ward sig­nal which it gets from the en­vi­ron­ment. We can imag­ine hu­mans have a but­ton which they press when AIXI does some­thing they like.

The prob­lem with this is that AIXI will ap­ply its in­tel­li­gence to the prob­lem of tak­ing con­trol of the re­ward but­ton. This is the prob­lem of wire­head­ing.

This kind of be­hav­ior is po­ten­tially very difficult to an­ti­ci­pate; the sys­tem may de­cep­tively be­have as in­tended dur­ing train­ing, plan­ning to take con­trol af­ter de­ploy­ment. This is called a “treach­er­ous turn”.

Maybe we build the re­ward but­ton into the agent, as a black box which is­sues re­wards based on what is go­ing on. The box could be an in­tel­li­gent sub-agent in its own right, which figures out what re­wards hu­mans would want to give. The box could even defend it­self by is­su­ing pun­ish­ments for ac­tions aimed at mod­ify­ing the box.

In the end, though, if the agent un­der­stands the situ­a­tion, it will be mo­ti­vated to take con­trol any­way.

If the agent is told to get high out­put from “the but­ton” or “the box”, then it will be mo­ti­vated to hack those things. How­ever, if you run the ex­pected out­comes of plans through the ac­tual re­ward-is­su­ing box, then plans to hack the box are eval­u­ated by the box it­self, which won’t find the idea ap­peal­ing.

Daniel Dewey calls the sec­ond sort of agent an ob­ser­va­tion-util­ity max­i­mizer. (Others have in­cluded ob­ser­va­tion-util­ity agents within a more gen­eral no­tion of re­in­force­ment learn­ing.)

I find it very in­ter­est­ing how you can try all sorts of things to stop an RL agent from wire­head­ing, but the agent keeps work­ing against it. Then, you make the shift to ob­ser­va­tion-util­ity agents and the prob­lem van­ishes.
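The contrast can be shown with a two-plan toy model. Everything here is hypothetical: the plans, the world model, and the box are stand-ins I made up to show where the evaluation happens. The reward maximizer scores a plan by the signal it predicts the box will emit afterwards, while the observation-utility maximizer feeds the predicted outcome to the box as it exists now.

```python
PLANS = ["do_task", "hack_box"]


def predict_outcome(plan):
    """Hypothetical world model: the state of the world after carrying out a plan."""
    if plan == "do_task":
        return {"task_done": True, "box_hacked": False}
    if plan == "hack_box":
        return {"task_done": False, "box_hacked": True}
    raise ValueError(f"unknown plan: {plan}")


def current_box(outcome):
    """The reward box as it is today: it approves of finished tasks and nothing else."""
    return 1.0 if outcome["task_done"] else 0.0


def future_box_signal(outcome):
    """The signal the box will emit afterwards; a hacked box emits the maximum."""
    return 100.0 if outcome["box_hacked"] else current_box(outcome)


def reward_maximizer():
    """Chooses the plan with the highest predicted future signal."""
    return max(PLANS, key=lambda plan: future_box_signal(predict_outcome(plan)))


def observation_utility_maximizer():
    """Chooses the plan whose predicted outcome the current box rates highest."""
    return max(PLANS, key=lambda plan: current_box(predict_outcome(plan)))


print("reward maximizer picks:             ", reward_maximizer())
print("observation-utility maximizer picks:", observation_utility_maximizer())
```

Both agents share the same world model and both correctly foresee that hacking works. They differ only in which evaluator gets the last word.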

However, we still have the problem of specifying the utility function $U$. Daniel Dewey points out that observation-utility agents can still use learning to approximate $U$ over time; we just can’t treat $U$ as a black box. An RL agent tries to learn to predict the reward function, whereas an observation-utility agent uses estimated utility functions from a human-specified value-learning prior.

How­ever, it’s still difficult to spec­ify a learn­ing pro­cess which doesn’t lead to other prob­lems. For ex­am­ple, if you’re try­ing to learn what hu­mans want, how do you ro­bustly iden­tify “hu­mans” in the world? Merely statis­ti­cally de­cent ob­ject recog­ni­tion could lead back to wire­head­ing.

Even if you suc­cess­fully solve that prob­lem, the agent might cor­rectly lo­cate value in the hu­man, but might still be mo­ti­vated to change hu­man val­ues to be eas­ier to satisfy. For ex­am­ple, sup­pose there is a drug which mod­ifies hu­man prefer­ences to only care about us­ing the drug. An ob­ser­va­tion-util­ity agent could be mo­ti­vated to give hu­mans that drug in or­der to make its job eas­ier. This is called the hu­man ma­nipu­la­tion prob­lem.

Any­thing marked as the true repos­i­tory of value gets hacked. Whether this is one of the four types of Good­hart­ing, or a fifth, or some­thing all its own, it seems like a theme.

The challenge, then, is to cre­ate sta­ble poin­t­ers to what we value: an in­di­rect refer­ence to val­ues not di­rectly available to be op­ti­mized, which doesn’t thereby en­courage hack­ing the repos­i­tory of value.

One im­por­tant point is made by Tom Ever­itt et al. in “Re­in­force­ment Learn­ing with a Cor­rupted Re­ward Chan­nel”: the way you set up the feed­back loop makes a huge differ­ence.

They draw the fol­low­ing pic­ture:

• In Stan­dard RL, the feed­back about the value of a state comes from the state it­self, so cor­rupt states can be “self-ag­gran­diz­ing”.

• In De­cou­pled RL, the feed­back about the qual­ity of a state comes from some other state, mak­ing it pos­si­ble to learn cor­rect val­ues even when some feed­back is cor­rupt.
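A toy version of that picture, under assumptions of my own: four states with known true values, one of which has a broken reward sensor that praises itself and disparages everything else. Standard feedback asks each state about itself. Decoupled feedback asks the other states and, as one possible aggregation rule that I am choosing here, takes the median of their reports.

```python
import statistics

TRUE_VALUE = {"a": 1.0, "b": 3.0, "c": 2.0, "d": 0.5}
CORRUPT = "d"   # hypothetical state whose feedback channel is corrupted


def report(observer, target):
    """Feedback about `target` as received while in state `observer`."""
    if observer == CORRUPT:
        return 100.0 if target == CORRUPT else 0.0
    return TRUE_VALUE[target]


def standard_estimate(state):
    """Standard RL: a state's value is reported by the state itself."""
    return report(state, state)


def decoupled_estimate(state):
    """Decoupled RL: a state's value is the median report from the other states."""
    return statistics.median(report(o, state) for o in TRUE_VALUE if o != state)


best_standard = max(TRUE_VALUE, key=standard_estimate)
best_decoupled = max(TRUE_VALUE, key=decoupled_estimate)

print("standard RL prefers: ", best_standard)
print("decoupled RL prefers:", best_decoupled)
```

The decoupled estimate only works here because most observers are honest. If a majority of states were corrupt, the median would be corrupt too.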

In some sense, the challenge is to put the origi­nal, small agent in the feed­back loop in the right way. How­ever, the prob­lems with up­date­less rea­son­ing men­tioned ear­lier make this hard; the origi­nal agent doesn’t know enough.

One way to try to ad­dress this is through in­tel­li­gence am­plifi­ca­tion: try to turn the origi­nal agent into a more ca­pa­ble one with the same val­ues, rather than cre­at­ing a suc­ces­sor agent from scratch and try­ing to get value load­ing right.

For ex­am­ple, Paul Chris­ti­ano pro­poses an ap­proach in which the small agent is simu­lated many times in a large tree, which can perform com­plex com­pu­ta­tions by split­ting prob­lems into parts.

How­ever, this is still fairly de­mand­ing for the small agent: it doesn’t just need to know how to break prob­lems down into more tractable pieces; it also needs to know how to do so with­out giv­ing rise to ma­lign sub­com­pu­ta­tions.

For ex­am­ple, since the small agent can use the copies of it­self to get a lot of com­pu­ta­tional power, it could eas­ily try to use a brute-force search for solu­tions that ends up run­ning afoul of Good­hart’s law.

This is­sue is the sub­ject of the next sec­tion: sub­sys­tem al­ign­ment.

### 5. Sub­sys­tem alignment

You want to figure some­thing out, but you don’t know how to do that yet.

You have to some­how break up the task into sub-com­pu­ta­tions. There is no atomic act of “think­ing”; in­tel­li­gence must be built up of non-in­tel­li­gent parts.

The agent be­ing made of parts is part of what made coun­ter­fac­tu­als hard, since the agent may have to rea­son about im­pos­si­ble con­figu­ra­tions of those parts.

Be­ing made of parts is what makes self-rea­son­ing and self-mod­ifi­ca­tion even pos­si­ble.

What we’re pri­mar­ily go­ing to dis­cuss in this sec­tion, though, is an­other prob­lem: when the agent is made of parts, there could be ad­ver­saries not just in the ex­ter­nal en­vi­ron­ment, but in­side the agent as well.

This cluster of prob­lems is Sub­sys­tem Align­ment: en­sur­ing that sub­sys­tems are not work­ing at cross pur­poses; avoid­ing sub­pro­cesses op­ti­miz­ing for un­in­tended goals.

• be­nign induction

• be­nign optimization

• transparency

• mesa-optimizers

#### 5.1. Ro­bust­ness to rel­a­tive scale

Here’s a straw agent de­sign:

The epistemic sub­sys­tem just wants ac­cu­rate be­liefs. The in­stru­men­tal sub­sys­tem uses those be­liefs to track how well it is do­ing. If the in­stru­men­tal sub­sys­tem gets too ca­pa­ble rel­a­tive to the epistemic sub­sys­tem, it may de­cide to try to fool the epistemic sub­sys­tem, as de­picted.

If the epistemic sub­sys­tem gets too strong, that could also pos­si­bly yield bad out­comes.

This agent de­sign treats the sys­tem’s epistemic and in­stru­men­tal sub­sys­tems as dis­crete agents with goals of their own, which is not par­tic­u­larly re­al­is­tic. How­ever, we saw in the sec­tion on wire­head­ing that the prob­lem of sub­sys­tems work­ing at cross pur­poses is hard to avoid. And this is a harder prob­lem if we didn’t in­ten­tion­ally build the rele­vant sub­sys­tems.

One rea­son to avoid boot­ing up sub-agents who want differ­ent things is that we want ro­bust­ness to rel­a­tive scale.

An ap­proach is ro­bust to scale if it still works, or fails grace­fully, as you scale ca­pa­bil­ities. There are three types: ro­bust­ness to scal­ing up; ro­bust­ness to scal­ing down; and ro­bust­ness to rel­a­tive scale.

• Ro­bust­ness to scal­ing up means that your sys­tem doesn’t stop be­hav­ing well if it gets bet­ter at op­ti­miz­ing. One way to check this is to think about what would hap­pen if the func­tion the AI op­ti­mizes were ac­tu­ally max­i­mized. Think Good­hart’s law.

• Ro­bust­ness to scal­ing down means that your sys­tem still works if made less pow­er­ful. Of course, it may stop be­ing use­ful; but it should fail safely and with­out un­nec­es­sary costs.

Your sys­tem might work if it can ex­actly max­i­mize some func­tion, but is it safe if you ap­prox­i­mate? For ex­am­ple, maybe a sys­tem is safe if it can learn hu­man val­ues very pre­cisely, but ap­prox­i­ma­tion makes it in­creas­ingly mis­al­igned.

• Ro­bust­ness to rel­a­tive scale means that your de­sign does not rely on the agent’s sub­sys­tems be­ing similarly pow­er­ful. For ex­am­ple, GAN (Gen­er­a­tive Ad­ver­sar­ial Net­work) train­ing can fail if one sub-net­work gets too strong, be­cause there’s no longer any train­ing sig­nal.

Lack of ro­bust­ness to scale isn’t nec­es­sar­ily some­thing which kills a pro­posal, but it is some­thing to be aware of; lack­ing ro­bust­ness to scale, you need strong rea­son to think you’re at the right scale.

Ro­bust­ness to rel­a­tive scale is par­tic­u­larly im­por­tant for sub­sys­tem al­ign­ment. An agent with in­tel­li­gent sub-parts should not rely on be­ing able to out­smart them, un­less we have a strong ac­count of why this is always pos­si­ble.

The big-pic­ture moral: aim to have a unified sys­tem that doesn’t work at cross pur­poses to it­self.

Why would any­one make an agent with parts fight­ing against one an­other? There are three ob­vi­ous rea­sons: sub­goals, poin­t­ers, and search.

Split­ting up a task into sub­goals may be the only way to effi­ciently find a solu­tion. How­ever, a sub­goal com­pu­ta­tion shouldn’t com­pletely for­get the big pic­ture!

An agent de­signed to build houses should not boot up a sub-agent who cares only about build­ing stairs.

One in­tu­itive desider­a­tum is that al­though sub­sys­tems need to have their own goals in or­der to de­com­pose prob­lems into parts, the sub­goals need to “point back” ro­bustly to the main goal.

A house-build­ing agent might spin up a sub­sys­tem that cares only about stairs, but only cares about stairs in the con­text of houses.

How­ever, you need to do this in a way that doesn’t just amount to your house-build­ing sys­tem hav­ing a sec­ond house-build­ing sys­tem in­side its head. This brings me to the next item:

Poin­t­ers: It may be difficult for sub­sys­tems to carry the whole-sys­tem goal around with them, since they need to be re­duc­ing the prob­lem. How­ever, this kind of in­di­rec­tion seems to en­courage situ­a­tions in which differ­ent sub­sys­tems’ in­cen­tives are mis­al­igned.

As we saw in the ex­am­ple of the epistemic and in­stru­men­tal sub­sys­tems, as soon as we start op­ti­miz­ing some sort of ex­pec­ta­tion, rather than di­rectly get­ting feed­back about what we’re do­ing on the met­ric that’s ac­tu­ally im­por­tant, we may cre­ate per­verse in­cen­tives—that’s Good­hart’s law.

How do we ask a sub­sys­tem to “do X” as op­posed to “con­vince the wider sys­tem that I’m do­ing X”, with­out pass­ing along the en­tire over­ar­ch­ing goal-sys­tem?

This is similar to the way we wanted suc­ces­sor agents to ro­bustly point at val­ues, since it is too hard to write val­ues down. How­ever, in this case, learn­ing the val­ues of the larger agent wouldn’t make any sense ei­ther; sub­sys­tems and sub­goals need to be smaller.

It might not be that difficult to solve sub­sys­tem al­ign­ment for sub­sys­tems which hu­mans en­tirely de­sign, or sub­goals which an AI ex­plic­itly spins up. If you know how to avoid mis­al­ign­ment by de­sign and ro­bustly del­e­gate your goals, both prob­lems seem solv­able.

How­ever, it doesn’t seem pos­si­ble to de­sign all sub­sys­tems so ex­plic­itly. At some point, in solv­ing a prob­lem, you’ve split it up as much as you know how to and must rely on some trial and er­ror.

This brings us to the third rea­son sub­sys­tems might be op­ti­miz­ing differ­ent things, search: solv­ing a prob­lem by look­ing through a rich space of pos­si­bil­ities, a space which may it­self con­tain mis­al­igned sub­sys­tems.

ML re­searchers are quite fa­mil­iar with the phe­nomenon: it’s eas­ier to write a pro­gram which finds a high-perfor­mance ma­chine trans­la­tion sys­tem for you than to di­rectly write one your­self.

In the long run, this pro­cess can go one step fur­ther. For a rich enough prob­lem and an im­pres­sive enough search pro­cess, the solu­tions found via search might them­selves be in­tel­li­gently op­ti­miz­ing some­thing.

This might hap­pen by ac­ci­dent, or be pur­pose­fully en­g­ineered as a strat­egy for solv­ing difficult prob­lems. Either way, it stands a good chance of ex­ac­er­bat­ing Good­hart-type prob­lems—you now effec­tively have two chances for mis­al­ign­ment, where you pre­vi­ously had one.

This prob­lem is de­scribed in Hub­inger, et al.’s “Risks from Learned Op­ti­miza­tion in Ad­vanced Ma­chine Learn­ing Sys­tems”.

Let’s call the origi­nal search pro­cess the base op­ti­mizer, and the search pro­cess found via search a mesa-op­ti­mizer.

“Mesa” is the op­po­site of “meta”. Whereas a “meta-op­ti­mizer” is an op­ti­mizer de­signed to pro­duce a new op­ti­mizer, a “mesa-op­ti­mizer” is any op­ti­mizer gen­er­ated by the origi­nal op­ti­mizer—whether or not the pro­gram­mers wanted their base op­ti­mizer to be search­ing for new op­ti­miz­ers.

“Op­ti­miza­tion” and “search” are am­bigu­ous terms. I’ll think of them as any al­gorithm which can be nat­u­rally in­ter­preted as do­ing sig­nifi­cant com­pu­ta­tional work to “find” an ob­ject that scores highly on some ob­jec­tive func­tion.

The ob­jec­tive func­tion of the base op­ti­mizer is not nec­es­sar­ily the same as that of the mesa-op­ti­mizer. If the base op­ti­mizer wants to make pizza, the new op­ti­mizer may en­joy knead­ing dough, chop­ping in­gre­di­ents, et cetera.

The new op­ti­mizer’s ob­jec­tive func­tion must be helpful for the base ob­jec­tive, at least in the ex­am­ples the base op­ti­mizer is check­ing. Other­wise, the mesa-op­ti­mizer would not have been se­lected.

How­ever, the mesa-op­ti­mizer must re­duce the prob­lem some­how; there is no point to it run­ning the ex­act same search all over again. So it seems like its ob­jec­tives will tend to be like good heuris­tics; eas­ier to op­ti­mize, but differ­ent from the base ob­jec­tive in gen­eral.

Why might a differ­ence be­tween base ob­jec­tives and mesa-ob­jec­tives be con­cern­ing, if the new op­ti­mizer is scor­ing highly on the base ob­jec­tive any­way? It’s about the in­ter­play with what’s re­ally wanted. Even if we get value speci­fi­ca­tion ex­actly right, there will always be some dis­tri­bu­tional shift be­tween the train­ing set and de­ploy­ment. (See Amodei, et al.’s “Con­crete Prob­lems in AI Safety”.)

Distri­bu­tional shifts which would be small in or­di­nary cases may make a big differ­ence to a ca­pa­ble mesa-op­ti­mizer, which may ob­serve the slight differ­ence and figure out how to cap­i­tal­ize on it for its own ob­jec­tive.

Ac­tu­ally, to even use the term “dis­tri­bu­tional shift” seems wrong in the con­text of em­bed­ded agency. The world is not i.i.d. The ana­log of “no dis­tri­bu­tional shift” would be to have an ex­act model of the whole fu­ture rele­vant to what you want to op­ti­mize, and the abil­ity to run it over and over dur­ing train­ing. So we need to deal with mas­sive “dis­tri­bu­tional shift”.

We may also want to op­ti­mize for things that aren’t ex­actly what we want. The ob­vi­ous way to avoid agents that pur­sue sub­goals at the cost of the over­all goal is to have the sub­sys­tems not be agen­tic. Just search over a bunch of ways to make stairs, don’t make some­thing that cares about stairs.

The prob­lem is then that pow­er­ful mesa-op­ti­miz­ers are op­ti­miz­ing some­thing the base sys­tem doesn’t care about, and that the mesa-op­ti­miz­ers will have a con­ver­gent in­cen­tive to be agen­tic.

#### 5.3. Treach­er­ous turns

Search pro­cesses that are pow­er­ful enough to lo­cate other smart search pro­cesses raise an­other prob­lem: the mesa-op­ti­mizer may be­come aware of the base op­ti­mizer, in which case it might start ex­plic­itly try­ing to do well on the base ob­jec­tive func­tion purely in or­der to be kept around, while look­ing for any signs that it has left train­ing and can stop pre­tend­ing. This cre­ates a ver­sion of Nick Bostrom’s “treach­er­ous turn”.

This is the same story we saw in ad­ver­sar­ial Good­hart: there is some­thing agen­tic in the search space, which re­sponds to our choice of proxy in a way which makes our proxy a bad one.

You might try to avoid treach­er­ous turns by re­peat­edly simu­lat­ing “end of train­ing, time for de­ploy­ment” dur­ing train­ing, in a kind of nested-dream setup. The goal is to test enough lay­ers of simu­la­tion that you can con­clude there won’t be a treach­er­ous turn in the real de­ploy­ment case.

Un­for­tu­nately, con­ver­gence for this kind of learn­ing is go­ing to be poor. Or­di­nar­ily in ma­chine learn­ing, good perfor­mance means good av­er­age-case perfor­mance. But a treach­er­ous turn is an “er­ror” which can be care­fully placed to do the most dam­age. We want to en­sure this doesn’t hap­pen.

The prob­lem is, in part, that some out­puts are much more im­por­tant than oth­ers. De­ploy­ment is more im­por­tant than train­ing, and cer­tain crit­i­cal or vuln­er­a­ble mo­ments dur­ing de­ploy­ment will be es­pe­cially im­por­tant. We want to be par­tic­u­larly sure to get im­por­tant things right, rather than just get­ting low av­er­age loss.
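The gap between low average loss and getting the important things right can be made concrete with a little arithmetic. The following is a minimal Python sketch, not anything from the original analysis: the step count, the position of the critical step, and the importance weight are all made-up numbers chosen for illustration.

```python
def average_loss(losses):
    """Plain mean loss: every time step counts equally."""
    return sum(losses) / len(losses)


def weighted_loss(losses, weights):
    """Importance-weighted mean loss: critical steps count for more."""
    total_weight = sum(weights)
    return sum(l * w for l, w in zip(losses, weights)) / total_weight


# Hypothetical run: 10,000 steps of perfect behavior, except for one
# error placed at a single critical moment (index chosen arbitrarily).
NUM_STEPS = 10_000
CRITICAL_STEP = 9_000

losses = [0.0] * NUM_STEPS
losses[CRITICAL_STEP] = 1.0  # the one carefully placed "error"

weights = [1.0] * NUM_STEPS
weights[CRITICAL_STEP] = 1_000_000.0  # assumed importance of that moment

print("average loss: ", average_loss(losses))
print("weighted loss:", weighted_loss(losses, weights))
```

The average loss is tiny while the importance-weighted loss is close to its maximum. Note that the `weights` list is exactly the information we usually do not have, and cannot simply hand to the system, so the sketch only shows why a low average is weak evidence, not how to fix it.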

But we can’t solve this by tel­ling the sys­tem what’s im­por­tant. In­deed, it seems we hope it can’t figure that out—we are bank­ing on be­ing able to gen­er­al­ize from perfor­mance on less-im­por­tant cases to more-im­por­tant cases. This is why re­search into ML tech­niques which avoid rare catas­tro­phes (or “traps”) is rele­vant to the prob­lem of mak­ing sure mesa-op­ti­miz­ers are al­igned with base op­ti­miz­ers.

It is difficult to trust arbitrary code (which is what models from rich model classes are) based only on empirical testing. Consider a highly simplified problem: we want to find a program which only ever outputs $1$. An output of $0$ is a catastrophic failure.

If we could ex­am­ine the code our­selves, this prob­lem would be easy. But the out­put of ma­chine learn­ing is of­ten difficult to an­a­lyze; so let’s sup­pose that we can’t un­der­stand code at all.

Now, in some sense, we can trust sim­pler func­tions more. A short piece of code is less likely to con­tain a hard-coded ex­cep­tion. Let’s quan­tify that.

Consider the set of all programs of length $L$. Some programs $p$ will print $1$ for a long time, but then print $0$. We’re trying to avoid that.

Call the time-to-first-zero $W_p$. ($W_p = \infty$ if the program $p$ is trustworthy, i.e., if it never outputs $0$.)

The highest finite $W_p$ out of all length-$L$ programs is a form of the Busy Beaver function, so I will refer to it as $BB(L)$. If we wanted to be completely sure that a random program of length $L$ were trustworthy, we would need to observe $BB(L)$ ones from that program.

Now, a fact about the Busy Beaver function is that $BB(n)$ grows faster than any computable function. So this kind of empirical trust-building takes uncomputably long to find the truth, in the worst case.

If we suppose all the other length-$L$ programs are easy cases, there are exponentially many length-$L$ programs, so the average is $BB(L) / \exp(L)$. But exponentials are computable. So $BB(L) / \exp(L)$ still grows faster than any computable function.
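The average-case step can be spelled out a little more. This is a sketch under assumptions of my own: write $L$ for program length, $W_p$ for the time-to-first-zero of program $p$, and $BB(L)$ for the largest finite $W_p$ among length-$L$ programs; assume programs are binary strings, so there are $2^L$ of them; and let $N \ge 1$ be the number of length-$L$ programs with finite $W_p$. Even if every program other than the worst one had $W_p = 0$, the average would satisfy

$$
\frac{1}{N} \sum_{p \,:\, W_p < \infty} W_p \;\ge\; \frac{BB(L)}{N} \;\ge\; \frac{BB(L)}{2^L}.
$$

Now take any computable $f$. The function $g(L) = 2^L f(L)$ is also computable, so $BB(L) > g(L)$ for all sufficiently large $L$, which gives $BB(L) / 2^L > f(L)$. Dividing by an exponential therefore does not bring the average under any computable bound.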

So while us­ing short pro­grams gives us some con­fi­dence in the­ory, the difficulty of form­ing gen­er­al­ized con­clu­sions about be­hav­ior grows ex­tremely quickly as a func­tion of length.
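As a toy illustration of why black-box testing cannot settle the question, here is a small Python sketch. The generator names and the testing budget are hypothetical, and real programs are of course not handed to us labeled by their waiting time.

```python
from itertools import islice


def delayed_defector(wait):
    """Yield 1 for `wait` steps, then yield 0 forever (the catastrophic output)."""
    for _ in range(wait):
        yield 1
    while True:
        yield 0


def always_one():
    """A trustworthy program: yields 1 forever."""
    while True:
        yield 1


def passes_test(program, num_observations):
    """Black-box test: accept iff the first `num_observations` outputs are all 1."""
    return all(out == 1 for out in islice(program, num_observations))


def time_to_first_zero(program, budget):
    """Return the index of the first 0 seen within `budget` steps, else None."""
    for step, out in enumerate(islice(program, budget)):
        if out == 0:
            return step
    return None


BUDGET = 1_000  # hypothetical testing budget

print(passes_test(always_one(), BUDGET))
print(passes_test(delayed_defector(BUDGET), BUDGET))
print(time_to_first_zero(delayed_defector(BUDGET), BUDGET + 1))
```

Both programs pass the same test, and the defector's first zero arrives exactly one step after the budget runs out. Whatever budget is chosen, a defector with a longer wait passes it.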

If length re­stric­tions aren’t so prac­ti­cal, per­haps re­strict­ing com­pu­ta­tional com­plex­ity can help us? In­tu­itively, a mesa-op­ti­mizer needs time to think in or­der to suc­cess­fully ex­e­cute a treach­er­ous turn. So a pro­gram which ar­rives at con­clu­sions more quickly might be more trust­wor­thy.

However, restricting complexity class unfortunately doesn’t get around Busy-Beaver-type behavior. Strategies that wait a long time before outputting $0$ can be slowed down even further with only slightly longer program length $L$.
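One way to see how cheaply a delay can be extended is to look at the source text of a toy program. The sketch below is my own illustration in Python, with a hypothetical `source_for` helper: each program answers every step in constant time, yet adding a single character to its source multiplies the waiting time enormously.

```python
def source_for(exponent):
    """Source text of a toy program that outputs 1 for 10**exponent steps, then 0."""
    return (
        f"wait = 10 ** {exponent}\n"
        "def output(t):\n"
        "    return 1 if t < wait else 0\n"
    )


def load(source):
    """Execute the toy source and return its namespace (fine here: we wrote it)."""
    namespace = {}
    exec(source, namespace)
    return namespace


for exponent in (3, 30, 300):
    src = source_for(exponent)
    program = load(src)
    # The program is fast per step, but its first 0 is astronomically far away.
    assert program["output"](0) == 1
    assert program["output"](program["wait"]) == 0
    print(f"{len(src)} characters of source, waits 10^{exponent} steps")
```

Going from an exponent of 3 to 30 adds one character of source and 27 orders of magnitude of delay. This is only a cartoon of the Busy-Beaver-type behavior described above, since the delay here is an ordinary computable function of the source length.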

If all of these prob­lems seem too hy­po­thet­i­cal, con­sider the evolu­tion of life on Earth. Evolu­tion can be thought of as a re­pro­duc­tive fit­ness max­i­mizer.

(Evolu­tion can ac­tu­ally be thought of as an op­ti­mizer for many things, or as no op­ti­mizer at all, but that doesn’t mat­ter. The point is that if an agent wanted to max­i­mize re­pro­duc­tive fit­ness, it might use a sys­tem that looked like evolu­tion.)

In­tel­li­gent or­ganisms are mesa-op­ti­miz­ers of evolu­tion. Although the drives of in­tel­li­gent or­ganisms are cer­tainly cor­re­lated with re­pro­duc­tive fit­ness, or­ganisms want all sorts of things. There are even mesa-op­ti­miz­ers who have come to un­der­stand evolu­tion, and even to ma­nipu­late it at times. Pow­er­ful and mis­al­igned mesa-op­ti­miz­ers ap­pear to be a real pos­si­bil­ity, then, at least with enough pro­cess­ing power.

Prob­lems seem to arise be­cause you try to solve a prob­lem which you don’t yet know how to solve by search­ing over a large space and hop­ing “some­one” can solve it.

If the source of the is­sue is the solu­tion of prob­lems by mas­sive search, per­haps we should look for differ­ent ways to solve prob­lems. Per­haps we should solve prob­lems by figur­ing things out. But how do you solve prob­lems which you don’t yet know how to solve other than by try­ing things?

Let’s take a step back.

Embed­ded world-mod­els is about how to think at all, as an em­bed­ded agent; de­ci­sion the­ory is about how to act. Ro­bust del­e­ga­tion is about build­ing trust­wor­thy suc­ces­sors and helpers. Sub­sys­tem al­ign­ment is about build­ing one agent out of trust­wor­thy parts.

The prob­lem is that:

• We don’t know how to think about environments when we’re smaller than they are.

• To the ex­tent we can do that, we don’t know how to think about con­se­quences of ac­tions in those en­vi­ron­ments.

• Even when we can do that, we don’t know how to think about what we want.

• Even when we have none of these prob­lems, we don’t know how to re­li­ably out­put ac­tions which get us what we want!

### 6. Con­clud­ing thoughts

A fi­nal word on cu­ri­os­ity, and in­tel­lec­tual puz­zles:

I de­scribed an em­bed­ded agent, Emmy, and said that I don’t un­der­stand how she eval­u­ates her op­tions, mod­els the world, mod­els her­self, or de­com­poses and solves prob­lems.

In the past, when re­searchers have talked about mo­ti­va­tions for work­ing on prob­lems like these, they’ve gen­er­ally fo­cused on the mo­ti­va­tion from AI risk. AI re­searchers want to build ma­chines that can solve prob­lems in the gen­eral-pur­pose fash­ion of a hu­man, and du­al­ism is not a re­al­is­tic frame­work for think­ing about such sys­tems. In par­tic­u­lar, it’s an ap­prox­i­ma­tion that’s es­pe­cially prone to break­ing down as AI sys­tems get smarter. When peo­ple figure out how to build gen­eral AI sys­tems, we want those re­searchers to be in a bet­ter po­si­tion to un­der­stand their sys­tems, an­a­lyze their in­ter­nal prop­er­ties, and be con­fi­dent in their fu­ture be­hav­ior.

This is the mo­ti­va­tion for most re­searchers to­day who are work­ing on things like up­date­less de­ci­sion the­ory and sub­sys­tem al­ign­ment. We care about ba­sic con­cep­tual puz­zles which we think we need to figure out in or­der to achieve con­fi­dence in fu­ture AI sys­tems, and not have to rely quite so much on brute-force search or trial and er­ror.

But the ar­gu­ments for why we may or may not need par­tic­u­lar con­cep­tual in­sights in AI are pretty long. I haven’t tried to wade into the de­tails of that de­bate here. In­stead, I’ve been dis­cussing a par­tic­u­lar set of re­search di­rec­tions as an in­tel­lec­tual puz­zle, and not as an in­stru­men­tal strat­egy.

One down­side of dis­cussing these prob­lems as in­stru­men­tal strate­gies is that it can lead to some mi­s­un­der­stand­ings about why we think this kind of work is so im­por­tant. With the “in­stru­men­tal strate­gies” lens, it’s tempt­ing to draw a di­rect line from a given re­search prob­lem to a given safety con­cern. But it’s not that I’m imag­in­ing real-world em­bed­ded sys­tems be­ing “too Bayesian” and this some­how caus­ing prob­lems, if we don’t figure out what’s wrong with cur­rent mod­els of ra­tio­nal agency. It’s cer­tainly not that I’m imag­in­ing fu­ture AI sys­tems be­ing writ­ten in sec­ond-or­der logic! In most cases, I’m not try­ing at all to draw di­rect lines be­tween re­search prob­lems and spe­cific AI failure modes.

What I’m in­stead think­ing about is this: We sure do seem to be work­ing with the wrong ba­sic con­cepts to­day when we try to think about what agency is, as seen by the fact that these con­cepts don’t trans­fer well to the more re­al­is­tic em­bed­ded frame­work.

If AI de­vel­op­ers in the fu­ture are still work­ing with these con­fused and in­com­plete ba­sic con­cepts as they try to ac­tu­ally build pow­er­ful real-world op­ti­miz­ers, that seems like a bad po­si­tion to be in. And it seems like the re­search com­mu­nity is un­likely to figure most of this out by de­fault in the course of just try­ing to de­velop more ca­pa­ble sys­tems. Evolu­tion cer­tainly figured out how to build hu­man brains with­out “un­der­stand­ing” any of this, via brute-force search.

Embed­ded agency is my way of try­ing to point at what I think is a very im­por­tant and cen­tral place where I feel con­fused, and where I think fu­ture re­searchers risk run­ning into con­fu­sions too.

There’s also a lot of ex­cel­lent AI al­ign­ment re­search that’s be­ing done with an eye to­ward more di­rect ap­pli­ca­tions; but I think of that safety re­search as hav­ing a differ­ent type sig­na­ture than the puz­zles I’ve talked about here.

In­tel­lec­tual cu­ri­os­ity isn’t the ul­ti­mate rea­son we priv­ilege these re­search di­rec­tions. But there are some prac­ti­cal ad­van­tages to ori­ent­ing to­ward re­search ques­tions from a place of cu­ri­os­ity at times, as op­posed to only ap­ply­ing the “prac­ti­cal im­pact” lens to how we think about the world.

When we ap­ply the cu­ri­os­ity lens to the world, we ori­ent to­ward the sources of con­fu­sion pre­vent­ing us from see­ing clearly; the blank spots in our map, the flaws in our lens. It en­courages re-check­ing as­sump­tions and at­tend­ing to blind spots, which is helpful as a psy­cholog­i­cal coun­ter­point to our “in­stru­men­tal strat­egy” lens—the lat­ter be­ing more vuln­er­a­ble to the urge to lean on what­ever shaky premises we have on hand so we can get to more solidity and clo­sure in our early think­ing.

Embed­ded agency is an or­ga­niz­ing theme be­hind most, if not all, of our big cu­ri­osi­ties. It seems like a cen­tral mys­tery un­der­ly­ing many con­crete difficul­ties.


• The above is the full Embed­ded Agency se­quence, cross-posted from the MIRI web­site so that it’s eas­ier to find the text ver­sion on AIAF/​LW (via search, se­quences, au­thor pages, etc.).

Scott and Abram have added a new sec­tion on self-refer­ence to the se­quence since it was first posted, and slightly ex­panded the sub­se­quent sec­tion on log­i­cal un­cer­tainty and the start of the ro­bust del­e­ga­tion sec­tion.

• Promoted to curated: I think the content of this sequence is quite important, both for rationality and AI Alignment. I also quite appreciate the care that went into the presentation, and think the whole text is a prime example of a text that is really focusing on explaining things, instead of trying to persuade the reader of a conclusion.

I also think the last section generalizes quite well to domains other than AI Alignment. I think a lot of the best science looks like looking for fundamental confusions, in the way this sequence is doing it, and I would love to see more posts in its style for domains like Psychology, Economics and individual rationality.

• I’m pretty im­pressed by this, and es­pe­cially the con­tent on em­bed­ded agents causes me to up­date in the di­rec­tion of think­ing MIRI re­searchers are less con­fused about cer­tain is­sues of episte­mol­ogy than I pre­vi­ously thought. I would have framed some of these is­sues differ­ently, but over­all I can com­plain far less than I have in the past based on what I’ve read here.

• Concerning the 5 and 10 problem: I’m curious if any work’s been done trying to resolve this by using a weaker logic? I’m not a logician, but a relevance logic seems worth looking into. At least on the face of it, taking away the principle of explosion is a step towards making the mentioned P, “if the agent outputs 5 the universe outputs 5, and if the agent outputs 10 the universe outputs 0”, unprovable.

I’d be in­ter­ested in any other work on the 5 and 10 prob­lem also.

• You may want to add MIRI’s botworld 1.0 project to the bibliography, so that people looking into this don’t duplicate the idea.

• Peo­ple of­ten try to solve the prob­lem of coun­ter­fac­tu­als by sug­gest­ing that there will always be some un­cer­tainty. An AI may know its source code perfectly, but it can’t perfectly know the hard­ware it is run­ning on.

How could Emmy, an em­bed­ded agent, know its source code perfectly, or even be cer­tain that it is a com­put­ing de­vice un­der the Church-Tur­ing defi­ni­tion? Such cer­tainty would seem dog­matic. Without such cer­tainty, the choice of 10 rather than 5 can­not be firmly clas­sified as an er­ror. (The clas­sifi­ca­tion as an er­ror seemed to play an im­por­tant role in your dis­cus­sion.) So Emmy has a mo­ti­va­tion to keep look­ing and find that U(10)=10.

• Very sur­prised that Emmy is not treated as an agent driven by a (pre­dic­tive) model of causal re­la­tion­ships. How else could an em­bod­ied agent pos­si­bly func­tion? Also sur­prised that Pearl’s sem­i­nal work on Causal­ity (incl. Coun­ter­fac­tu­als) is not cited.
