[AN #69] Stuart Russell’s new book on why we need to replace the standard model of AI

Link post

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I’m always happy to hear feedback; you can send it to me by replying to this email.

This is a bonus newsletter summarizing Stuart Russell’s new book, along with summaries of a few of the most relevant papers. It’s entirely written by Rohin, so the usual “summarized by” tags have been removed.

We’re also changing the publishing schedule: so far, we’ve aimed to send a newsletter every Monday; we’re now aiming to send a newsletter every Wednesday.

Audio version here (may not be up yet).

Human Compatible: Artificial Intelligence and the Problem of Control (Stuart Russell): Since I am aiming this summary at people who are already familiar with AI safety, my summary is substantially reorganized from the book, and skips large portions that I expect will be less useful for this audience. If you are not familiar with AI safety, note that I am skipping many arguments and counterarguments in the book that are aimed at you. I’ll refer to the book as “HC” in this newsletter.

Before we get into details of impacts and solutions to the problem of AI safety, it’s important to have a model of how AI development will happen. Many estimates have been made by figuring out the amount of compute needed to run a human brain, and figuring out how long it will be until we get there. HC doesn’t agree with these; it suggests the bottleneck for AI is in the algorithms rather than the hardware. We will need several conceptual breakthroughs, for example in language or common-sense understanding, cumulative learning (the analog of cultural accumulation for humans), discovering hierarchy, and managing mental activity (that is, the metacognition needed to prioritize what to think about next). It’s not clear how long these will take, and whether there will need to be more breakthroughs after these occur, but these seem like necessary ones.

What could happen if we do get beneficial superintelligent AI? While there is a lot of sci-fi speculation we could do here, as a weak lower bound, it should at least be able to automate away almost all existing human labor. Assuming that superintelligent AI is very cheap, most services and many goods would become extremely cheap. Even many primary products such as food and natural resources would become cheaper, as human labor is still a significant fraction of their production cost. If we assume that this could bring everyone’s standard of living up to that of the 88th-percentile American, that would result in nearly a tenfold increase in world GDP per year. Assuming a 5% discount rate per year, this corresponds to $13.5 quadrillion in net present value. Such a giant prize removes many reasons for conflict, and should encourage everyone to cooperate to ensure we all get to keep it.
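The arithmetic behind the $13.5 quadrillion figure can be reconstructed under simple assumptions. The baseline world GDP of roughly $75 trillion/year is my assumption (the book’s exact inputs may differ); the tenfold increase and 5% discount rate are from the text, with the increase valued as a perpetuity:

```python
# Rough reproduction of HC's $13.5 quadrillion net-present-value figure.
# Assumption (mine, not stated in the summary): baseline world GDP of
# ~$75 trillion/year. A tenfold total GDP means the *increase* is 9x
# the baseline; a perpetuity at discount rate r is worth delta / r.

baseline_gdp = 75e12            # $/year (assumed)
gdp_increase = 9 * baseline_gdp # tenfold total => 9x extra per year
discount_rate = 0.05

net_present_value = gdp_increase / discount_rate
print(f"${net_present_value / 1e15:.1f} quadrillion")  # → $13.5 quadrillion
```

Any baseline near $75T gives a value in this ballpark, which is why the book can quote a round $13.5 quadrillion.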

Of course, this doesn’t mean that there aren’t any problems, even with AI that does what its owner wants. Depending on who has access to powerful AI systems, we could see a rise in automated surveillance, lethal autonomous weapons, automated blackmail, fake news, and behavior manipulation. Another issue is that once AI is better than humans at all tasks, we may end up delegating everything to AI and losing autonomy, leading to human enfeeblement.

This all assumes that we are able to control AI. However, we should be cautious about such an endeavor: if nothing else, we should be careful about creating entities that are more intelligent than us. After all, the gorillas probably aren’t too happy about the fact that their habitat, happiness, and existence depend on our moods and whims. For this reason, HC calls this the gorilla problem: specifically, “the problem of whether humans can maintain their supremacy and autonomy in a world that includes machines with substantially greater intelligence”. Of course, we aren’t in the same position as the gorillas: we get to design the more intelligent “species”. But we should probably have some good arguments explaining why our design isn’t going to succumb to the gorilla problem. This is especially important in the case of a fast intelligence explosion, or hard takeoff, because in that scenario we do not get any time to react and solve any problems that arise.

Do we have such an argument right now? Not really, and in fact there’s an argument that we will succumb to the gorilla problem. The vast majority of research in AI and related fields assumes that there is some definite, known specification or objective that must be optimized. In RL, we optimize the reward function; in search, we look for states matching a goal criterion; in statistics, we minimize expected loss; in control theory, we minimize the cost function (typically deviation from some desired behavior); in economics, we design mechanisms and policies to maximize the utility of individuals, the welfare of groups, or the profit of corporations. This leads HC to propose the following standard model of machine intelligence: machines are intelligent to the extent that their actions can be expected to achieve their objectives. However, if we put in the wrong objective, the machine’s obstinate pursuit of that objective would lead to outcomes we won’t like.

Consider for example the content-selection algorithms used by social media, which typically maximize some measure of engagement, like click-through rate. Despite their lack of intelligence, such algorithms end up changing users’ preferences so that users become more predictable, since more predictable users can be given items they are more likely to click on. In practice, this means that users are pushed to become more extreme in their political views. Arguably, these algorithms have already caused much damage to the world.

So the problem is that we don’t know how to put our objectives inside the AI system so that when it optimizes its objective, the results are good for us. Stuart calls this the “King Midas” problem: as the legend goes, King Midas wished that everything he touched would turn to gold, not realizing that “everything” included his daughter and his food, a classic case of a badly specified objective (AN #1). In some sense, we’ve known about this problem for a long time, both from King Midas’s tale, and from stories about genies, where the characters inevitably want to undo their wishes.

You might think that we could simply turn off the power to the AI, but that won’t work: for almost any definite goal, the AI has an incentive to stay operational, simply because that is necessary for it to achieve its goal. This is captured in what may be Stuart’s most famous quote: you can’t fetch the coffee if you’re dead. This is one of a few worrisome convergent instrumental subgoals.

What went wrong? The problem was the way we evaluated machine intelligence, which doesn’t take into account the fact that machines should be useful for us. HC proposes instead: machines are beneficial to the extent that their actions can be expected to achieve our objectives. Under this definition, instead of optimizing a definite (and possibly wrong) objective, our AI systems will themselves be uncertain about the objective, since we ourselves don’t know what our objectives are. HC expands on this by proposing three principles for the design of AI systems, which I’ll quote here in full:

1. The machine’s only objective is to maximize the realization of human preferences.

2. The machine is initially uncertain about what those preferences are.

3. The ultimate source of information about human preferences is human behavior.

Cooperative Inverse Reinforcement Learning provides a formal model of an assistance game that showcases these principles. You might worry that an AI system that is uncertain about its objective will not be as useful as one that knows the objective, but actually this uncertainty is a feature, not a bug: it leads to AI systems that are deferential, that ask for clarifying information, and that try to learn human preferences. The Off-Switch Game shows that because the AI is uncertain about the reward, it will let itself be shut off. These papers are discussed later in this newsletter.

So that’s the proposed solution. You might worry that it is quite challenging: after all, it requires a shift in the entire way we do AI. What if the standard model of AI can deliver more results, even if just because more people work on it? Here, HC is optimistic: the big issue with the standard model is that it is not very good at learning our preferences, and there is huge economic pressure to learn preferences. For example, I would pay a lot of money for an AI assistant that accurately learns my preferences for meeting times and schedules them completely autonomously.

Another research challenge is how to actually put principle 3 into practice: it requires us to connect human behavior to human preferences. Inverse Reward Design and Preferences Implicit in the State of the World (AN #45) are example papers that tackle portions of this. However, there are lots of subtleties in this connection. We need to use Gricean semantics for language: when we say X, we do not mean just the literal meaning of X; the agent must also take into account the fact that we bothered to say X, and that we didn’t say Y. For example, I’m only going to ask the agent to buy a cup of coffee if I believe that there is a place to buy reasonably priced coffee nearby. If those beliefs happen to be wrong, the agent should ask for clarification, rather than trudge hundreds of miles or pay hundreds of dollars to ensure I get my cup of coffee.
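As a toy illustration of this Gricean point (a sketch of my own; the price threshold and function names are hypothetical, not from any of these papers), an agent can treat an instruction as evidence only within the conditions that would have made the request sensible, and ask for clarification outside them:

```python
# Hypothetical sketch: an agent asked to "buy a cup of coffee" decides
# whether to act or ask for clarification. The request only makes sense
# if the human believed coffee was cheaply available nearby; if the
# observed price falls outside that range, the literal instruction is
# weak evidence about preferences, so the agent should ask.

def act_or_clarify(observed_price, plausible_max_price=10.0):
    """Return 'buy' or 'ask'. plausible_max_price models the highest
    price at which the human would plausibly have made the request."""
    if observed_price <= plausible_max_price:
        return "buy"
    # The observed world contradicts what the request presupposed, so
    # a cheap clarifying question beats charging ahead.
    return "ask"

print(act_or_clarify(3.50))    # → buy
print(act_or_clarify(200.0))   # → ask
```

The point is not the threshold itself but that the agent models the beliefs under which the instruction was issued, rather than optimizing its literal content.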

Another problem with inferring preferences from behavior is that humans are nearly always in some deeply nested plan, and many actions don’t even occur to us. Right now I’m writing this summary, and not considering whether I should become a fireman. I’m not writing this summary because I just ran a calculation showing that this would best achieve my preferences; I’m doing it because it’s a subpart of the overall plan of writing this bonus newsletter, which itself is a subpart of other plans. The connection to my preferences is very far up. How do we deal with that fact?

There are perhaps more fundamental challenges with the notion of “preferences” itself. For example, our experiencing self and our remembering self may have different preferences; if so, which one should our agent optimize for? In addition, our preferences often change over time: should our agent optimize for our current preferences, even if it knows that they will predictably change in the future? This one could potentially be solved by learning meta-preferences that dictate what kinds of preference-change processes are acceptable.

All of these issues suggest that we need work across many fields (such as AI, cognitive science, psychology, and neuroscience) to reverse-engineer human cognition, so that we can put principle 3 into action and create a model that shows how human behavior arises from human preferences.

So far, we’ve been talking about the case with a single human. But of course, there are going to be multiple humans: how do we deal with that? As a baseline, we could imagine that every human gets their own agent that optimizes for their preferences. However, this would differentially benefit people who care less about other people’s welfare, since their agents have access to many potential plans that wouldn’t be available to an agent for someone who cared about other people. For example, if Harriet was going to be late for a meeting with Ivan, her AI agent might arrange for Ivan to be even later.

What if we had laws that prevented AI systems from acting in such antisocial ways? It seems likely that superintelligent AI would be able to find loopholes in such laws, so that it does things that are strictly legal but still antisocial, e.g. line-cutting. (This problem is similar to the problem that we can’t just write down what we want and have AI optimize it.)

What if we made our AI systems utilitarian (assuming we figured out some acceptable method of comparing utilities across people)? Then we get the “Somalia problem”: agents will end up going to Somalia to help the worse-off people there, and so no one would ever buy such an agent.

Overall, it’s not obvious how we deal with the transition from a single human to multiple humans. While HC focuses on a potential solution for the single-human / single-agent case, there is still much more to be said and done to account for the impact of AI on all of humanity. To quote HC, “There is really no analog in our present world to the relationship we will have with beneficial intelligent machines in the future. It remains to be seen how the endgame turns out.”

Rohin’s opinion: I enjoyed reading this book; I don’t usually get to read a single person’s overall high-level view on the state of AI, how it could have societal impact, the argument for AI risk, potential solutions, and the need for AI governance. It’s nice to see all of these areas I think about tied together into a single coherent view. While I agree with much of the book, especially the conceptual switch from the standard model of intelligent machines to Stuart’s model of beneficial machines, I’m going to focus on disagreements in this opinion.

First, the book has an implied stance towards the future of AI research that I don’t agree with: I could imagine that powerful AI systems end up being created by learning alone, without needing the conceptual breakthroughs that Stuart outlines. This has been proposed in e.g. AI-GAs (AN #63), and seems to be the implicit belief that drives OpenAI’s and DeepMind’s research agendas. This leads to differences in risk analysis and solutions: for example, the inner alignment problem (AN #58) only applies to agents arising from learning algorithms, and I suspect it would not apply under Stuart’s view of AI progress.

The book also gives the impression that to solve AI safety, we simply need to make sure that AI systems are optimizing the right objective, at least in the case where there is a single human and a single robot. Again, depending on how future AI systems work, that could be true, but I expect there will be other problems that need to be solved as well. I’ve already mentioned inner alignment; other graduate students at CHAI work on e.g. robustness and transparency.

The proposal for aligning AI requires us to build a model that relates human preferences to human behavior. This sounds extremely hard to get completely right. Of course, we may not need a model that is completely right: since reward uncertainty makes the agent amenable to shutdowns, it seems plausible that we can correct mistakes in the model as they come up. But it’s not obvious to me that this is sufficient.

The sections on multiple humans are much more speculative and I have more disagreements there, but I expect that is simply because we haven’t done enough research yet. For example, HC worries that we won’t be able to use laws to prevent AIs from doing technically legal but still antisocial things for the benefit of a single human. This seems true if you imagine that a single human suddenly gets access to a superintelligent AI, but when everyone has a superintelligent AI, the current system in which humans socially penalize each other for norm violations may scale up naturally. The overall effect depends on whether AI makes it easier to violate norms, or to detect and punish norm violations.

Read more: Max Tegmark’s summary, Alex Turner’s thoughts

AI Alignment Podcast: Human Compatible: Artificial Intelligence and the Problem of Control (Lucas Perry and Stuart Russell): This podcast covers some of the main ideas from the book, which I’ll ignore for this summary. It also talks a bit about the motivations for the book. Stuart has three audiences in mind. He wants to explain to laypeople what AI is and why it matters. He wants to convince AI researchers that they should be working in this new model of beneficial AI that optimizes for our objectives, rather than the standard model of intelligent AI that optimizes for its objectives. Finally, he wants to recruit academics in other fields to help connect human behavior to human preferences (principle 3), as well as to figure out how to deal with multiple humans.

Stuart also points out that his book has two main differences from Superintelligence and Life 3.0: first, his book explains how existing AI techniques work (and in particular it explains the standard model), and second, it proposes a technical solution to the problem (the three principles).

Cooperative Inverse Reinforcement Learning (Dylan Hadfield-Menell et al): This paper provides a formalization of the three principles from the book, in the case where there is a single human H and a single robot R. H and R are trying to optimize the same reward function. Since both H and R are represented in the environment, it can be the human’s reward: that is, it is possible to reward the state where the human drinks coffee, without also rewarding the state where the robot drinks coffee. This corresponds to the first principle: that machines should optimize our objectives. The second principle, that machines should initially be uncertain about our objectives, is incorporated by assuming that only H knows the reward, requiring R to maintain a belief over the reward. Finally, for the third principle, R needs to get information about the reward from H’s behavior, and so R assumes that H will choose actions that best optimize the reward (taking into account the fact that R doesn’t know the reward).

This defines a two-player game, originally called a CIRL game but now called an assistance game. We can compute optimal joint strategies for H and R. Since this is an interactive process, H can do better than just acting optimally as if R did not exist (the assumption typically made in IRL): H can teach R what the reward is. In addition, R does not simply passively listen and then act, but interleaves learning and acting, and so must manage the explore-exploit tradeoff.
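A minimal sketch of the learning half of such a game (my own toy setup, not the paper’s formalism) shows all three principles: H knows which reward is true, R starts with a uniform belief (principle 2), updates it from H’s behavior under a Boltzmann model of H (principle 3), and then acts on its posterior about H’s reward (principle 1):

```python
# Toy assistance-game belief update. The human H knows which of two
# items the true reward theta favors; the robot R observes one action
# from H and updates by Bayes' rule, modeling H as Boltzmann-rational.

import math

THETAS = ["apple", "banana"]                     # possible true rewards
ACTIONS = ["point_at_apple", "point_at_banana"]  # H's teaching actions

def human_action_prob(action, theta, beta=4.0):
    """Boltzmann-rational H: more likely to point at the truly
    rewarded item; beta controls how rational H is."""
    def r(a):
        return 1.0 if a.endswith(theta) else 0.0
    z = sum(math.exp(beta * r(a)) for a in ACTIONS)
    return math.exp(beta * r(action)) / z

def robot_posterior(observed_action):
    prior = {t: 0.5 for t in THETAS}   # principle 2: initial uncertainty
    post = {t: prior[t] * human_action_prob(observed_action, t)
            for t in THETAS}
    total = sum(post.values())
    return {t: p / total for t, p in post.items()}

post = robot_posterior("point_at_apple")  # principle 3: learn from behavior
print(post)                               # belief shifts heavily to "apple"
best = max(post, key=post.get)            # principle 1: act on H's reward
print(best)                               # → apple
```

A full assistance game would also have H choose actions anticipating R’s update (the teaching incentive), and R interleave acting with learning; this sketch only covers the inference step.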

See also Learning to Interactively Learn and Assist (AN #64), which is inspired by this paper and does a similar thing with deep RL.

Read more: BAIR blog post: Cooperatively Learning Human Values

The Off-Switch Game (Dylan Hadfield-Menell et al): This paper studies theoretically the impact of uncertainty over the reward on R’s incentives around potential off switches. It proposes the simplest model that the authors expect to lead to generalizable results. R and H are in an assistance game, in which R goes first. R may either take an action a, getting utility u, or shut itself down, getting utility 0. In either case, the game ends immediately. Alternatively, R can choose to wait, in which case H can either shut down R, getting utility 0, or allow R to go ahead with action a, getting utility u.

If H is perfectly rational, then waiting is always an optimal action for R, since H will ensure that the team gets max(u, 0) utility. There can be other optimal actions: if R is sure that u >= 0, then taking action a is also optimal, and similarly if R is sure that u <= 0, then shutting down is also optimal. However, if H is not rational, and sometimes fails to take the utility-maximizing action (in a way R can’t predict), then things get murkier. If R is sure about the value of u, then it is never optimal to wait: better to just take the action a (if u >= 0) or shut down (if u < 0) rather than let H screw it up. If R is pretty confident that u is positive, it may still decide to take action a, rather than risk that H makes the wrong decision. However, if R is very uncertain about the sign of u, then waiting becomes optimal again. In general, more uncertainty over the reward leads to more deferential behavior (allowing H to shut it off), but at a cost: R is much less able to help H when it is very uncertain about the reward.
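These incentives can be checked numerically with a small sketch (my own rendering of the model, not the paper’s code; the Gaussian belief over u and the error rate eps are illustrative choices):

```python
# Off-switch game expected utilities under an irrational human.
# R holds a belief over the utility u of action a, represented by
# samples. H is irrational with probability eps, in which case she
# approves or blocks uniformly at random.

import random
import statistics

def value_act(samples):
    return statistics.mean(samples)   # just take action a: E[u]

def value_wait(samples, eps=0.1):
    # Rational H (prob 1 - eps) yields max(u, 0); irrational H flips a
    # fair coin between u and 0, for expected value u / 2.
    return statistics.mean((1 - eps) * max(u, 0.0) + eps * (u / 2.0)
                           for u in samples)

random.seed(0)
certain = [0.5] * 10_000                               # R sure u = 0.5
uncertain = [random.gauss(0.0, 1.0) for _ in range(10_000)]

# Confident R prefers to act rather than risk H's mistake:
print(value_act(certain), value_wait(certain))
# Very uncertain R prefers to wait (defer to H):
print(value_act(uncertain), value_wait(uncertain))
```

With the certain belief, acting yields 0.5 while waiting yields less (H might wrongly shut R down); with the uncertain belief, waiting is worth roughly 0.9 × E[max(u, 0)] while acting is worth about 0, matching the paper’s qualitative conclusion.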

Rohin’s opinion: While I agree with the broad thrust of this paper, I do have one nitpick: the game ends immediately after H chooses whether or not to shut off R. In reality, if R isn’t shut off, the assistance game will continue, which changes the incentives. If R can be relatively confident in the utility of some action (e.g. doing nothing), then it may be a better plan for it to disable the shutdown button, take that action, and observe H in the meantime to learn the reward. Then, after it has learned more about the reward and figured out why H wanted to shut it down, it can act well and get utility (rather than being stuck with the zero utility from being shut down). While this doesn’t seem great, it’s not obviously bad: R ends up doing nothing until it can figure out how to actually be useful, hardly a catastrophic outcome. Really bad outcomes only arise if R becomes confident in the wrong reward due to some kind of misspecification, as suggested in Incorrigibility in the CIRL Framework, summarized next.

Incorrigibility in the CIRL Framework (Ryan Carey): This paper demonstrates that when the agent has an incorrect belief about the human’s reward function, you no longer get the benefit that the agent will obey shutdown instructions. It argues that since the purpose of a shutdown button is to function as a safety measure of last resort (when all other measures have failed), it should not rely on an assumption that the agent’s belief about the reward is correct.

Rohin’s opinion: I certainly agree that if the agent is wrong in its beliefs about the reward, then it is quite likely that it would not obey shutdown commands. For example, in the off-switch game, if the agent is incorrectly certain that u is positive, then it will take action a, even though the human would want to shut it down. See also these posts (AN #32) on model misspecification and IRL. For a discussion of how serious the overall critique is, both from HC’s perspective and mine, see the opinion on the next post.

Problem of fully updated deference (Eliezer Yudkowsky): This article points out that even if you have an agent with uncertainty over the reward function, it will acquire information and reduce its uncertainty over the reward, until eventually it can’t reduce uncertainty any more; at that point it would simply optimize the expectation of the resulting distribution, which is equivalent to optimizing a known objective, and has the same issues (such as disabling shutdown buttons).

Rohin’s opinion: As with the previous paper, this argument is only really a problem when the agent’s belief about the reward function is wrong: if it is correct, then at the point where there is no more information to gain, the agent should already know that humans don’t like to be killed, do like to be happy, etc., and optimizing the expectation of the reward distribution should lead to good outcomes. Both this and the previous critique are worrisome when you can’t even put a reasonable prior over the reward function, which is quite a strong claim.

HC’s response is that the agent should never assign zero probability to any hypothesis. It suggests that you could have an expandable hierarchical prior, where initially there are relatively simple hypotheses, but as hypotheses become worse at explaining the data, you “expand” the set of hypotheses, ultimately bottoming out at (perhaps) the universal prior. I think that such an approach could work in principle, but there are two challenges in practice. First, it may not be computationally feasible to do this. Second, it’s not clear how such an approach can deal with the fact that human preferences change over time. (HC does want more research into both of these.)

Fully updated deference could also be a problem if the observation model used by the agent is incorrect, rather than the prior. I’m not sure if this is part of the argument.

Inverse Reward Design (Dylan Hadfield-Menell et al): Usually, in RL, the reward function is treated as the definition of optimal behavior, but this conflicts with the third principle, which says that human behavior is the ultimate source of information about human preferences. Nonetheless, reward functions clearly have some information about our preferences: how do we make them compatible with the third principle? We need to connect the reward function to human behavior somehow.

This paper proposes a simple answer: since reward designers usually make reward functions through a process of trial and error, where they test their reward functions and see what they incentivize, the reward function tells us about optimal behavior in the training environment(s). The authors formalize this using a Boltzmann rationality model, where the reward designer is more likely to pick a proxy reward when it gives higher true reward in the training environment (but it doesn’t matter if the proxy reward becomes decoupled from the true reward in some test environment). With this assumption connecting the human behavior (i.e. the proxy reward function) to the human preferences (i.e. the true reward function), they can then perform Bayesian inference to get a posterior distribution over the true reward function.

They demonstrate that by using risk-averse planning with respect to this posterior distribution, the agent can avoid negative side effects that it has never seen before and has no information about. For example, if the agent was trained to collect gold in an environment with dirt and grass, and is then tested in an environment with lava, the agent will know that even though the specified reward was indifferent about lava, this doesn’t mean much, since any weight on lava would have led to the same behavior in the training environment. Due to risk aversion, it conservatively assumes that the lava is bad, and so successfully avoids it.
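The key inference can be sketched in a few lines. The setup below is hypothetical (mine, not the paper’s experiments): features are (gold, dirt, lava), and the designer picked proxy weights (1, 0, 0) by testing in a training environment with no lava. The proxy therefore carries no information about lava, IRD recovers exactly that uncertainty, and risk-averse planning then treats lava as bad:

```python
# Toy Inverse Reward Design inference. Because the training environment
# has no lava, the proxy's training behavior (collect gold) earns the
# same true reward regardless of the true lava weight, so the posterior
# over the lava weight stays at the uniform prior.

import itertools
import math

# Candidate true reward weight vectors over (gold, dirt, lava).
candidates = list(itertools.product([-1.0, 0.0, 1.0], repeat=3))

def likelihood(w_true, beta=2.0):
    """P(designer picks the proxy (1, 0, 0) | w_true), Boltzmann in the
    true return the proxy's optimal TRAINING behavior earns. That
    behavior collects gold and never touches lava (none exists), so it
    earns w_true[0] whatever the lava weight is."""
    return math.exp(beta * w_true[0])

post = {w: likelihood(w) for w in candidates}
z = sum(post.values())
post = {w: p / z for w, p in post.items()}

# Marginal belief about the lava weight remains uniform:
lava_belief = {v: sum(p for w, p in post.items() if w[2] == v)
               for v in (-1.0, 0.0, 1.0)}
print(lava_belief)   # ≈ 1/3 each: the proxy said nothing about lava

# Risk-averse planning scores stepping on lava by its worst
# posterior-supported weight, so the agent avoids it at test time.
worst_case_lava = min(v for v, p in lava_belief.items() if p > 0)
print(worst_case_lava)   # → -1.0
```

The posterior concentrates on a high gold weight (the only thing the proxy choice was evidence about) while staying maximally uncertain about lava, which is exactly the behavior the paper exploits.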

See also Active Inverse Reward Design (AN #24), which builds on this work.

Rohin’s opinion: I really like this paper as an example of how to apply the third principle. This was the paper that caused me to start thinking about the assumed vs. actual information content in things (here, the key insight is that RL typically assumes that the reward function conveys much more information than it actually does). That probably influenced the development of Preferences Implicit in the State of the World (AN #45), which is also an example of the third principle and this information-based viewpoint, as it argues that the state of the world is caused by human behavior and so contains information about human preferences.

It’s worth noting that in this paper the lava avoidance is due to both the belief over the true reward and the risk aversion. The agent would also avoid pots of gold in the test environment if it never saw them in the training environment. IRD only gives you the correct uncertainty over the true reward; it doesn’t tell you how to use that uncertainty. You would still need safe exploration, or some other source of information, if you want to reduce the uncertainty.