Approval-directed agents

Note: This is the first post from part two: ba­sic in­tu­itions of the se­quence on iter­ated am­plifi­ca­tion. The sec­ond part of the se­quence out­lines the ba­sic in­tu­itions that mo­ti­vate iter­ated am­plifi­ca­tion. I think that these in­tu­itions may be more im­por­tant than the scheme it­self, but they are con­sid­er­ably more in­for­mal.

Re­search in AI is steadily pro­gress­ing to­wards more flex­ible, pow­er­ful, and au­tonomous goal-di­rected be­hav­ior. This progress is likely to have sig­nifi­cant eco­nomic and hu­man­i­tar­ian benefits: it helps make au­toma­tion faster, cheaper, and more effec­tive, and it al­lows us to au­to­mate de­cid­ing what to do.

Many re­searchers ex­pect goal-di­rected ma­chines to pre­dom­i­nate, and so have con­sid­ered the long-term im­pli­ca­tions of this kind of au­toma­tion. Some of these im­pli­ca­tions are wor­ry­ing: if so­phis­ti­cated ar­tifi­cial agents pur­sue their own ob­jec­tives and are as smart as we are, then the fu­ture may be shaped as much by their goals as by ours.

Most think­ing about “AI safety” has fo­cused on the pos­si­bil­ity of goal-di­rected ma­chines, and asked how we might en­sure that their goals are agree­able to hu­mans. But there are other pos­si­bil­ities.

In this post I will flesh out one al­ter­na­tive to goal-di­rected be­hav­ior. I think this idea is par­tic­u­larly im­por­tant from the per­spec­tive of AI safety.

Ap­proval-di­rected agents

Con­sider a hu­man Hugh, and an agent Arthur who uses the fol­low­ing pro­ce­dure to choose each ac­tion:

Es­ti­mate the ex­pected rat­ing Hugh would give each ac­tion if he con­sid­ered it at length. Take the ac­tion with the high­est ex­pected rat­ing.

I’ll call this “ap­proval-di­rected” be­hav­ior through­out this post, in con­trast with goal-di­rected be­hav­ior. In this con­text I’ll call Hugh an “over­seer.”
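As a minimal sketch of the procedure above, assuming Arthur already has some way to estimate Hugh's considered rating: the function name `choose_action` and the toy numbers are illustrative assumptions, not part of the proposal, and the hard part (producing the estimates) is deliberately left as a black box.

```python
from typing import Callable, Iterable, TypeVar

Action = TypeVar("Action")


def choose_action(
    actions: Iterable[Action],
    expected_rating: Callable[[Action], float],
) -> Action:
    """Return the action with the highest estimated rating.

    `expected_rating` stands in for Arthur's estimate of the rating Hugh
    would give the action if he considered it at length.
    """
    return max(actions, key=expected_rating)


# Hypothetical estimates, made up purely for illustration.
toy_estimates = {
    "send_report": 0.8,
    "delete_logs": 0.1,
    "ask_for_clarification": 0.6,
}

best = choose_action(toy_estimates, toy_estimates.get)
print(best)  # send_report
```

Note that nothing in this loop refers to outcomes or goals; all of the interesting behavior lives in how the estimate is produced.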

Arthur’s ac­tions are rated more highly than those pro­duced by any al­ter­na­tive pro­ce­dure. That’s com­fort­ing, but it doesn’t mean that Arthur is op­ti­mal. An op­ti­mal agent may make de­ci­sions that have con­se­quences Hugh would ap­prove of, even if Hugh can’t an­ti­ci­pate those con­se­quences him­self. For ex­am­ple, if Arthur is play­ing chess he should make moves that are ac­tu­ally good—not moves that Hugh thinks are good.

The qual­ity of ap­proval-di­rected de­ci­sions is limited by the min­i­mum of Arthur’s abil­ity and Hugh’s abil­ity: Arthur makes a de­ci­sion only if it looks good to both Arthur and Hugh. So why would Hugh be in­ter­ested in this pro­posal, rather than do­ing things him­self?

  • Hugh doesn’t ac­tu­ally rate ac­tions, he just par­ti­ci­pates in a hy­po­thet­i­cal rat­ing pro­cess. So Hugh can over­see many agents like Arthur at once (and spend his ac­tual time re­lax­ing on the beach). In many cases, this is the whole point of au­toma­tion.

  • Hugh can (hy­po­thet­i­cally) think for a very long time about each de­ci­sion—longer than would be prac­ti­cal or cost-effec­tive if he had to ac­tu­ally make the de­ci­sion him­self.

  • Similarly, Hugh can think about Arthur’s de­ci­sions at a very low level of de­tail. For ex­am­ple, Hugh might rate a chess-play­ing AI’s choices about how to ex­plore the game tree, rather than rat­ing its fi­nal choice of moves. If Arthur is mak­ing billions of small de­ci­sions each sec­ond, then Hugh can think in depth about each of them, and the re­sult­ing sys­tem can be much smarter than Hugh.

  • Hugh can (hy­po­thet­i­cally) use ad­di­tional re­sources in or­der to make his rat­ing: pow­er­ful com­put­ers, the benefit of hind­sight, many as­sis­tants, very long time pe­ri­ods.

  • Hugh’s ca­pa­bil­ities can be grad­u­ally es­ca­lated as needed, and one ap­proval-di­rected sys­tem can be used to boot­strap to a more effec­tive suc­ces­sor. For ex­am­ple, Arthur could ad­vise Hugh on how to define a bet­ter over­seer; Arthur could offer ad­vice in real-time to help Hugh be a bet­ter over­seer; or Arthur could di­rectly act as an over­seer for his more pow­er­ful suc­ces­sor.

In most situ­a­tions, I would ex­pect ap­proval-di­rected be­hav­ior to cap­ture the benefits of goal-di­rected be­hav­ior, while be­ing eas­ier to define and more ro­bust to er­rors.


Fa­cil­i­tate in­di­rect normativity

Ap­proval-di­rec­tion is closely re­lated to what Nick Bostrom calls “in­di­rect nor­ma­tivity” — de­scribing what is good in­di­rectly, by de­scribing how to tell what is good. I think this idea en­com­passes the most cred­ible pro­pos­als for defin­ing a pow­er­ful agent’s goals, but has some prac­ti­cal difficul­ties.

Ask­ing an over­seer to eval­u­ate out­comes di­rectly re­quires defin­ing an ex­tremely in­tel­li­gent over­seer, one who is equipped (at least in prin­ci­ple) to eval­u­ate the en­tire fu­ture of the uni­verse. This is prob­a­bly im­prac­ti­cal overkill for the kinds of agents we will be build­ing in the near fu­ture, who don’t have to think about the en­tire fu­ture of the uni­verse.

Ap­proval-di­rected be­hav­ior pro­vides a more re­al­is­tic al­ter­na­tive: start with sim­ple ap­proval-di­rected agents and sim­ple over­seers, and scale up the over­seer and the agent in par­allel. I ex­pect the ap­proval-di­rected dy­namic to con­verge to the de­sired limit; this re­quires only that the sim­ple over­seers ap­prove of scal­ing up to more pow­er­ful over­seers, and that they are able to rec­og­nize ap­pro­pri­ate im­prove­ments.

Avoid lock-in

Some approaches to AI require “locking in” design decisions. For example, if we build a goal-directed AI with the wrong goals then the AI might never correct the mistake on its own. For sufficiently sophisticated AIs, such mistakes may be very expensive to fix. There are also more subtle forms of lock-in: an AI may also not be able to fix a bad choice of decision theory, sufficiently bad priors, or a bad attitude towards infinity. It’s hard to know what other properties we might inadvertently lock in.

Ap­proval-di­rec­tion in­volves only ex­tremely min­i­mal com­mit­ments. If an ap­proval-di­rected AI en­coun­ters an un­fore­seen situ­a­tion, it will re­spond in the way that we most ap­prove of. We don’t need to make a de­ci­sion un­til the situ­a­tion ac­tu­ally arises.

Per­haps most im­por­tantly, an ap­proval-di­rected agent can cor­rect flaws in its own de­sign, and will search for flaws if we want it to. It can change its own de­ci­sion-mak­ing pro­ce­dure, its own rea­son­ing pro­cess, and its own over­seer.

Fail gracefully

Ap­proval-di­rec­tion seems to “fail grace­fully:” if we slightly mess up the speci­fi­ca­tion, the ap­proval-di­rected agent prob­a­bly won’t be ac­tively mal­i­cious. For ex­am­ple, sup­pose that Hugh was feel­ing ex­tremely ap­a­thetic and so eval­u­ated pro­posed ac­tions only su­perfi­cially. The re­sult­ing agent would not ag­gres­sively pur­sue a flawed re­al­iza­tion of Hugh’s val­ues; it would just be­have lack­adaisi­cally. The mis­take would be quickly no­ticed, un­less Hugh de­liber­ately ap­proved of ac­tions that con­cealed the mis­take.

This looks like an im­prove­ment over mis­spec­i­fy­ing goals, which leads to sys­tems that are ac­tively op­posed to their users. Such sys­tems are mo­ti­vated to con­ceal pos­si­ble prob­lems and to be­have mal­i­ciously.

The same prin­ci­ple some­times ap­plies if you define the right over­seer but the agent rea­sons in­cor­rectly about it, if you mis­spec­ify the en­tire rat­ing pro­cess, or if your sys­tem doesn’t work quite like you ex­pect. Any of these mis­takes could be se­ri­ous for a goal-di­rected agent, but are prob­a­bly han­dled grace­fully by an ap­proval-di­rected agent.

Similarly, if Arthur is smarter than Hugh ex­pects, the only prob­lem is that Arthur won’t be able to use all of his in­tel­li­gence to de­vise ex­cel­lent plans. This is a se­ri­ous prob­lem, but it can be fixed by trial and er­ror—rather than lead­ing to sur­pris­ing failure modes.

Is it plau­si­ble?

I’ve already men­tioned the prac­ti­cal de­mand for goal-di­rected be­hav­ior and why I think that ap­proval-di­rected be­hav­ior satis­fies that de­mand. There are other rea­sons to think that agents might be goal-di­rected. Th­ese are all vari­a­tions on the same theme, so I apol­o­gize if my re­sponses be­come repet­i­tive.

In­ter­nal de­ci­sion-making

We as­sumed that Arthur can pre­dict what ac­tions Hugh will rate highly. But in or­der to make these pre­dic­tions, Arthur might use goal-di­rected be­hav­ior. For ex­am­ple, Arthur might perform a calcu­la­tion be­cause he be­lieves it will help him pre­dict what ac­tions Hugh will rate highly. Our ap­par­ently ap­proval-di­rected de­ci­sion-maker may have goals af­ter all, on the in­side. Can we avoid this?

I think so: Arthur’s in­ter­nal de­ci­sions could also be ap­proval-di­rected. Rather than perform­ing a calcu­la­tion be­cause it will help make a good pre­dic­tion, Arthur can perform that calcu­la­tion be­cause Hugh would rate this de­ci­sion highly. If Hugh is co­her­ent, then tak­ing in­di­vi­d­ual steps that Hugh rates highly leads to over­all be­hav­ior that Hugh would ap­prove of, just like tak­ing in­di­vi­d­ual steps that max­i­mize X leads to be­hav­ior that max­i­mizes X.

In fact the result may be more desirable, from Hugh’s perspective, than maximizing Hugh’s approval. For example, Hugh might incorrectly rate some actions highly, because he doesn’t understand them. An agent maximizing Hugh’s approval might find those actions and take them. But if the agent was internally approval-directed, then it wouldn’t try to exploit errors in Hugh’s ratings. Actions that lead to reported approval but not real approval don’t lead to approval for approved reasons.

Tur­tles all the way down?

Ap­proval-di­rec­tion stops mak­ing sense for low-level de­ci­sions. A pro­gram moves data from reg­ister A into reg­ister B be­cause that’s what the next in­struc­tion says, not be­cause that’s what Hugh would ap­prove of. After all, de­cid­ing whether Hugh would ap­prove it­self re­quires mov­ing data from one reg­ister to an­other, and we would be left with an in­finite regress.

The same thing is true for goal-di­rected be­hav­ior. Low-level ac­tions are taken be­cause the pro­gram­mer chose them. The pro­gram­mer may have cho­sen them be­cause she thought they would help the sys­tem achieve its goal, but the ac­tions them­selves are performed be­cause that’s what’s in the code, not be­cause of an ex­plicit be­lief that they will lead to the goal. Similarly, ac­tions might be performed be­cause a sim­ple heuris­tic sug­gests they will con­tribute to the goal — the heuris­tic was cho­sen or learned be­cause it was ex­pected to be use­ful for the goal, but the ac­tion is mo­ti­vated by the heuris­tic. Tak­ing the ac­tion doesn’t in­volve think­ing about the heuris­tic, just fol­low­ing it.

Similarly, an approval-directed agent might perform an action because it’s the next instruction in the program, or because it’s recommended by a simple heuristic. The program or heuristic might have been chosen to result in approved actions, but taking the action doesn’t involve reasoning about approval. The aggregate effect of using and refining such heuristics is to effectively do what the user approves of.

In many cases, per­haps a ma­jor­ity, the heuris­tics for goal-di­rected and ap­proval-di­rected be­hav­ior will co­in­cide. To an­swer “what do I want this func­tion to do next?” I very of­ten ask “what do I want the end re­sult to be?” In these cases the differ­ence is in how we think about the be­hav­ior of the over­all sys­tem, and what in­var­i­ants we try to main­tain as we de­sign it.

Rel­a­tive difficulty?

Ap­proval-di­rected sub­sys­tems might be harder to build than goal-di­rected sub­sys­tems. For ex­am­ple, there is much more data of the form “X leads to Y” than of the form “the user ap­proves of X.” This is a typ­i­cal AI prob­lem, though, and can be ap­proached us­ing typ­i­cal tech­niques.

Ap­proval-di­rected sub­sys­tems might also be eas­ier to build, and I think this is the case to­day. For ex­am­ple, I re­cently wrote a func­tion to de­cide which of two meth­ods to use for the next step of an op­ti­miza­tion. Right now it uses a sim­ple heuris­tic with mediocre perfor­mance. But I could also have la­beled some ex­am­ples as “use method A” or “use method B,” and trained a model to pre­dict what I would say. This model could then be used to de­cide when to use A, when to use B, and when to ask me for more train­ing data.
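The labelled-examples idea can be sketched as follows. This is only an illustration under stated assumptions: the single numeric feature, the nearest-neighbour vote, the function name `decide`, and the confidence threshold are all hypothetical choices, not the function described above.

```python
from typing import List, Tuple

# Each example is (feature, label): a hypothetical numeric summary of the
# optimization state, and the method the user said they would want there.
Example = Tuple[float, str]


def decide(examples: List[Example], x: float, k: int = 3,
           confidence: float = 0.8) -> str:
    """Predict "A" or "B" from the k nearest labelled examples.

    Returns "ask" when the neighbours disagree too much (or there is no
    data), signalling that more training data should be requested.
    """
    if not examples:
        return "ask"
    nearest = sorted(examples, key=lambda ex: abs(ex[0] - x))[:k]
    share_a = sum(1 for _, label in nearest if label == "A") / len(nearest)
    if share_a >= confidence:
        return "A"
    if share_a <= 1 - confidence:
        return "B"
    return "ask"


labelled = [(0.1, "A"), (0.2, "A"), (0.3, "A"),
            (0.8, "B"), (0.9, "B"), (1.0, "B")]

print(decide(labelled, 0.15))  # A
print(decide(labelled, 0.95))  # B
print(decide(labelled, 0.55))  # ask
```

The "ask" branch is the part that distinguishes this from an ordinary classifier: the subsystem defers to the user exactly where its prediction of the user's answer is least reliable.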

Reflec­tive stability

Ra­tional goal-di­rected be­hav­ior is re­flec­tively sta­ble: if you want X, you gen­er­ally want to con­tinue want­ing X. Can ap­proval-di­rected be­hav­ior have the same prop­erty?

Ap­proval-di­rected sys­tems in­herit re­flec­tive sta­bil­ity (or in­sta­bil­ity) from their over­seers. Hugh can de­ter­mine whether Arthur “wants” to re­main ap­proval-di­rected, by ap­prov­ing or dis­ap­prov­ing of ac­tions that would change Arthur’s de­ci­sion-mak­ing pro­cess.

Goal-di­rected agents want to be wiser and know more, though their goals are sta­ble. Ap­proval-di­rected agents also want to be wiser and know more, but they also want their over­seers to be wiser and know more. The over­seer is not sta­ble, but the over­seer’s val­ues are. This is a fea­ture, not a bug.

Similarly, an agent com­posed of ap­proval-di­rected sub­sys­tems over­seen by Hugh is not the same as an ap­proval-di­rected agent over­seen by Hugh. For ex­am­ple, the com­pos­ite may make de­ci­sions too sub­tle for Hugh to un­der­stand. Again, this is a fea­ture, not a bug.

(Note: I no longer agree with the con­clu­sions of this sec­tion. I now feel that ap­proval-di­rected agents can prob­a­bly be con­structed out of pow­er­ful black-box search (or stochas­tic gra­di­ent de­scent); my main pri­or­ity is now ei­ther han­dling this set­ting or else un­der­stand­ing ex­actly what the ob­struc­tion is. On­go­ing work in this di­rec­tion is col­lected at ai-con­trol, and will hope­fully be pub­lished in a clear for­mat by the end of 2016.)

Some approaches to AI probably can’t yield approval-directed agents. For example, we could perform a search which treats possible agents as black boxes and measures their behavior for signs of intelligence. Such a search could (eventually) find a human-level intelligence, but would give us very crude control over how that intelligence was applied. We could get some kind of goal-directed behavior by selecting for it, but selecting for approval-directed behavior would be difficult:

  1. The paucity of data on ap­proval is a huge prob­lem in this set­ting. (Note: semi-su­per­vised re­in­force­ment learn­ing is an ap­proach to this prob­lem.)

  2. You have no con­trol over the in­ter­nal be­hav­ior of the agent, which you would ex­pect to be op­ti­mized for pur­su­ing a par­tic­u­lar goal: max­i­miz­ing what­ever mea­sure of “ap­proval” that you used to guide your search. (Note: I no longer en­dorse this ar­gu­ment as writ­ten; re­ward en­g­ineer­ing is a re­sponse to the sub­stance of this con­cern.)

  3. Agents who max­i­mized your re­ported ap­proval in test cases need not do so in gen­eral, any more than hu­mans are re­li­able re­pro­duc­tive-fit­ness-max­i­miz­ers. (Note: red team­ing is an ap­proach to this prob­lem.)

But [1] and es­pe­cially [3] are also prob­lems when de­sign­ing a goal-di­rected agent with agree­able goals, or in­deed any par­tic­u­lar goals at all. Though ap­proval-di­rec­tion can’t deal with these prob­lems, they aren’t new prob­lems.

Such a black-box search—with lit­tle in­sight into the in­ter­nal struc­ture of the agents—seems wor­ry­ing no mat­ter how we ap­proach AI safety. For­tu­nately, it also seems un­likely (though not out of the ques­tion).

A similar search is more likely to be used to pro­duce in­ter­nal com­po­nents of a larger sys­tem (for ex­am­ple, you might train a neu­ral net­work to iden­tify ob­jects, as a com­po­nent of a sys­tem for nav­i­gat­ing an un­known en­vi­ron­ment). This pre­sents similar challenges, con­cern­ing ro­bust­ness and un­in­tended be­hav­iors, whether we are de­sign­ing a goal-di­rected or ap­proval-di­rected agent.

“Im­ple­men­ta­tion” details

So far I’ve talked about ap­proval-di­rec­tion im­pre­cisely. Maybe I’m talk­ing about some­thing in­co­her­ent, which has de­sir­able prop­er­ties only in the same sense as a four-sided tri­an­gle—vac­u­ously. I won’t re­ally be able to dis­pel this con­cern here, but I’ll at least take some steps.

How do you define ap­proval?

Even­tu­ally you would have to ac­tu­ally write code im­ple­ment­ing ap­proval-di­rected be­hav­ior. What might that code look like? I want to set aside the prob­lem “what does a so­phis­ti­cated AI look like?” since I ob­vi­ously don’t know. So let’s sup­pose we had some black box that did all of the hard work. I’ll con­sider a few cases for what the black box does, rang­ing from “easy to work with” to “very hard to work with.”

(Note: I now be­lieve that we can tar­get AI sys­tems trained (nearly) end-to-end with gra­di­ent de­scent, which is most similar to “learn­ing from ex­am­ples.”)

Nat­u­ral language

As an easy case, sup­pose we have a nat­u­ral lan­guage ques­tion-an­swer­ing sys­tem, which can as­sign a prob­a­bil­ity to any nat­u­ral lan­guage sen­tence. In this case, we ask the ques­tion:

“Suppose that Hugh understood the current situation, was asked ‘on a scale from 0 to 1, how good is the action a?’ and was given a few hours to determine his answer. What would his answer be?”

We then loop over each ac­tion a and take the ac­tion with the high­est ex­pected an­swer.
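The loop can be sketched as below, assuming a question-answering system that returns the expected value of the answer to a natural language question. The `expected_answer` callable and the toy stand-in are hypothetical; no real question-answering API is implied.

```python
from typing import Callable, Sequence

QUESTION_TEMPLATE = (
    "Suppose that Hugh understood the current situation, was asked "
    "'on a scale from 0 to 1, how good is the action {action}?' and was "
    "given a few hours to determine his answer. What would his answer be?"
)


def approval_directed_choice(
    actions: Sequence[str],
    expected_answer: Callable[[str], float],
) -> str:
    """Loop over the actions and return the one with the highest expected answer."""
    if not actions:
        raise ValueError("no actions to choose from")
    best_action, best_score = actions[0], float("-inf")
    for action in actions:
        score = expected_answer(QUESTION_TEMPLATE.format(action=action))
        if score > best_score:
            best_action, best_score = action, score
    return best_action


def toy_expected_answer(question: str) -> float:
    # Hypothetical stand-in for the question-answering black box.
    return 0.9 if "water the plants" in question else 0.2


choice = approval_directed_choice(
    ["unplug the fridge", "water the plants"], toy_expected_answer
)
print(choice)  # water the plants
```

In this framing, swapping in a different overseer only means editing the template text.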

In this frame­work, it is easy to re­place Hugh by a more pow­er­ful over­seer—all you have to do is spec­ify the re­place­ment in nat­u­ral lan­guage.

“Math in­tu­ition mod­ule”

At an op­po­site ex­treme, sup­pose we have a “math in­tu­ition mod­ule,” a sys­tem which can as­sign prob­a­bil­ities only to perfectly pre­cise state­ments—per­haps of the form “al­gorithm A re­turns out­put y on in­put x.”

I’ve written about defining “approval upon reflection” algorithmically (see here, here). These definitions can be used to define approval-directed behavior completely precisely. I’m pretty hesitant about these definitions, but I do think it is promising that we can get traction even in such an extreme case.

In re­al­ity, I ex­pect the situ­a­tion to be some­where in be­tween the sim­ple case of nat­u­ral lan­guage and the hard case of math­e­mat­i­cal rigor. Nat­u­ral lan­guage is the case where we share all of our con­cepts with our ma­chines, while math­e­mat­ics is the case where we share only the most prim­i­tive con­cepts. In re­al­ity, I ex­pect we will share some but not all of our con­cepts, with vary­ing de­grees of ro­bust­ness. To the ex­tent that ap­proval-di­rected de­ci­sions are ro­bust to im­pre­ci­sion, we can safely use some more com­pli­cated con­cepts, rather than try­ing to define what we care about in terms of log­i­cal prim­i­tives.

Learn­ing from examples

In an even harder case, sup­pose we have a func­tion learner which can take some la­bel­led ex­am­ples f(x) = y and then pre­dict a new value f(x’). In this case we have to define “Hugh’s ap­proval” di­rectly via ex­am­ples. I feel less com­fortable with this case, but I’ll take a shot any­way.

In this case, our approval-directed agent Arthur maintains a probabilistic model over the sequences observation[T] and approval[T](a). At each step T, Arthur selects the action a maximizing his expectation of approval[T](a). Then the timer T is incremented, and Arthur records observation[T+1] from his sensors. Optionally, Hugh might specify a value approval[t](a’) for any time t and any action a’. Then Arthur updates his models, and the process continues.

Like AIXI, if Arthur is clever enough he eventually learns that approval[T](a) refers to whatever Hugh will retroactively input. But unlike AIXI, Arthur will make no effort to manipulate these judgments. Instead he takes the action maximizing his expectation of approval[T], i.e., his prediction about what Hugh will say in the future, if Hugh says anything at all. (This depends on his self-predictions, since what Hugh does in the future depends on what Arthur does now.)
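A toy version of this protocol might look like the following. It is a sketch under heavy assumptions: the class name `ToyArthur`, the per-(observation, action) averaging "model", and the default prior of 0.5 are all illustrative stand-ins for a real function learner, and the toy ignores the self-prediction subtlety mentioned above.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


class ToyArthur:
    """Picks the action with the highest predicted approval, then learns
    from whatever ratings Hugh retroactively supplies."""

    def __init__(self, actions: Sequence[str], prior: float = 0.5) -> None:
        self.actions = list(actions)
        self.prior = prior  # assumed rating when no labels exist yet
        self.labels: Dict[Tuple[Hashable, str], List[float]] = defaultdict(list)
        self.history: List[Tuple[Hashable, str]] = []  # (observation, action) per step

    def predicted_approval(self, observation: Hashable, action: str) -> float:
        ratings = self.labels.get((observation, action))
        return sum(ratings) / len(ratings) if ratings else self.prior

    def act(self, observation: Hashable) -> str:
        # Step T: choose the action maximizing predicted approval[T](a).
        action = max(
            self.actions,
            key=lambda a: self.predicted_approval(observation, a),
        )
        self.history.append((observation, action))
        return action

    def record_approval(self, t: int, action: str, rating: float) -> None:
        # Hugh may rate any action at any past step t, not only the one taken.
        observation, _ = self.history[t]
        self.labels[(observation, action)].append(rating)


arthur = ToyArthur(["wait", "fetch coffee"])
first = arthur.act("morning")            # no data yet, so the tie goes to "wait"
arthur.record_approval(0, "fetch coffee", 0.9)
arthur.record_approval(0, "wait", 0.2)
second = arthur.act("morning")
print(first, "->", second)  # wait -> fetch coffee
```

Nothing in `act` tries to influence what Hugh will later enter; the ratings are treated purely as something to predict.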

At any rate, this is quite a lot bet­ter than AIXI, and it might turn out fine if you ex­er­cise ap­pro­pri­ate cau­tion. I wouldn’t want to use it in a high-stakes situ­a­tion, but I think that it is a promis­ing idea and that there are many nat­u­ral di­rec­tions for im­prove­ment. For ex­am­ple, we could provide fur­ther facts about ap­proval (be­yond ex­am­ple val­ues), in­ter­po­lat­ing con­tin­u­ously be­tween learn­ing from ex­am­ples and us­ing an ex­plicit defi­ni­tion of the ap­proval func­tion. More am­bi­tiously, we could im­ple­ment “ap­proval-di­rected learn­ing,” pre­vent­ing it from learn­ing com­pli­cated un­de­sired con­cepts.

How should Hugh rate?

So far I’ve been very vague about what Hugh should ac­tu­ally do when rat­ing an ac­tion. But the ap­proval-di­rected be­hav­ior de­pends on how Hugh de­cides to ad­minister ap­proval. How should Hugh de­cide?

If Hugh ex­pects ac­tion a to yield bet­ter con­se­quences than ac­tion b, then he should give ac­tion a a higher rat­ing than ac­tion b. In sim­ple en­vi­ron­ments he can sim­ply pick the best ac­tion, give it a rat­ing of 1, and give the other op­tions a rat­ing of 0.

If Arthur is so much smarter than Hugh that he knows ex­actly what Hugh will say, then we might as well stop here. In this case, ap­proval-di­rec­tion amounts to Arthur do­ing ex­actly what Hugh in­structs: “the min­i­mum of Arthur’s ca­pa­bil­ities and Hugh’s ca­pa­bil­ities” is equal to “Hugh’s ca­pa­bil­ities.”

But most of the time, Arthur won’t be able to tell exactly what Hugh will say. The numerical scale between 0 and 1 exists to accommodate Arthur’s uncertainty.

To illustrate the possible problems, suppose that Arthur is considering whether to drive across a bridge that may or may not collapse. Arthur thinks the bridge will collapse with 1% probability. But Arthur also thinks that Hugh knows for sure whether or not the bridge will collapse. If Hugh always assigned the optimal action a rating of 1 and every other action a rating of 0, then Arthur would take the action that was most likely to be optimal: driving across the bridge.

Hugh should have done one of two things:

  • Give a bad rat­ing for risky be­hav­ior. Hugh should give Arthur a high rat­ing only if he drives across the bridge and knows that it is safe. In gen­eral, give a rat­ing of 1 to the best ac­tion ex ante.

  • As­sign a very bad rat­ing to in­cor­rectly driv­ing across the bridge, and only a small penalty for be­ing too cau­tious. In gen­eral, give rat­ings that re­flect the util­ities of pos­si­ble out­comes—to the ex­tent you know them.

Prob­a­bly Hugh should do both. This is eas­ier if Hugh un­der­stands what Arthur is think­ing and why, and what range of pos­si­bil­ities Arthur is con­sid­er­ing.
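The bridge example can be worked through numerically. The sketch below compares the "1 for the optimal action, 0 otherwise" scheme with a utility-reflecting scheme (the second option above); the specific rating values and the names `optimal_only` and `utility_based` are made-up assumptions chosen only to illustrate the effect.

```python
from typing import Dict

Ratings = Dict[str, float]  # Hugh's rating in each world: "safe" or "collapse"


def expected_rating(ratings: Ratings, p_collapse: float) -> float:
    """Arthur's expected rating, given his probability that the bridge collapses."""
    return (1 - p_collapse) * ratings["safe"] + p_collapse * ratings["collapse"]


def best_action(scheme: Dict[str, Ratings], p_collapse: float) -> str:
    return max(scheme, key=lambda a: expected_rating(scheme[a], p_collapse))


P_COLLAPSE = 0.01  # Arthur's belief, from the example above

# Scheme 1: rating 1 for whichever action is optimal in hindsight, else 0.
optimal_only = {
    "drive": {"safe": 1.0, "collapse": 0.0},
    "detour": {"safe": 0.0, "collapse": 1.0},
}

# Scheme 2: very bad rating for driving onto a collapsing bridge,
# only a small penalty for being too cautious (illustrative numbers).
utility_based = {
    "drive": {"safe": 1.0, "collapse": 0.0},
    "detour": {"safe": 0.999, "collapse": 0.999},
}

print(best_action(optimal_only, P_COLLAPSE))   # drive  (0.99 vs 0.01)
print(best_action(utility_based, P_COLLAPSE))  # detour (0.99 vs 0.999)
```

Under the first scheme Arthur drives because driving is right 99% of the time; under the second, the rare catastrophe dominates and he takes the detour.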

Other details

I am leav­ing out many other im­por­tant de­tails in the in­ter­est of brevity. For ex­am­ple:

  • In order to make these evaluations Hugh might want to understand what Arthur is thinking and why. This might be accomplished by giving Hugh enough time and resources to understand Arthur’s thoughts; or by letting different instances of Hugh “communicate” to keep track of what is going on as Arthur’s thoughts evolve; or by ensuring that Arthur’s thoughts remain comprehensible to Hugh (perhaps by using approval-directed behavior at a lower level, and only approving of internal changes that can be rendered comprehensible).

  • It is best if Hugh op­ti­mizes his rat­ings to en­sure the sys­tem re­mains ro­bust. For ex­am­ple, in high stakes set­tings, Hugh should some­times make Arthur con­sult the real Hugh to de­cide how to pro­ceed—even if Arthur cor­rectly knows what Hugh wants. This en­sures that Arthur will seek guidance when he in­cor­rectly be­lieves that he knows what Hugh wants.

…and so on. The de­tails I have in­cluded should be con­sid­ered illus­tra­tive at best. (I don’t want any­one to come away with a false sense of pre­ci­sion.)


It would be sloppy to end the post with­out a sam­pling of pos­si­ble pit­falls. For the most part these prob­lems have more se­vere analogs for goal-di­rected agents, but it’s still wise to keep them in mind when think­ing about ap­proval-di­rected agents in the con­text of AI safety.

My biggest concerns

I have three big con­cerns with ap­proval-di­rected agents, which are my pri­ori­ties for fol­low-up re­search:

  • Is an ap­proval-di­rected agent gen­er­ally as use­ful as a goal-di­rected agent, or does this re­quire the over­seer to be (ex­tremely) pow­er­ful? Based on the ideas in this post, I am cau­tiously op­ti­mistic.

  • Can we ac­tu­ally define ap­proval-di­rected agents by ex­am­ples, or do they already need a shared vo­cab­u­lary with their pro­gram­mers? I am again cau­tiously op­ti­mistic.

  • Is it re­al­is­tic to build an in­tel­li­gent ap­proval-di­rected agent with­out in­tro­duc­ing goal-di­rected be­hav­ior in­ter­nally? I think this is prob­a­bly the most im­por­tant fol­low-up ques­tion. I would guess that the an­swer will be “it de­pends on how AI plays out,” but we can at least get in­sight by ad­dress­ing the ques­tion in a va­ri­ety of con­crete sce­nar­ios.

Mo­ti­va­tional changes for the overseer

“What would I say if I thought for a very long time?” might have a sur­pris­ing an­swer. The very pro­cess of think­ing harder, or of find­ing my­self in a thought ex­per­i­ment, might al­ter my pri­ori­ties. I may care less about the real world, or may be­come con­vinced that I am liv­ing in a simu­la­tion.

This is a par­tic­u­larly se­vere prob­lem for my pro­posed im­ple­men­ta­tion of in­di­rect nor­ma­tivity, which in­volves a truly out­landish pro­cess of re­flec­tion. It’s still a pos­si­ble prob­lem for defin­ing ap­proval-di­rec­tion, but I think it is much less se­vere.

“What I would say af­ter a few hours,” is close enough to real life that I wouldn’t ex­pect my thought pro­cess to di­verge too far from re­al­ity, ei­ther in val­ues or be­liefs. Short time pe­ri­ods are much eas­ier to pre­dict, and give less time to ex­plore com­pletely unan­ti­ci­pated lines of thought. In prac­tice, I sus­pect we can also define some­thing like “what I would say af­ter a few hours of sit­ting at my desk un­der com­pletely nor­mal con­di­tions,” which looks par­tic­u­larly in­nocu­ous.

Over time we will build more powerful AIs with more powerful (and perhaps more exotic) overseers, but making these changes gradually is much easier than making them all at once: small changes are more predictable, and each successive change can be made with the help of increasingly powerful assistants.

Treach­er­ous turn

If Hugh in­ad­ver­tently speci­fies the wrong over­seer, then the re­sult­ing agent might be mo­ti­vated to de­ceive him. Any ra­tio­nal over­seer will be mo­ti­vated to ap­prove of ac­tions that look rea­son­able to Hugh. If they don’t, Hugh will no­tice the prob­lem and fix the bug, and the origi­nal over­seer will lose their in­fluence over the world.

This doesn’t seem like a big deal: a failed attempt to specify “Hugh” probably won’t inadvertently specify a different Hugh-level intelligence; it will probably fail innocuously.

There are some pos­si­ble ex­cep­tions, which mostly seem quite ob­scure but may be worth hav­ing in mind. The learn­ing-from-ex­am­ples pro­to­col seems par­tic­u­larly likely to have prob­lems. For ex­am­ple:

  • Some­one other than Hugh might be able to en­ter train­ing data for ap­proval[T](a). Depend­ing on how Arthur is defined, these ex­am­ples might in­fluence Arthur’s be­hav­ior as soon as Arthur ex­pects them to ap­pear. In the most patholog­i­cal case, these changes in Arthur’s be­hav­ior might have been the very rea­son that some­one had the op­por­tu­nity to en­ter fraud­u­lent train­ing data.

  • Arthur could ac­cept the mo­ti­vated simu­la­tion ar­gu­ment, be­liev­ing him­self to be in a simu­la­tion at the whim of a simu­la­tor at­tempt­ing to ma­nipu­late his be­hav­ior.

  • The sim­plest ex­pla­na­tion for Hugh’s judg­ments may be a sim­ple pro­gram mo­ti­vated to “mimic” the se­ries ap­proval[T] and ob­ser­va­tion[T] in or­der to in­fluence Arthur.


An ap­proval-di­rected agent may not be able to figure out what I ap­prove of.

I’m skeptical that this is a serious problem. It falls under the range of predictive problems I’d expect a sophisticated AI to be good at. So it’s a standard objective for AI research, and AIs that can’t make such predictions probably have significantly sub-human ability to act in the world. Moreover, even a fairly weak reasoner can learn generalizations like “actions that lead to Hugh getting candy tend to be approved of” or “actions that take control away from Hugh tend to be disapproved of.”

If there is a prob­lem, it doesn’t seem like a se­ri­ous one. Straight­for­ward mi­s­un­der­stand­ings will lead to an agent that is in­ert rather than ac­tively mal­i­cious (see the “Fail grace­fully” sec­tion). And deep mi­s­un­der­stand­ings can be avoided, by Hugh ap­prov­ing of the de­ci­sion “con­sult Hugh.”


Mak­ing de­ci­sions by ask­ing “what ac­tion would your owner most ap­prove of?” may be more ro­bust than ask­ing “what out­come would your owner most ap­prove of?” Choos­ing ac­tions di­rectly has limi­ta­tions, but these might be over­come by a care­ful im­ple­men­ta­tion.

More gen­er­ally, the fo­cus on achiev­ing safe goal-di­rected be­hav­ior may have par­tially ob­scured the larger pur­pose of the AI safety com­mu­nity, which should be achiev­ing safe and use­ful be­hav­ior. It may turn out that goal-di­rected be­hav­ior re­ally is in­evitable or ir­re­place­able, but the case has not yet been set­tled.

This es­say was origi­nally posted here on 1st De­cem­ber 2014.

The next posts in this se­quence will be ‘Ap­proval di­rected boot­strap­ping’ and ‘Hu­mans con­sult­ing HCH’, two short posts which will come out on Sun­day 25th Novem­ber.
