Approval-directed agents: overview

Note: This is the first post from part two: ba­sic in­tu­itions of the se­quence on iter­ated am­plifi­ca­tion. The sec­ond part of the se­quence out­lines the ba­sic in­tu­itions that mo­ti­vate iter­ated am­plifi­ca­tion. I think that these in­tu­itions may be more im­por­tant than the scheme it­self, but they are con­sid­er­ably more in­for­mal.

Re­search in AI is steadily pro­gress­ing to­wards more flex­ible, pow­er­ful, and au­tonomous goal-di­rected be­hav­ior. This progress is likely to have sig­nifi­cant eco­nomic and hu­man­i­tar­ian benefits: it helps make au­toma­tion faster, cheaper, and more effec­tive, and it al­lows us to au­to­mate de­cid­ing what to do.

Many re­searchers ex­pect goal-di­rected ma­chines to pre­dom­i­nate, and so have con­sid­ered the long-term im­pli­ca­tions of this kind of au­toma­tion. Some of these im­pli­ca­tions are wor­ry­ing: if so­phis­ti­cated ar­tifi­cial agents pur­sue their own ob­jec­tives and are as smart as we are, then the fu­ture may be shaped as much by their goals as by ours.

Most think­ing about “AI safety” has fo­cused on the pos­si­bil­ity of goal-di­rected ma­chines, and asked how we might en­sure that their goals are agree­able to hu­mans. But there are other pos­si­bil­ities.

In this post I will flesh out one al­ter­na­tive to goal-di­rected be­hav­ior. I think this idea is par­tic­u­larly im­por­tant from the per­spec­tive of AI safety.

Ap­proval-di­rected agents

Con­sider a hu­man Hugh, and an agent Arthur who uses the fol­low­ing pro­ce­dure to choose each ac­tion:

Es­ti­mate the ex­pected rat­ing Hugh would give each ac­tion if he con­sid­ered it at length. Take the ac­tion with the high­est ex­pected rat­ing.

I’ll call this “ap­proval-di­rected” be­hav­ior through­out this post, in con­trast with goal-di­rected be­hav­ior. In this con­text I’ll call Hugh an “over­seer.”
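As a rough illustration of the procedure above, here is a minimal sketch in Python. It assumes that Arthur already has some way to estimate Hugh's considered rating of an action; the `estimate_rating` callable and the toy ratings are hypothetical stand-ins, not part of the proposal itself.

```python
from typing import Callable, Iterable, TypeVar

Action = TypeVar("Action")


def approval_directed_choice(
    actions: Iterable[Action],
    estimate_rating: Callable[[Action], float],
) -> Action:
    """Return the action with the highest estimated overseer rating.

    `estimate_rating` is an assumed stand-in for Arthur's estimate of the
    rating Hugh would give the action if he considered it at length.
    """
    candidates = list(actions)
    if not candidates:
        raise ValueError("no actions to choose from")
    return max(candidates, key=estimate_rating)


# Toy example: made-up ratings standing in for Arthur's model of Hugh.
toy_ratings = {
    "reply politely": 0.9,
    "ignore the request": 0.2,
    "delete the files": 0.0,
}
chosen = approval_directed_choice(toy_ratings, lambda a: toy_ratings[a])
```

All of the difficulty is hidden inside `estimate_rating`; the selection step itself is just an argmax over the available actions.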

Arthur’s ac­tions are rated more highly than those pro­duced by any al­ter­na­tive pro­ce­dure. That’s com­fort­ing, but it doesn’t mean that Arthur is op­ti­mal. An op­ti­mal agent may make de­ci­sions that have con­se­quences Hugh would ap­prove of, even if Hugh can’t an­ti­ci­pate those con­se­quences him­self. For ex­am­ple, if Arthur is play­ing chess he should make moves that are ac­tu­ally good—not moves that Hugh thinks are good.

The qual­ity of ap­proval-di­rected de­ci­sions is limited by the min­i­mum of Arthur’s abil­ity and Hugh’s abil­ity: Arthur makes a de­ci­sion only if it looks good to both Arthur and Hugh. So why would Hugh be in­ter­ested in this pro­posal, rather than do­ing things him­self?

  • Hugh doesn’t ac­tu­ally rate ac­tions, he just par­ti­ci­pates in a hy­po­thet­i­cal rat­ing pro­cess. So Hugh can over­see many agents like Arthur at once (and spend his ac­tual time re­lax­ing on the beach). In many cases, this is the whole point of au­toma­tion.

  • Hugh can (hy­po­thet­i­cally) think for a very long time about each de­ci­sion—longer than would be prac­ti­cal or cost-effec­tive if he had to ac­tu­ally make the de­ci­sion him­self.

  • Similarly, Hugh can think about Arthur’s de­ci­sions at a very low level of de­tail. For ex­am­ple, Hugh might rate a chess-play­ing AI’s choices about how to ex­plore the game tree, rather than rat­ing its fi­nal choice of moves. If Arthur is mak­ing billions of small de­ci­sions each sec­ond, then Hugh can think in depth about each of them, and the re­sult­ing sys­tem can be much smarter than Hugh.

  • Hugh can (hy­po­thet­i­cally) use ad­di­tional re­sources in or­der to make his rat­ing: pow­er­ful com­put­ers, the benefit of hind­sight, many as­sis­tants, very long time pe­ri­ods.

  • Hugh’s ca­pa­bil­ities can be grad­u­ally es­ca­lated as needed, and one ap­proval-di­rected sys­tem can be used to boot­strap to a more effec­tive suc­ces­sor. For ex­am­ple, Arthur could ad­vise Hugh on how to define a bet­ter over­seer; Arthur could offer ad­vice in real-time to help Hugh be a bet­ter over­seer; or Arthur could di­rectly act as an over­seer for his more pow­er­ful suc­ces­sor.

In most situ­a­tions, I would ex­pect ap­proval-di­rected be­hav­ior to cap­ture the benefits of goal-di­rected be­hav­ior, while be­ing eas­ier to define and more ro­bust to er­rors.


Fa­cil­i­tate in­di­rect normativity

Ap­proval-di­rec­tion is closely re­lated to what Nick Bostrom calls “in­di­rect nor­ma­tivity” — de­scribing what is good in­di­rectly, by de­scribing how to tell what is good. I think this idea en­com­passes the most cred­ible pro­pos­als for defin­ing a pow­er­ful agent’s goals, but has some prac­ti­cal difficul­ties.

Ask­ing an over­seer to eval­u­ate out­comes di­rectly re­quires defin­ing an ex­tremely in­tel­li­gent over­seer, one who is equipped (at least in prin­ci­ple) to eval­u­ate the en­tire fu­ture of the uni­verse. This is prob­a­bly im­prac­ti­cal overkill for the kinds of agents we will be build­ing in the near fu­ture, who don’t have to think about the en­tire fu­ture of the uni­verse.

Ap­proval-di­rected be­hav­ior pro­vides a more re­al­is­tic al­ter­na­tive: start with sim­ple ap­proval-di­rected agents and sim­ple over­seers, and scale up the over­seer and the agent in par­allel. I ex­pect the ap­proval-di­rected dy­namic to con­verge to the de­sired limit; this re­quires only that the sim­ple over­seers ap­prove of scal­ing up to more pow­er­ful over­seers, and that they are able to rec­og­nize ap­pro­pri­ate im­prove­ments.

Avoid lock-in

Some approaches to AI require “locking in” design decisions. For example, if we build a goal-directed AI with the wrong goals then the AI might never correct the mistake on its own. For sufficiently sophisticated AIs, such mistakes may be very expensive to fix. There are also more subtle forms of lock-in: an AI may also not be able to fix a bad choice of decision theory, sufficiently bad priors, or a bad attitude towards infinity. It’s hard to know what other properties we might inadvertently lock in.

Ap­proval-di­rec­tion in­volves only ex­tremely min­i­mal com­mit­ments. If an ap­proval-di­rected AI en­coun­ters an un­fore­seen situ­a­tion, it will re­spond in the way that we most ap­prove of. We don’t need to make a de­ci­sion un­til the situ­a­tion ac­tu­ally arises.

Per­haps most im­por­tantly, an ap­proval-di­rected agent can cor­rect flaws in its own de­sign, and will search for flaws if we want it to. It can change its own de­ci­sion-mak­ing pro­ce­dure, its own rea­son­ing pro­cess, and its own over­seer.

Fail gracefully

Approval-direction seems to “fail gracefully”: if we slightly mess up the specification, the approval-directed agent probably won’t be actively malicious. For example, suppose that Hugh was feeling extremely apathetic and so evaluated proposed actions only superficially. The resulting agent would not aggressively pursue a flawed realization of Hugh’s values; it would just behave lackadaisically. The mistake would be quickly noticed, unless Hugh deliberately approved of actions that concealed the mistake.

This looks like an im­prove­ment over mis­spec­i­fy­ing goals, which leads to sys­tems that are ac­tively op­posed to their users. Such sys­tems are mo­ti­vated to con­ceal pos­si­ble prob­lems and to be­have mal­i­ciously.

The same prin­ci­ple some­times ap­plies if you define the right over­seer but the agent rea­sons in­cor­rectly about it, if you mis­spec­ify the en­tire rat­ing pro­cess, or if your sys­tem doesn’t work quite like you ex­pect. Any of these mis­takes could be se­ri­ous for a goal-di­rected agent, but are prob­a­bly han­dled grace­fully by an ap­proval-di­rected agent.

Similarly, if Arthur is smarter than Hugh ex­pects, the only prob­lem is that Arthur won’t be able to use all of his in­tel­li­gence to de­vise ex­cel­lent plans. This is a se­ri­ous prob­lem, but it can be fixed by trial and er­ror—rather than lead­ing to sur­pris­ing failure modes.

Is it plau­si­ble?

I’ve already men­tioned the prac­ti­cal de­mand for goal-di­rected be­hav­ior and why I think that ap­proval-di­rected be­hav­ior satis­fies that de­mand. There are other rea­sons to think that agents might be goal-di­rected. Th­ese are all vari­a­tions on the same theme, so I apol­o­gize if my re­sponses be­come repet­i­tive.

In­ter­nal de­ci­sion-making

We as­sumed that Arthur can pre­dict what ac­tions Hugh will rate highly. But in or­der to make these pre­dic­tions, Arthur might use goal-di­rected be­hav­ior. For ex­am­ple, Arthur might perform a calcu­la­tion be­cause he be­lieves it will help him pre­dict what ac­tions Hugh will rate highly. Our ap­par­ently ap­proval-di­rected de­ci­sion-maker may have goals af­ter all, on the in­side. Can we avoid this?

I think so: Arthur’s in­ter­nal de­ci­sions could also be ap­proval-di­rected. Rather than perform­ing a calcu­la­tion be­cause it will help make a good pre­dic­tion, Arthur can perform that calcu­la­tion be­cause Hugh would rate this de­ci­sion highly. If Hugh is co­her­ent, then tak­ing in­di­vi­d­ual steps that Hugh rates highly leads to over­all be­hav­ior that Hugh would ap­prove of, just like tak­ing in­di­vi­d­ual steps that max­i­mize X leads to be­hav­ior that max­i­mizes X.

In fact the result may be more desirable, from Hugh’s perspective, than maximizing Hugh’s approval. For example, Hugh might incorrectly rate some actions highly, because he doesn’t understand them. An agent maximizing Hugh’s approval might find those actions and take them. But if the agent was internally approval-directed, then it wouldn’t try to exploit errors in Hugh’s ratings. Actions that lead to reported approval but not real approval don’t lead to approval for approved reasons.

Tur­tles all the way down?

Ap­proval-di­rec­tion stops mak­ing sense for low-level de­ci­sions. A pro­gram moves data from reg­ister A into reg­ister B be­cause that’s what the next in­struc­tion says, not be­cause that’s what Hugh would ap­prove of. After all, de­cid­ing whether Hugh would ap­prove it­self re­quires mov­ing data from one reg­ister to an­other, and we would be left with an in­finite regress.

The same thing is true for goal-di­rected be­hav­ior. Low-level ac­tions are taken be­cause the pro­gram­mer chose them. The pro­gram­mer may have cho­sen them be­cause she thought they would help the sys­tem achieve its goal, but the ac­tions them­selves are performed be­cause that’s what’s in the code, not be­cause of an ex­plicit be­lief that they will lead to the goal. Similarly, ac­tions might be performed be­cause a sim­ple heuris­tic sug­gests they will con­tribute to the goal — the heuris­tic was cho­sen or learned be­cause it was ex­pected to be use­ful for the goal, but the ac­tion is mo­ti­vated by the heuris­tic. Tak­ing the ac­tion doesn’t in­volve think­ing about the heuris­tic, just fol­low­ing it.

Similarly, an approval-directed agent might perform an action because it’s the next instruction in the program, or because it’s recommended by a simple heuristic. The program or heuristic might have been chosen to result in approved actions, but taking the action doesn’t involve reasoning about approval. The aggregate effect of using and refining such heuristics is to effectively do what the user approves of.

In many cases, per­haps a ma­jor­ity, the heuris­tics for goal-di­rected and ap­proval-di­rected be­hav­ior will co­in­cide. To an­swer “what do I want this func­tion to do next?” I very of­ten ask “what do I want the end re­sult to be?” In these cases the differ­ence is in how we think about the be­hav­ior of the over­all sys­tem, and what in­var­i­ants we try to main­tain as we de­sign it.

Rel­a­tive difficulty?

Ap­proval-di­rected sub­sys­tems might be harder to build than goal-di­rected sub­sys­tems. For ex­am­ple, there is much more data of the form “X leads to Y” than of the form “the user ap­proves of X.” This is a typ­i­cal AI prob­lem, though, and can be ap­proached us­ing typ­i­cal tech­niques.

Ap­proval-di­rected sub­sys­tems might also be eas­ier to build, and I think this is the case to­day. For ex­am­ple, I re­cently wrote a func­tion to de­cide which of two meth­ods to use for the next step of an op­ti­miza­tion. Right now it uses a sim­ple heuris­tic with mediocre perfor­mance. But I could also have la­beled some ex­am­ples as “use method A” or “use method B,” and trained a model to pre­dict what I would say. This model could then be used to de­cide when to use A, when to use B, and when to ask me for more train­ing data.
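To make the labeling idea concrete, here is a minimal sketch in Python. It is not the function described above: the feature vectors, the labels, and the choice of a k-nearest-neighbour vote are all assumptions made for illustration, and any model that predicts the label and reports its uncertainty would serve the same role.

```python
from collections import Counter
from math import dist


def choose_method(features, labeled_examples, k=3, min_agreement=1.0):
    """Predict the label ("A" or "B") the labeler would give, or "ask".

    `labeled_examples` is a list of (feature_vector, label) pairs supplied
    by the user. The function returns "ask" when there are fewer than k
    examples, or when the k nearest examples do not agree strongly enough.
    """
    if len(labeled_examples) < k:
        return "ask"
    # Take the k labeled examples closest to the current situation.
    nearest = sorted(labeled_examples, key=lambda ex: dist(ex[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    label, count = votes.most_common(1)[0]
    if count / k < min_agreement:
        return "ask"  # too uncertain: request more training data
    return label


# Hypothetical labeled situations, each described by two made-up features.
examples = [
    ((0.10, 0.20), "A"), ((0.20, 0.10), "A"), ((0.15, 0.15), "A"),
    ((0.90, 0.80), "B"), ((0.80, 0.90), "B"), ((0.85, 0.85), "B"),
]
near_a = choose_method((0.1, 0.1), examples)
near_b = choose_method((0.9, 0.9), examples)
unsure = choose_method((0.5, 0.5), examples)
```

The "ask" branch is what makes this approval-directed rather than a fixed heuristic: when the model cannot confidently predict what the user would say, it defers to the user instead of guessing.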

Reflec­tive stability

Ra­tional goal-di­rected be­hav­ior is re­flec­tively sta­ble: if you want X, you gen­er­ally want to con­tinue want­ing X. Can ap­proval-di­rected be­hav­ior have the same prop­erty?

Ap­proval-di­rected sys­tems in­herit re­flec­tive sta­bil­ity (or in­sta­bil­ity) from their over­seers. Hugh can de­ter­mine whether Arthur “wants” to re­main ap­proval-di­rected, by ap­prov­ing or dis­ap­prov­ing of ac­tions that would change Arthur’s de­ci­sion-mak­ing pro­cess.

Goal-di­rected agents want to be wiser and know more, though their goals are sta­ble. Ap­proval-di­rected agents also want to be wiser and know more, but they also want their over­seers to be wiser and know more. The over­seer is not sta­ble, but the over­seer’s val­ues are. This is a fea­ture, not a bug.

Similarly, an agent com­posed of ap­proval-di­rected sub­sys­tems over­seen by Hugh is not the same as an ap­proval-di­rected agent over­seen by Hugh. For ex­am­ple, the com­pos­ite may make de­ci­sions too sub­tle for Hugh to un­der­stand. Again, this is a fea­ture, not a bug.

Black box search

(Note: I no longer agree with the con­clu­sions of this sec­tion. I now feel that ap­proval-di­rected agents can prob­a­bly be con­structed out of pow­er­ful black-box search (or stochas­tic gra­di­ent de­scent); my main pri­or­ity is now ei­ther han­dling this set­ting or else un­der­stand­ing ex­actly what the ob­struc­tion is. On­go­ing work in this di­rec­tion is col­lected at ai-con­trol, and will hope­fully be pub­lished in a clear for­mat by the end of 2016.)

Some approaches to AI probably can’t yield approval-directed agents. For example, we could perform a search which treats possible agents as black boxes and measures their behavior for signs of intelligence. Such a search could (eventually) find a human-level intelligence, but would give us very crude control over how that intelligence was applied. We could get some kind of goal-directed behavior by selecting for it, but selecting for approval-directed behavior would be difficult:

  1. The paucity of data on ap­proval is a huge prob­lem in this set­ting. (Note: semi-su­per­vised re­in­force­ment learn­ing is an ap­proach to this prob­lem.)

  2. You have no con­trol over the in­ter­nal be­hav­ior of the agent, which you would ex­pect to be op­ti­mized for pur­su­ing a par­tic­u­lar goal: max­i­miz­ing what­ever mea­sure of “ap­proval” that you used to guide your search. (Note: I no longer en­dorse this ar­gu­ment as writ­ten; re­ward en­g­ineer­ing is a re­sponse to the sub­stance of this con­cern.)

  3. Agents who max­i­mized your re­ported ap­proval in test cases need not do so in gen­eral, any more than hu­mans are re­li­able re­pro­duc­tive-fit­ness-max­i­miz­ers. (Note: red team­ing is an ap­proach to this prob­lem.)

But [1] and es­pe­cially [3] are also prob­lems when de­sign­ing a goal-di­rected agent with agree­able goals, or in­deed any par­tic­u­lar goals at all. Though ap­proval-di­rec­tion can’t deal with these prob­lems, they aren’t new prob­lems.

Such a black-box search—with lit­tle in­sight into the in­ter­nal struc­ture of the agents—seems wor­ry­ing no mat­ter how we ap­proach AI safety. For­tu­nately, it also seems un­likely (though not out of the ques­tion).

A similar search is more likely to be used to pro­duce in­ter­nal com­po­nents of a larger sys­tem (for ex­am­ple, you might train a neu­ral net­work to iden­tify ob­jects, as a com­po­nent of a sys­tem for nav­i­gat­ing an un­known en­vi­ron­ment). This pre­sents similar challenges, con­cern­ing ro­bust­ness and un­in­tended be­hav­iors, whether we are de­sign­ing a goal-di­rected or ap­proval-di­rected agent.

This es­say was origi­nally posted here on 1st De­cem­ber 2014. The sec­ond half of it can be found in the next post in this se­quence.

To­mor­row’s AI Align­ment Fo­rum se­quences post will be ‘Ap­proval-di­rected agents: “im­ple­men­ta­tion” de­tails’, by Paul Chris­ti­ano.