AI safety without goal-directed behavior

When I first entered the field of AI safety, I thought of the problem as figuring out how to get the AI to have the “right” utility function. This led me to work on the problem of inferring values from demonstrators with unknown biases, despite the impossibility results in the area. I am now less excited about that avenue because I am pessimistic about the prospects of ambitious value learning (for the reasons given in the first part of this sequence).

I think this happened because the writing on AI risk that I encountered made the pervasive assumption that any superintelligent AI agent must be maximizing some utility function over the long-term future, leading to goal-directed behavior and convergent instrumental subgoals. It’s often not stated as an assumption; rather, inferences are made assuming you have the background model that the AI is goal-directed. This makes the assumption particularly hard to question, since you don’t realize that the assumption is even there.

Another reason that this assumption is so easily accepted is that we have a long history of modeling rational agents as expected utility maximizers, and for good reason: there are many coherence arguments that say that, given that you have preferences/goals, if you aren’t using probability theory and expected utility theory, then you can be taken advantage of. It’s easy to make the inference that a superintelligent agent must be rational, and therefore it must be an expected utility maximizer.
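To make the “taken advantage of” point concrete, here is a toy money-pump (a hypothetical illustration of the coherence arguments, not an example from the original): an agent with cyclic preferences will pay a small fee for every trade up to an item it prefers, so a trading partner can cycle it back to its starting item at a pure loss. An expected utility maximizer, whose preferences are transitive, cannot be exploited this way.

```python
# Toy money-pump: an agent with cyclic preferences A > B > C > A
# pays a small fee each time it trades up to an item it prefers.
# The preference relation and fee are hypothetical illustrations.

prefers = {("A", "B"), ("B", "C"), ("C", "A")}  # (better, worse) pairs: cyclic!

def accepts_trade(current, offered):
    """The agent trades whenever it strictly prefers the offered item."""
    return (offered, current) in prefers

money = 0.0
item = "A"
FEE = 1.0

# Repeatedly offer the item the agent prefers to its current one.
for offered in ["C", "B", "A"]:
    if accepts_trade(item, offered):
        item = offered
        money -= FEE  # the agent pays a fee for each trade it accepts

# After one full cycle the agent holds the same item but is strictly poorer.
print(item, money)  # → A -3.0
```

With transitive preferences the loop would stop after at most one trade, so no sequence of fee-charging offers could return the agent to where it started at a loss.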

Because this assumption was so embedded in how I thought about the problem, I had trouble imagining how else to even consider the problem. I would guess this is true for at least some other people, so I want to summarize the counterargument, and list a few implications, in the hope that this makes the issue clearer.

Why goal-directed behavior may not be required

The main argument of this chapter is that a superintelligent agent need not take actions in pursuit of some goal. It is possible to write algorithms that select actions without doing a search over the actions and rating their consequences according to an explicitly specified simple function. There is no coherence argument that says your agent must have preferences or goals; it is perfectly possible for the agent to take actions with no goal in mind, simply because it was programmed to do so, and this remains true even when the agent is intelligent.
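As a minimal illustration of the distinction (the toy domain and names here are hypothetical, not from the original), compare an agent that searches over actions and rates their consequences with an explicit utility function against one that simply executes a fixed policy. Both can pick the same action, but the second does no search and represents no goal at all:

```python
# Two ways to select an action for the same observation.
# The gridworld, names, and tables are hypothetical illustrations.

def goal_directed_act(state, actions, utility):
    """Search over actions, rating consequences by an explicit utility."""
    return max(actions, key=lambda a: utility(state, a))

def policy_act(state, policy_table):
    """No search and no utility function: just look up what the
    (programmed or learned) policy says to do in this state."""
    return policy_table[state]

actions = ["left", "right"]

# A goal-directed agent scores actions with an explicit function.
utility = lambda state, a: 1.0 if (state, a) == ("start", "right") else 0.0

# A non-goal-directed agent was simply built to act this way.
policy_table = {"start": "right"}

print(goal_directed_act("start", actions, utility))  # → right
print(policy_act("start", policy_table))             # → right
```

The behaviors coincide here, which is the point: from actions alone you cannot always tell whether there is a goal behind them, and nothing forces an agent to be built the first way.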

It seems quite likely that by default a superintelligent AI system would be goal-directed anyway, because of economic efficiency arguments. However, this is not set in stone, as it would be if coherence arguments implied goal-directed behavior. Given the negative results around goal-directed behavior, the natural path forward seems to be to search for alternatives that still allow us to get economic efficiency.


At a high level, I think that the main implication of this view is that we should be considering other models for future AI systems besides optimizing over the long term for a single goal or for a particular utility or reward function. Here are some other potential models:

  • Goal-conditioned policy with common sense: In this setting, humans can set goals for the AI system simply by asking it in natural language to do something, and the AI system sets out to do it. However, the AI also has “common sense”: it interprets our commands pragmatically rather than literally. It’s not going to prevent us from setting a new goal (which would stop it from achieving its current goal), because common sense tells it that we don’t want it to do that. One way to think about this is to consider an AI system that infers and follows human norms, which are probably much easier to infer than human values (most humans seem to infer norms very accurately).

  • Corrigible AI: I’ll defer to Paul Christiano’s explanation of corrigibility.

  • Comprehensive AI Services (CAIS): Maybe we could create lots of AI services that interact with each other to solve hard problems. Each individual service could be bounded and episodic, which immediately means that it is no longer optimizing over the long term (though it could still be goal-directed). Perhaps we have a long-term planner that is trained to produce good plans to achieve particular goals over the span of an hour, and a plan executor that takes in a plan and executes the next step of the plan over an hour, and leaves instructions for the next steps.
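The planner/executor split in the last model could be sketched as two bounded, episodic services, each of which does a fixed amount of work and then stops (all names and the toy plan below are hypothetical illustrations, not from CAIS itself):

```python
# Sketch of two bounded, episodic CAIS-style services: a planner that
# produces a short plan for a goal, and an executor that carries out
# only the next step, leaving the rest as instructions for the next
# episode. Names and the canned plan are hypothetical illustrations.

def planner_service(goal):
    """Episodic planner: returns a bounded plan, then its episode ends."""
    canned_plans = {"make tea": ["boil water", "steep leaves", "pour"]}
    return canned_plans.get(goal, [])

def executor_service(plan, log):
    """Episodic executor: performs one step, records it, and hands the
    remaining steps to the next episode rather than looping forever."""
    if not plan:
        return plan
    step, remaining = plan[0], plan[1:]
    log.append(f"did: {step}")
    return remaining  # instructions left for the next episode

log = []
remaining = planner_service("make tea")
while remaining:  # each iteration stands in for a fresh, bounded episode
    remaining = executor_service(remaining, log)

print(log)  # → ['did: boil water', 'did: steep leaves', 'did: pour']
```

The point of the structure is that no single service runs an unbounded optimization over the long-term future: each one does its bounded job and terminates, with coordination happening through the artifacts (plans and instructions) passed between episodes.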

There are versions of these scenarios which are compatible with the framework of an AI system optimizing for a single goal:

  • A goal-conditioned policy with common sense could be operationalized as optimizing for the goal of “following a human’s orders without doing anything that humans would reliably judge as crazy”.

  • MIRI’s version of corrigibility seems like it stays within this framework.

  • You could think of the services in CAIS as optimizing for the aggregate reward they get over all time, rather than just the reward they get during the current episode.

I do not want to use these versions of the scenarios, since they make it tempting to once again say “but if you get the goal even slightly wrong, then you’re in big trouble”. This would likely be true if we built an AI system that could maximize an arbitrary function and then tried to program in the utility function we care about, but that is not required. It seems possible to build systems in such a way that these properties are inherent in the way they reason, such that it’s not even coherent to ask what happens if we “get the utility function slightly wrong”.

Note that I’m not claiming that I know how to build such systems; I’m just claiming that we don’t know enough yet to reject the possibility that we could build such systems. Given how hard it seems to be to align systems that explicitly maximize a reward function, we should explore these other methods as well.

Once we let go of the idea of optimizing for a single goal, it becomes possible to think about other ways in which we could build AI systems, and there are more insights about how we could build an AI system that does what we intend instead of what we say. (In my case the process was reversed: I heard a lot of good insights that don’t fit in the framework of goal-directed optimization, and this eventually led me to let go of the assumption of goal-directed optimization.) We’ll explore some of these in the next chapter.

Tomorrow, there’ll be a break from AIAF sequences and the new post will be the Alignment Newsletter Issue #40, by Rohin Shah.

This sequence will continue with a pair of posts, ‘What is narrow value learning’ by Rohin Shah and ‘Ambitious vs. narrow value learning’ by Paul Christiano, on Wednesday 9th January.