Coherence arguments do not imply goal-directed behavior

One of the most pleas­ing things about prob­a­bil­ity and ex­pected util­ity the­ory is that there are many co­her­ence ar­gu­ments that sug­gest that these are the “cor­rect” ways to rea­son. If you de­vi­ate from what the the­ory pre­scribes, then you must be ex­e­cut­ing a dom­i­nated strat­egy. There must be some other strat­egy that never does any worse than your strat­egy, but does strictly bet­ter than your strat­egy with cer­tainty in at least one situ­a­tion. There’s a good ex­pla­na­tion of these ar­gu­ments here.

We shouldn’t ex­pect mere hu­mans to be able to no­tice any failures of co­her­ence in a su­per­in­tel­li­gent agent, since if we could no­tice these failures, so could the agent. So we should ex­pect that pow­er­ful agents ap­pear co­her­ent to us. (Note that it is pos­si­ble that the agent doesn’t fix the failures be­cause it would not be worth it—in this case, the ar­gu­ment says that we will not be able to no­tice any ex­ploitable failures.)

Taken together, these arguments suggest that we should model an agent much smarter than us as an expected utility (EU) maximizer. And many people agree that EU maximizers are dangerous. So does this mean we’re doomed? I don’t think so: it seems to me that the problems with EU maximizers that we’ve identified are actually problems with goal-directed behavior or explicit reward maximizers. The coherence theorems say nothing about whether an AI system must look like one of these categories. This suggests that we could try building an AI system that can be modeled as an EU maximizer, yet doesn’t fall into one of these two categories, and so doesn’t have all of the problems that we worry about.

Note that there are two differ­ent fla­vors of ar­gu­ments that the AI sys­tems we build will be goal-di­rected agents (which are dan­ger­ous if the goal is even slightly wrong):

  • Sim­ply know­ing that an agent is in­tel­li­gent lets us in­fer that it is goal-di­rected.

  • Hu­mans are par­tic­u­larly likely to build goal-di­rected agents.

I will only be ar­gu­ing against the first claim in this post, and will talk about the sec­ond claim in the next post.

All be­hav­ior can be ra­tio­nal­ized as EU maximization

Sup­pose we have ac­cess to the en­tire policy of an agent, that is, given any uni­verse-his­tory, we know what ac­tion the agent will take. Can we tell whether the agent is an EU max­i­mizer?

Actually, no matter what the policy is, we can view the agent as an EU maximizer. The construction is simple: the agent can be thought of as optimizing the utility function U, where U(h, a) = 1 if the policy would take action a given history h, else 0. Here I’m assuming that U is defined over histories that are composed of states/observations and actions. The actual policy gets 1 utility at every timestep; any policy that acts differently gets less than this, so the given policy perfectly maximizes this utility function. This construction has been given before, e.g. at the bottom of page 6 of this paper. (I think I’ve seen it before too, but I can’t remember where.)
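A minimal sketch of this construction, under simplifying assumptions: histories are reduced to the tuple of past actions, the world is deterministic, and the episode lasts three timesteps. The action names, the horizon, and every function name below are hypothetical choices made for the example. The code builds the indicator utility for an arbitrary policy and checks by brute force that no deterministic policy scores higher.

```python
from itertools import product

ACTIONS = ["left", "right"]  # hypothetical action set
HORIZON = 3                  # hypothetical episode length


def all_histories(max_len):
    """Every tuple of past actions of length 0..max_len-1 (toy stand-in for histories)."""
    return [h for n in range(max_len) for h in product(ACTIONS, repeat=n)]


def make_indicator_utility(policy):
    """U(h, a) = 1 if `policy` would take action a after history h, else 0."""
    def utility(history, action):
        return 1 if policy[history] == action else 0
    return utility


def total_utility(utility, policy):
    """Run `policy` from the empty history, summing U(h, a) over the timesteps."""
    history, total = (), 0
    for _ in range(HORIZON):
        action = policy[history]
        total += utility(history, action)
        history = history + (action,)
    return total


HISTORIES = all_histories(HORIZON)

# An arbitrary, deliberately pointless policy: alternate based on history length.
target = {h: ACTIONS[len(h) % 2] for h in HISTORIES}
U = make_indicator_utility(target)

# Score every deterministic policy (one action per history) under U.
scores = []
for choice in product(ACTIONS, repeat=len(HISTORIES)):
    candidate = dict(zip(HISTORIES, choice))
    scores.append(total_utility(U, candidate))

best = max(scores)
target_score = total_utility(U, target)
```

In this toy setting the target policy collects 1 utility at each of the three timesteps, which is also the maximum over all policies. The only policies that tie with it are ones that differ from it solely on histories it never reaches, so they produce exactly the same behavior.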

But wouldn’t this sug­gest that the VNM the­o­rem has no con­tent? Well, we as­sumed that we were look­ing at the policy of the agent, which led to a uni­verse-his­tory de­ter­minis­ti­cally. We didn’t have ac­cess to any prob­a­bil­ities. Given a par­tic­u­lar ac­tion, we knew ex­actly what the next state would be. Most of the ax­ioms of the VNM the­o­rem make refer­ence to lot­ter­ies and prob­a­bil­ities—if the world is de­ter­minis­tic, then the ax­ioms sim­ply say that the agent must have tran­si­tive prefer­ences over out­comes. Given that we can only ob­serve the agent choose one his­tory over an­other, we can triv­ially con­struct a tran­si­tive prefer­ence or­der­ing by say­ing that the cho­sen his­tory is higher in the prefer­ence or­der­ing than the one that was not cho­sen. This is es­sen­tially the con­struc­tion we gave above.

What then is the purpose of the VNM theorem? It tells you how to behave if you have probabilistic beliefs about the world, as well as a complete and consistent preference ordering over outcomes. This turns out to be not very interesting when “outcomes” refers to “universe-histories”. It can be more interesting when “outcomes” refers to world states instead (that is, snapshots of what the world looks like at a particular time), but utility functions over states/snapshots can’t capture everything we’re interested in, and there’s no reason to assume that an AI system will have a utility function over states/snapshots.

There are no co­her­ence ar­gu­ments that say you must have goal-di­rected behavior

Not all be­hav­ior can be thought of as goal-di­rected (pri­mar­ily be­cause I al­lowed the cat­e­gory to be defined by fuzzy in­tu­itions rather than some­thing more for­mal). Con­sider the fol­low­ing ex­am­ples:

  • A robot that constantly twitches.

  • An agent that always chooses the action that starts with the letter “A”.

  • An agent that follows the policy <policy>, where for every history the corresponding action in <policy> is generated randomly.

Th­ese are not goal-di­rected by my “defi­ni­tion”. How­ever, they can all be mod­eled as ex­pected util­ity max­i­miz­ers, and there isn’t any par­tic­u­lar way that you can ex­ploit any of these agents. In­deed, it seems hard to model the twitch­ing robot or the policy-fol­low­ing agent as hav­ing any prefer­ences at all, so the no­tion of “ex­ploit­ing” them doesn’t make much sense.
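To make the three examples concrete, here is a rough sketch that writes each one as a policy and rationalizes it with its own indicator utility (1 when the action matches what the policy would do, 0 otherwise). The action names, the five-step rollout, and the use of a seeded pseudo-random generator to stand in for "generated randomly" are all assumptions made for illustration.

```python
import random

ACTIONS = ["Advance", "Brake", "Climb"]  # hypothetical action names


def twitch_policy(history):
    """Twitching robot: cycles through the actions no matter what has happened."""
    return ACTIONS[len(history) % len(ACTIONS)]


def letter_a_policy(history):
    """Always picks the action whose name starts with 'A'."""
    return next(a for a in ACTIONS if a.startswith("A"))


def make_random_table_policy(seed):
    """Policy whose action for each history is fixed by a seeded pseudo-random draw."""
    def policy(history):
        rng = random.Random(repr((seed, history)))  # same history -> same action
        return rng.choice(ACTIONS)
    return policy


def indicator_utility(policy):
    """Utility that pays 1 exactly when the action is the one `policy` would take."""
    return lambda history, action: 1 if policy(history) == action else 0


def rollout_utility(utility, policy, steps=5):
    """Total utility `policy` collects over `steps` timesteps, starting from an empty history."""
    history, total = (), 0
    for _ in range(steps):
        action = policy(history)
        total += utility(history, action)
        history += (action,)
    return total


agents = {
    "twitch": twitch_policy,
    "letter_a": letter_a_policy,
    "random_table": make_random_table_policy(seed=0),
}

# Each agent evaluated under the utility function built from its own policy.
results = {name: rollout_utility(indicator_utility(p), p) for name, p in agents.items()}
```

Each agent scores the maximum of 5 under its own utility function, even though nothing in the policy definitions refers to a goal or to the consequences of actions.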

You could argue that none of these agents is intelligent, and that we’re only concerned with superintelligent AI systems. I don’t see why these agents could not in principle be intelligent: perhaps the agent knows how the world would evolve, and how to intervene on the world to achieve different outcomes, but it does not act on these beliefs. Perhaps if we peered into the inner workings of the agent, we could find some part of it that allows us to predict the future very accurately, but it turns out that these inner workings do not affect the chosen action at all. Such an agent is in principle possible, and it seems like it is intelligent.

(If not, it seems as though you are defining intelligence to include being goal-driven, in which case I would frame my next post as arguing that we may not want to build superintelligent AI, because there are other things we could build that are as useful without the corresponding risks.)

You could ar­gue that while this is pos­si­ble in prin­ci­ple, no one would ever build such an agent. I whole­heart­edly agree, but note that this is now an ar­gu­ment based on par­tic­u­lar em­piri­cal facts about hu­mans (or per­haps agent-build­ing pro­cesses more gen­er­ally). I’ll talk about those in the next post; here I am sim­ply ar­gu­ing that merely know­ing that an agent is in­tel­li­gent, with no ad­di­tional em­piri­cal facts about the world, does not let you in­fer that it has goals.

As a corol­lary, since all be­hav­ior can be mod­eled as max­i­miz­ing ex­pected util­ity, but not all be­hav­ior is goal-di­rected, it is not pos­si­ble to con­clude that an agent is goal-driven if you only know that it can be mod­eled as max­i­miz­ing some ex­pected util­ity. How­ever, if you know that an agent is max­i­miz­ing the ex­pec­ta­tion of an ex­plic­itly rep­re­sented util­ity func­tion, I would ex­pect that to lead to goal-driven be­hav­ior most of the time, since the util­ity func­tion must be rel­a­tively sim­ple if it is ex­plic­itly rep­re­sented, and sim­ple util­ity func­tions seem par­tic­u­larly likely to lead to goal-di­rected be­hav­ior.

There are no co­her­ence ar­gu­ments that say you must have preferences

This sec­tion is an­other way to view the ar­gu­ment in the pre­vi­ous sec­tion, with “goal-di­rected be­hav­ior” now be­ing op­er­a­tional­ized as “prefer­ences”; it is not say­ing any­thing new.

Above, I said that the VNM the­o­rem as­sumes both that you use prob­a­bil­ities and that you have a prefer­ence or­der­ing over out­comes. There are lots of good rea­sons to as­sume that a good rea­soner will use prob­a­bil­ity the­ory. How­ever, there’s not much rea­son to as­sume that there is a prefer­ence or­der­ing over out­comes. The twitch­ing robot, “A”-fol­low­ing agent, and ran­dom policy agent from the last sec­tion all seem like they don’t have prefer­ences (in the English sense, not the math sense).

Perhaps you could define a preference ordering by saying “if I gave the agent lots of time to think, how would it choose between these two histories?” However, you could apply this definition to anything, including e.g. a thermostat or a rock. You might argue that a thermostat or rock can’t “choose” between two histories; but then it’s unclear how to define how an AI “chooses” between two histories without that definition also applying to thermostats and rocks.

Of course, you could always define a prefer­ence or­der­ing based on the AI’s ob­served be­hav­ior, but then you’re back in the set­ting of the first sec­tion, where all ob­served be­hav­ior can be mod­eled as max­i­miz­ing an ex­pected util­ity func­tion and so say­ing “the AI is an ex­pected util­ity max­i­mizer” is vac­u­ous.

Con­ver­gent in­stru­men­tal sub­goals are about goal-di­rected behavior

One of the clas­sic rea­sons to worry about ex­pected util­ity max­i­miz­ers is the pres­ence of con­ver­gent in­stru­men­tal sub­goals, de­tailed in Omo­hun­dro’s pa­per The Ba­sic AI Drives. The pa­per it­self is clearly talk­ing about goal-di­rected AI sys­tems:

To say that a sys­tem of any de­sign is an “ar­tifi­cial in­tel­li­gence”, we mean that it has goals which it tries to ac­com­plish by act­ing in the world.

It then ar­gues (among other things) that such AI sys­tems will want to “be ra­tio­nal” and so will dis­till their goals into util­ity func­tions, which they then max­i­mize. And once they have util­ity func­tions, they will pro­tect them from mod­ifi­ca­tion.

Note that this starts from the as­sump­tion of goal-di­rected be­hav­ior and de­rives that the AI will be an EU max­i­mizer along with the other con­ver­gent in­stru­men­tal sub­goals. The co­her­ence ar­gu­ments all im­ply that AIs will be EU max­i­miz­ers for some (pos­si­bly de­gen­er­ate) util­ity func­tion; they don’t im­ply that the AI must be goal-di­rected.

Good­hart’s Law is about goal-di­rected behavior

A com­mon ar­gu­ment for wor­ry­ing about AI risk is that we know that a su­per­in­tel­li­gent AI sys­tem will look to us like an EU max­i­mizer, and if it max­i­mizes a util­ity func­tion that is even slightly wrong we could get catas­trophic out­comes.

By now you probably know my first response: any behavior can be modeled as EU maximization, and so this argument proves too much, suggesting that any behavior causes catastrophic outcomes. But let’s set that aside for now.

The sec­ond part of the claim comes from ar­gu­ments like Value is Frag­ile and Good­hart’s Law. How­ever, if we con­sider util­ity func­tions that as­sign value 1 to some his­to­ries and 0 to oth­ers, then if you ac­ci­den­tally as­sign a his­tory where I need­lessly stub my toe a 1 in­stead of a 0, that’s a slightly wrong util­ity func­tion, but it isn’t go­ing to lead to catas­trophic out­comes.
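A toy version of the toe-stubbing example, with made-up history labels and values, shows how small the effect of this kind of error is. The sketch lists the histories a maximizer might choose under the intended utility function and under a copy that mislabels one history.

```python
# Hypothetical toy histories; the labels and values are made up for illustration.
intended_utility = {
    "ordinary day": 1,
    "ordinary day, but I needlessly stub my toe": 0,
    "catastrophe": 0,
}

# A "slightly wrong" copy: one history is mislabeled 1 instead of 0.
slightly_wrong_utility = dict(intended_utility)
slightly_wrong_utility["ordinary day, but I needlessly stub my toe"] = 1


def optimal_histories(utility):
    """Return every history that a maximizer of `utility` might end up choosing."""
    best = max(utility.values())
    return {history for history, value in utility.items() if value == best}


# Histories that become acceptable to the maximizer only because of the error.
newly_acceptable = (
    optimal_histories(slightly_wrong_utility) - optimal_histories(intended_utility)
)
```

The only history the error adds to the set of optimal choices is the toe-stubbing one. The catastrophic history stays suboptimal under both utility functions.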

The worry about utility functions that are slightly wrong holds water when the utility functions are wrong about some high-level concept, like whether humans care about their experiences reflecting reality. This is a very rarefied, particular class of utility functions, all of which lead to goal-directed or agentic behavior. As a result, I think that the argument is better stated as “if you have a slightly incorrect goal, you can get catastrophic outcomes”. And there aren’t any coherence arguments that say that agents must have goals.

Wire­head­ing is about ex­plicit re­ward maximization

There are a few pa­pers that talk about the prob­lems that arise with a very pow­er­ful sys­tem with a re­ward func­tion or util­ity func­tion, most no­tably wire­head­ing. The ar­gu­ment that AIXI will seize con­trol of its re­ward chan­nel falls into this cat­e­gory. In these cases, typ­i­cally the AI sys­tem is con­sid­er­ing mak­ing a change to the sys­tem by which it eval­u­ates good­ness of ac­tions, and the good­ness of the change is eval­u­ated by the sys­tem af­ter the change. Daniel Dewey ar­gues in Learn­ing What to Value that if the change is eval­u­ated by the sys­tem be­fore the change, then these prob­lems go away.
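The difference between the two evaluation orders can be sketched in a few lines. This is only a toy illustration, not Dewey’s formalism: the evaluators, the option names, and the numbers are all hypothetical.

```python
def task_evaluator(outcome):
    """Hypothetical original evaluator: one unit of value per completed task."""
    return outcome["tasks_done"]


def wireheaded_evaluator(outcome):
    """Hypothetical modified evaluator: reports a huge value no matter what happened."""
    return 10**9


# Predicted outcome of each option, and the evaluator the agent would have afterwards.
OPTIONS = {
    "keep_evaluator": {"outcome": {"tasks_done": 5}, "evaluator_after": task_evaluator},
    "self_modify": {"outcome": {"tasks_done": 0}, "evaluator_after": wireheaded_evaluator},
}


def choose_judged_after_change(options):
    """Score each option with the evaluator that would exist after taking it."""
    return max(
        options,
        key=lambda name: options[name]["evaluator_after"](options[name]["outcome"]),
    )


def choose_judged_before_change(options, current_evaluator):
    """Score every option with the evaluator the agent has right now."""
    return max(options, key=lambda name: current_evaluator(options[name]["outcome"]))


after_choice = choose_judged_after_change(OPTIONS)
before_choice = choose_judged_before_change(OPTIONS, task_evaluator)
```

When each option is judged by the evaluator it would leave in place, self-modification wins because the modified evaluator rates everything highly. When both options are judged by the current evaluator, self-modification loses because it completes no tasks.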

I think of these as prob­lems with re­ward max­i­miza­tion, be­cause typ­i­cally when you phrase the prob­lem as max­i­miz­ing re­ward, you are max­i­miz­ing the sum of re­wards ob­tained in all timesteps, no mat­ter how those re­wards are ob­tained (i.e. even if you self-mod­ify to make the re­ward max­i­mal). It doesn’t seem like AI sys­tems have to be built this way (though ad­mit­tedly I do not know how to build AI sys­tems that re­li­ably avoid these prob­lems).


Summary

In this post I’ve argued that many of the problems we typically associate with expected utility maximizers are actually problems with goal-directed agents or with explicit reward maximization. Coherence arguments only imply that a superintelligent AI system will look like an expected utility maximizer, but this is actually a very weak constraint, and there are many potential utility functions for which the resulting AI system is neither goal-directed nor explicit-reward-maximizing. This suggests that we could try to build AI systems of this type, in order to sidestep the problems that we have identified so far.

To­mor­row will have a break from AI Align­ment Fo­rum se­quences, and the post will in­stead be Is­sue #35 of the Align­ment Newslet­ter, by Ro­hin Shah.

The next post in this se­quence will be ‘Will hu­mans build goal-di­rected agents?’ by Ro­hin Shah, on Wed­nes­day 5th De­cem­ber.