Intuitions about goal-directed behavior

One broad ar­gu­ment for AI risk is the Misspeci­fied Goal ar­gu­ment:

The Misspeci­fied Goal Ar­gu­ment for AI Risk: Very in­tel­li­gent AI sys­tems will be able to make long-term plans in or­der to achieve their goals, and if their goals are even slightly mis­speci­fied then the AI sys­tem will be­come ad­ver­sar­ial and work against us.

My main goal in this post is to make con­cep­tual clar­ifi­ca­tions and sug­gest how they af­fect the Misspeci­fied Goal ar­gu­ment, with­out mak­ing any recom­men­da­tions about what we should ac­tu­ally do. Fu­ture posts will ar­gue more di­rectly for a par­tic­u­lar po­si­tion. As a re­sult, I will not be con­sid­er­ing other ar­gu­ments for fo­cus­ing on AI risk even though I find some of them more com­pel­ling.

I think of this as a con­cern about long-term goal-di­rected be­hav­ior. Un­for­tu­nately, it’s not clear how to cat­e­go­rize be­hav­ior as goal-di­rected vs. not. In­tu­itively, any agent that searches over ac­tions and chooses the one that best achieves some mea­sure of “good­ness” is goal-di­rected (though there are ex­cep­tions, such as the agent that se­lects ac­tions that be­gin with the let­ter “A”). How­ever, this is not a nec­es­sary con­di­tion: many hu­mans are goal-di­rected, but there is no goal baked into the brain that they are us­ing to choose ac­tions.
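The sufficient condition above can be sketched in a few lines of Python. This is only an illustration: the function name `choose_action`, the toy step-size setting, and both goodness measures are assumptions made up for this example, not anything from a real system.

```python
def choose_action(actions, goodness):
    """Search over the available actions and return the highest-scoring one."""
    return max(actions, key=goodness)


# Hypothetical toy setting: the agent stands at position 7 and wants to reach 10.
def closeness_to_target(step, position=7, target=10):
    return -abs(target - (position + step))


steps = [-1, 0, 1, 2, 3, 4]
print(choose_action(steps, closeness_to_target))  # → 3


# The same search procedure with a "goodness" measure that does not
# intuitively look like a goal: prefer actions whose name starts with "A".
def starts_with_a(name):
    return 1 if name.startswith("A") else 0


names = ["Advance", "Block", "Attack", "Retreat"]
print(choose_action(names, starts_with_a))  # → Advance
```

Both calls run the same search, which is why the search procedure alone does not settle whether the behavior counts as goal-directed: that depends on what the goodness measure is.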

This is re­lated to the con­cept of op­ti­miza­tion, though with in­tu­itions around op­ti­miza­tion we typ­i­cally as­sume that we know the agent’s prefer­ence or­der­ing, which I don’t want to as­sume here. (In fact, I don’t want to as­sume that the agent even has a prefer­ence or­der­ing.)

One po­ten­tial for­mal­iza­tion is to say that goal-di­rected be­hav­ior is any be­hav­ior that can be mod­el­led as max­i­miz­ing ex­pected util­ity for some util­ity func­tion; in the next post I will ar­gue that this does not prop­erly cap­ture the be­hav­iors we are wor­ried about. In this post I’ll give some in­tu­itions about what “goal-di­rected be­hav­ior” means, and how these in­tu­itions re­late to the Misspeci­fied Goal ar­gu­ment.
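One way to write this candidate formalization down, as a sketch under assumed notation: a policy \pi mapping histories h to actions a from a set \mathcal{A}, and a utility function U over outcomes o. None of these symbols appear in the original post.

```latex
\exists\, U : \quad
\pi(h) \in \arg\max_{a \in \mathcal{A}} \; \mathbb{E}\!\left[\, U(o) \mid h, a \,\right]
\quad \text{for every history } h
```

Note that U is existentially quantified: the condition asks only that some utility function rationalizes the behavior, not that we know which one.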

Gen­er­al­iza­tion to novel circumstances

Consider two possible agents for playing some game, say tic-tac-toe. The first agent looks at the state and the rules of the game, and uses the minimax algorithm to find the optimal move to play. The second agent has a giant lookup table that tells it what move to play given any state; suppose the table happens to contain the optimal move for every state. Intuitively, the first one is more “agentic” or “goal-driven”, while the second one is not. But both of these agents play the game in exactly the same way!

The difference is in how the two agents generalize to new situations. Let’s suppose that we suddenly change the rules of tic-tac-toe: perhaps now the win condition is reversed, so that anyone who gets three in a row loses. The minimax agent, which reads the rules, is still going to be optimal at this game, whereas the lookup-table agent will lose against any opponent with half a brain. The minimax agent looks like it is “trying to win”, while the lookup-table agent does not. (You could say that the lookup-table agent is “trying to take actions according to <policy>”, but this is a weird complicated goal so maybe it doesn’t count.)
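The contrast between the two agents can be made concrete with a small Python sketch. Everything here is an assumption made for illustration: the board encoding (a 9-character string), the function names, and the choice to fill the lookup table with the minimax agent's own moves under the original rules. The sketch checks a single position rather than a full game, so it illustrates the generalization gap without demonstrating the stronger claim about losing to any reasonable opponent.

```python
from functools import lru_cache

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8), (0, 3, 6),
         (1, 4, 7), (2, 5, 8), (0, 4, 8), (2, 4, 6)]
EMPTY = " " * 9  # the board is a 9-character string, read row by row


def line_owner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def to_move(board):
    return "X" if board.count("X") == board.count("O") else "O"


def legal(board):
    return [i for i, cell in enumerate(board) if cell == " "]


def place(board, i):
    return board[:i] + to_move(board) + board[i + 1:]


def winner(board, reversed_rules):
    """Winner under the given rules, or None if nobody has a line yet."""
    owner = line_owner(board)
    if owner is None or not reversed_rules:
        return owner
    return "O" if owner == "X" else "X"  # three in a row now loses


@lru_cache(maxsize=None)
def value(board, reversed_rules):
    """Minimax value of the position for X: +1 win, 0 draw, -1 loss."""
    w = winner(board, reversed_rules)
    if w is not None:
        return 1 if w == "X" else -1
    if " " not in board:
        return 0
    results = [value(place(board, i), reversed_rules) for i in legal(board)]
    return max(results) if to_move(board) == "X" else min(results)


def minimax_move(board, reversed_rules):
    """Agent 1: recomputes its move from whatever rules it is given."""
    sign = 1 if to_move(board) == "X" else -1
    return max(legal(board),
               key=lambda i: sign * value(place(board, i), reversed_rules))


def build_lookup_table():
    """Agent 2: a frozen table of Agent 1's moves under the original rules."""
    table = {}

    def visit(board):
        if board in table or line_owner(board) or " " not in board:
            return
        table[board] = minimax_move(board, False)
        for i in legal(board):
            visit(place(board, i))

    visit(EMPTY)
    return table


TABLE = build_lookup_table()

# X holds cells 0 and 1, O holds cells 3 and 4, and it is X's turn.
state = "XX" + " " + "OO" + " " * 4
table_choice = TABLE[state]
minimax_choice = minimax_move(state, True)
print("lookup table plays", table_choice, "-> winner under reversed rules:",
      winner(place(state, table_choice), True))
print("minimax plays", minimax_choice)
```

Under the original rules the two agents agree on every reachable position by construction. In the position above, the table still completes the top row (cell 2), which now hands the game to O, while the minimax agent, given the reversed rules, picks a different cell.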

In general, when we say that an agent is pursuing some goal, this is meant to allow us to predict how the agent will generalize to some novel circumstance. This sort of reasoning is critical for the Misspecified Goal argument for AI risk. For example, we worry that an AI agent will prevent us from turning it off, because that would prevent it from achieving its goal: “You can’t fetch the coffee if you’re dead.” This is a prediction about what an AI agent would do in the novel circumstance where a human is trying to turn the agent off.

This sug­gests a way to char­ac­ter­ize these sorts of goal-di­rected agents: there is some goal such that the agent’s be­hav­ior in new cir­cum­stances can be pre­dicted by figur­ing out which be­hav­ior best achieves the goal. There’s a lot of com­plex­ity in the space of goals we con­sider: some­thing like “hu­man well-be­ing” should count, but “the par­tic­u­lar policy <x>” and “pick ac­tions that start with the let­ter A” should not. When I use the word goal I mean to in­clude only the first kind, even though I cur­rently don’t know the­o­ret­i­cally how to dis­t­in­guish be­tween the var­i­ous cases.

Note that this is in stark con­trast to ex­ist­ing AI sys­tems, which are par­tic­u­larly bad at gen­er­al­iz­ing to new situ­a­tions.

(Image omitted.) Caption: Honestly, I’m surprised it’s only 90%. [1]


Empowerment

We could also look at whether or not the agent acquires more power and resources. It seems likely that an agent that is optimizing for some goal over the long term would want more power and resources in order to more easily achieve that goal. In addition, the agent would probably try to improve its own algorithms in order to become more intelligent.

This feels like a con­se­quence of goal-di­rected be­hav­ior, and not its defin­ing char­ac­ter­is­tic, be­cause it is about be­ing able to achieve a wide va­ri­ety of goals, in­stead of a par­tic­u­lar one. Nonethe­less, it seems cru­cial to the broad ar­gu­ment for AI risk pre­sented above, since an AI sys­tem will prob­a­bly need to first ac­cu­mu­late power, re­sources, in­tel­li­gence, etc. in or­der to cause catas­trophic out­comes.

I find this concept most useful when thinking about the problem of inner optimizers, where in the course of optimization through a rich space you stumble across a member of the space that is itself doing optimization, though for a related but still misspecified metric. Since the inner optimizer is being “controlled” by the outer optimization process, it is probably not going to cause major harm unless it is able to “take over” the outer optimization process, which sounds a lot like accumulating power. (This discussion is extremely imprecise and vague; see the upcoming MIRI paper on “The Inner Alignment Problem” for a more thorough discussion.)

Our un­der­stand­ing of the behavior

There is a gen­eral pat­tern in which as soon as we un­der­stand some­thing, it be­comes some­thing lesser. As soon as we un­der­stand rain­bows, they are rel­e­gated to the “dull cat­a­logue of com­mon things”. This sug­gests a some­what cyn­i­cal ex­pla­na­tion of our con­cept of “in­tel­li­gence”: an agent is con­sid­ered in­tel­li­gent if we do not know how to achieve the out­comes it does us­ing the re­sources that it has (in which case our best model for that agent may be that it is pur­su­ing some goal, re­flect­ing our ten­dency to an­thro­po­mor­phize). That is, our eval­u­a­tion about in­tel­li­gence is a state­ment about our epistemic state. Some ex­am­ples that fol­low this pat­tern are:

  • As soon as we understand how some AI technique solves a challenging problem, it is no longer considered AI. Before we’ve solved the problem, we imagine that we need some sort of “intelligence” that is pointed towards the goal and solves it: the only method we have of predicting what this AI system will do is to think about what a system that tries to achieve the goal would do. Once we understand how the AI technique works, we have more insight into what it is doing and can make more detailed predictions about where it will work well, where it tends to make mistakes, etc., and so it no longer seems like “intelligence”. Once you know that OpenAI Five is trained by self-play, you can predict that it hasn’t seen certain behaviors, like standing still to turn invisible, and so probably won’t work well when it encounters them.

  • Be­fore we un­der­stood the idea of nat­u­ral se­lec­tion and evolu­tion, we would look at the com­plex­ity of na­ture and as­cribe it to in­tel­li­gent de­sign; once we had the math­e­mat­ics (and even just the qual­i­ta­tive in­sight), we could make much more de­tailed pre­dic­tions, and na­ture no longer seemed like it re­quired in­tel­li­gence. For ex­am­ple, we can pre­dict the timescales on which we can ex­pect evolu­tion­ary changes, which we couldn’t do if we just mod­eled evolu­tion as op­ti­miz­ing re­pro­duc­tive fit­ness.

  • Many phenomena (e.g. rain, wind) that we now have scientific explanations for were previously explained as the work of some anthropomorphic deity.

  • When some­one performs a feat of men­tal math, or can tell you in­stantly what day of the week a ran­dom date falls on, you might be im­pressed and think them very in­tel­li­gent. But if they ex­plain to you how they did it, you may find it much less im­pres­sive. (Though of course these feats are se­lected to seem more im­pres­sive than they are.)

Note that an alternative hypothesis is that humans equate intelligence with mystery; as we learn more and remove the mystery around e.g. evolution, we automatically think of it as less intelligent.

To the ex­tent that the Misspeci­fied Goal ar­gu­ment re­lies on this in­tu­ition, the ar­gu­ment feels a lot weaker to me. If the Misspeci­fied Goal ar­gu­ment rested en­tirely upon this in­tu­ition, then it would be as­sert­ing that be­cause we are ig­no­rant about what an in­tel­li­gent agent would do, we should as­sume that it is op­ti­miz­ing a goal, which means that it is go­ing to ac­cu­mu­late power and re­sources and lead to catas­tro­phe. In other words, it is ar­gu­ing that as­sum­ing that an agent is in­tel­li­gent defi­ni­tion­ally means that it will ac­cu­mu­late power and re­sources. This seems clearly wrong; it is pos­si­ble in prin­ci­ple to have an in­tel­li­gent agent that nonethe­less does not ac­cu­mu­late power and re­sources.

Also, the argument is not saying that in practice most intelligent agents accumulate power and resources. It says that we have no better model to go on than “goal-directed”, and then pushes this model to extreme scenarios where we should have a lot more uncertainty.

To be clear, I do not think that any­one would en­dorse the ar­gu­ment as stated. I am sug­gest­ing as a pos­si­bil­ity that the Misspeci­fied Goal ar­gu­ment re­lies on us in­cor­rectly equat­ing su­per­in­tel­li­gence with “pur­su­ing a goal” be­cause we use “pur­su­ing a goal” as a de­fault model for any­thing that can do in­ter­est­ing things, even if that is not the best model to be us­ing.


Summary

Intuitively, goal-directed behavior can lead to catastrophic outcomes with a sufficiently intelligent agent, because the optimal behavior for even a slightly misspecified goal can be very bad according to the true goal. However, it’s not clear exactly what we mean by goal-directed behavior. A sufficient condition is that the algorithm searches over possible actions and chooses the one with the highest “goodness”, but this is not a necessary condition.

“From the outside”, it seems like a goal-directed agent is characterized by the fact that we can predict the agent’s behavior in new situations by assuming that it is pursuing some goal, and as a result it acquires power and resources. This can be interpreted either as a statement about our epistemic state (we know so little about the agent that our best model is that it pursues a goal, even though this model is not very accurate or precise) or as a statement about the agent (predicting the behavior of the agent in new situations based on pursuit of a goal actually has very high precision and accuracy). These two views have very different implications for the validity of the Misspecified Goal argument for AI risk.

[1] This is an en­tirely made-up num­ber.

To­mor­row’s AI Align­ment Fo­rum se­quences post will be ‘Benign model-free RL’ by Paul Chris­ti­ano in the se­quence on iter­ated am­plifi­ca­tion.

The next post in this se­quence will be ‘Co­her­ence ar­gu­ments do not im­ply goal-di­rected be­hav­ior’ by Ro­hin Shah, on Sun­day 2nd De­cem­ber.