Will humans build goal-directed agents?

In the pre­vi­ous post, I ar­gued that sim­ply know­ing that an AI sys­tem is su­per­in­tel­li­gent does not im­ply that it must be goal-di­rected. How­ever, there are many other ar­gu­ments that sug­gest that AI sys­tems will or should be goal-di­rected, which I will dis­cuss in this post.

Note that I don’t think of this as the Tool AI vs. Agent AI ar­gu­ment: it seems pos­si­ble to build agent AI sys­tems that are not goal-di­rected. For ex­am­ple, imi­ta­tion learn­ing al­lows you to cre­ate an agent that be­haves similarly to an­other agent—I would clas­sify this as “Agent AI that is not goal-di­rected”. (But see this com­ment thread for dis­cus­sion.)

These arguments have different implications from the coherence-based argument that superintelligent AI must be goal-directed. Suppose you believe all of the following:

  • Any of the ar­gu­ments in this post.

  • Su­per­in­tel­li­gent AI is not re­quired to be goal-di­rected, as I ar­gued in the last post.

  • Goal-di­rected agents cause catas­tro­phe by de­fault.

Then you could try to cre­ate al­ter­na­tive de­signs for AI sys­tems such that they can do the things that goal-di­rected agents can do with­out them­selves be­ing goal-di­rected. You could also try to per­suade AI re­searchers of these facts, so that they don’t build goal-di­rected sys­tems.

Eco­nomic effi­ciency: goal-di­rected humans

Hu­mans want to build pow­er­ful AI sys­tems in or­der to help them achieve their goals—it seems quite clear that hu­mans are at least par­tially goal-di­rected. As a re­sult, it seems nat­u­ral that they would build AI sys­tems that are also goal-di­rected.

This is re­ally an ar­gu­ment that the sys­tem com­pris­ing the hu­man and AI agent should be di­rected to­wards some goal. The AI agent by it­self need not be goal-di­rected as long as we get goal-di­rected be­hav­ior when com­bined with a hu­man op­er­a­tor. How­ever, in the situ­a­tion where the AI agent is much more in­tel­li­gent than the hu­man, it is prob­a­bly best to del­e­gate most or all de­ci­sions to the agent, and so the agent could still look mostly goal-di­rected.

Even so, you could imagine that the small part of the work that the human continues to do allows the agent to not be goal-directed, especially over long horizons. For example, perhaps the human decides what the agent should do each day, and the agent executes the instruction, which involves planning over the course of a day, but no longer. (I am not arguing that this is safe; on the contrary, having very powerful optimization over the course of a day seems probably unsafe.) This could be extremely powerful without the AI being goal-directed over the long term.
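A minimal sketch of this kind of bounded-horizon setup, assuming a deterministic toy environment. The names `plan_within_horizon`, `step`, and `score` are hypothetical, and exhaustive search stands in for whatever optimization a real system would use:

```python
from itertools import product
from typing import Callable, Sequence, Tuple


def plan_within_horizon(
    state: int,
    actions: Sequence[int],
    step: Callable[[int, int], int],
    score: Callable[[int], float],
    horizon: int,
) -> Tuple[int, ...]:
    """Search every action sequence of length `horizon` and return the one
    whose final state scores highest. Nothing past `horizon` is considered."""
    best_plan: Tuple[int, ...] = ()
    best_score = float("-inf")
    for plan in product(actions, repeat=horizon):
        s = state
        for a in plan:
            s = step(s, a)  # hypothetical deterministic transition
        plan_score = score(s)  # the human-supplied instruction for this "day"
        if plan_score > best_score:
            best_plan, best_score = plan, plan_score
    return best_plan


if __name__ == "__main__":
    # Toy instruction for one "day": end up as close to position 3 as possible.
    plan = plan_within_horizon(
        state=0,
        actions=[-1, 1],
        step=lambda s, a: s + a,
        score=lambda s: -abs(s - 3),
        horizon=3,
    )
    print(plan)
```

The human would supply a fresh `score` each day, so any optimization pressure is confined to `horizon` steps. As noted above, that does not make the within-day optimization safe.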

Another ex­am­ple would be a cor­rigible agent, which could be ex­tremely pow­er­ful while not be­ing goal-di­rected over the long term. (Though the mean­ings of “goal-di­rected” and “cor­rigible” are suffi­ciently fuzzy that this is not ob­vi­ous and de­pends on the defi­ni­tions we set­tle on for each.)

Eco­nomic effi­ciency: be­yond hu­man performance

Another benefit of goal-di­rected be­hav­ior is that it al­lows us to find novel ways of achiev­ing our goals that we may not have thought of, such as AlphaGo’s move 37. Goal-di­rected be­hav­ior is one of the few meth­ods we know of that al­low AI sys­tems to ex­ceed hu­man perfor­mance.

I think this is a good ar­gu­ment for goal-di­rected be­hav­ior, but given the prob­lems of goal-di­rected be­hav­ior I think it’s worth search­ing for al­ter­na­tives, such as the two ex­am­ples in the pre­vi­ous sec­tion (op­ti­miz­ing over a day, and cor­rigi­bil­ity). Alter­na­tively, we could learn hu­man rea­son­ing, and ex­e­cute it for a longer sub­jec­tive time than hu­mans would, in or­der to make bet­ter de­ci­sions. Or we could have sys­tems that re­main un­cer­tain about the goal and clar­ify what they should do when there are mul­ti­ple very differ­ent op­tions (though this has its own prob­lems).

Cur­rent progress in re­in­force­ment learning

If we had to guess today which paradigm would lead to AI systems that can exceed human performance, I would guess reinforcement learning (RL). In RL, we have a reward function and we seek to choose actions that maximize the expected sum of discounted rewards. This sounds a lot like an agent that is searching over actions for the best one according to a measure of goodness (the reward function [1]), which I said previously is a goal-directed agent. And the math behind RL says that the agent should be trying to maximize its reward for the rest of time, which makes it long-term [2].
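As a concrete reading of that objective, here is a minimal Python sketch. The function names `discounted_return` and `greedy_action` and the toy Q-values are illustrative assumptions, not taken from any particular RL library:

```python
from typing import Dict, Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma**t * r_t over one trajectory of rewards."""
    total = 0.0
    # Accumulate from the last reward backwards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def greedy_action(q_values: Dict[str, float]) -> str:
    """Pick the action whose estimated expected return (Q-value) is highest."""
    return max(q_values, key=q_values.get)


if __name__ == "__main__":
    # Three rewards of 1 with gamma = 0.5 contribute 1 + 0.5 + 0.25.
    print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
    # Hypothetical Q-values for two actions in some state.
    print(greedy_action({"left": 0.2, "right": 0.7}))
```

Choosing the arg-max of estimated returns is the "search over actions for the best one" in its simplest form.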

That said, cur­rent RL agents learn to re­play be­hav­ior that in their past ex­pe­rience worked well, and typ­i­cally do not gen­er­al­ize out­side of the train­ing dis­tri­bu­tion. This does not seem like a search over ac­tions to find ones that are the best. In par­tic­u­lar, you shouldn’t ex­pect a treach­er­ous turn, since the whole point of a treach­er­ous turn is that you don’t see it com­ing be­cause it never hap­pened be­fore.

In ad­di­tion, cur­rent RL is epi­sodic, so we should only ex­pect that RL agents are goal-di­rected over the cur­rent epi­sode and not in the long-term. Of course, many tasks would have very long epi­sodes, such as be­ing a CEO. The vanilla deep RL ap­proach here would be to spec­ify a re­ward func­tion for how good a CEO you are, and then try many differ­ent ways of be­ing a CEO and learn from ex­pe­rience. This re­quires you to col­lect many full epi­sodes of be­ing a CEO, which would be ex­tremely time-con­sum­ing.
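One way to see what "goal-directed only over the current episode" means in code is the usual return-to-go computation, where the running return is reset at each episode boundary. This is a sketch under assumptions: the `dones` flag convention and the name `episodic_returns` are illustrative and not tied to a specific library.

```python
from typing import List, Sequence


def episodic_returns(
    rewards: Sequence[float], dones: Sequence[bool], gamma: float
) -> List[float]:
    """Discounted return-to-go at each step, reset at episode boundaries."""
    if len(rewards) != len(dones):
        raise ValueError("rewards and dones must have the same length")
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        if dones[t]:
            # Step t ends an episode, so nothing after it counts toward its return.
            running = 0.0
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


if __name__ == "__main__":
    # Two episodes of two steps each, undiscounted for readability.
    print(episodic_returns([1.0, 1.0, 1.0, 1.0], [False, True, False, True], 1.0))
```

Rewards from the second episode never enter the returns of the first, so the training signal gives the agent no reason to trade off anything across episodes.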

Per­haps with enough ad­vances in model-based deep RL we could train the model on par­tial tra­jec­to­ries and that would be enough, since it could gen­er­al­ize to full tra­jec­to­ries. I think this is a ten­able po­si­tion, though I per­son­ally don’t ex­pect it to work since it re­lies on our model gen­er­al­iz­ing well, which seems un­likely even with fu­ture re­search.

Th­ese ar­gu­ments lead me to be­lieve that we’ll prob­a­bly have to do some­thing that is not vanilla deep RL in or­der to train an AI sys­tem that can be a CEO, and that thing may not be goal-di­rected.

Over­all, it is cer­tainly pos­si­ble that im­proved RL agents will look like dan­ger­ous long-term goal-di­rected agents, but this does not seem to be the case to­day and there seem to be se­ri­ous difficul­ties in scal­ing cur­rent al­gorithms to su­per­in­tel­li­gent AI sys­tems that can op­ti­mize over the long term. (I’m not ar­gu­ing for long timelines here, since I wouldn’t be sur­prised if we figured out some way that wasn’t vanilla deep RL to op­ti­mize over the long term, but that method need not be goal-di­rected.)

Ex­ist­ing in­tel­li­gent agents are goal-directed

So far, humans and perhaps animals are the only examples of generally intelligent agents that we know of, and they seem to be quite goal-directed. This is some evidence that we should expect intelligent agents that we build to also be goal-directed.

Ultimately we are observing a correlation between two things with a sample size of 1, which is really not much evidence at all. If you believe that many animals are also intelligent and goal-directed, then perhaps the sample size is larger, since there are intelligent animals with very different evolutionary histories and neural architectures (e.g. octopuses).

How­ever, this is speci­fi­cally about agents that were cre­ated by evolu­tion, which did a rel­a­tively stupid blind search over a large space, and we could use a differ­ent method to de­velop AI sys­tems. So this ar­gu­ment makes me more wary of cre­at­ing AI sys­tems us­ing evolu­tion­ary searches over large spaces, but it doesn’t make me much more con­fi­dent that all good AI sys­tems must be goal-di­rected.

Interpretability


Another ar­gu­ment for build­ing a goal-di­rected agent is that it al­lows us to pre­dict what it’s go­ing to do in novel cir­cum­stances. While you may not be able to pre­dict the spe­cific ac­tions it will take, you can pre­dict some fea­tures of the fi­nal world state, in the same way that if I were to play Mag­nus Car­lsen at chess, I can’t pre­dict how he will play, but I can pre­dict that he will win.

I do not understand the intent behind this argument. It seems as though, faced with the negative results suggesting that goal-directed behavior tends to cause catastrophic outcomes, we're arguing that it's a good idea to build a goal-directed agent so that we can more easily predict that it's going to cause catastrophe.

I also think that we would typically be able to predict significantly more about what any AI system we actually build will do than we could by modeling it as trying to achieve some goal. This is because "agent seeking a particular goal" is one of the simplest models we can build, and with any system we have more information on, we start refining the model to make it better.

Summary


Over­all, I think there are good rea­sons to think that “by de­fault” we would de­velop goal-di­rected AI sys­tems, be­cause the things we want AIs to do can be eas­ily phrased as goals, and be­cause the stated goal of re­in­force­ment learn­ing is to build goal-di­rected agents (al­though they do not look like goal-di­rected agents to­day). As a re­sult, it seems im­por­tant to figure out ways to get the pow­er­ful ca­pa­bil­ities of goal-di­rected agents through agents that are not them­selves goal-di­rected. In par­tic­u­lar, this sug­gests that we will need to figure out ways to build AI sys­tems that do not in­volve spec­i­fy­ing a util­ity func­tion that the AI should op­ti­mize, or even learn­ing a util­ity func­tion that the AI then op­ti­mizes.

[1] Technically, actions are chosen according to the Q-function, but the distinction isn't important here.

[2] Discounting does cause us to prioritize short-term rewards over long-term ones. On the other hand, discounting seems mostly like a hack to make the math not spit out infinities, and to make learning more stable. On the third hand, infinite-horizon MDPs with undiscounted reward aren't solvable unless you almost surely enter an absorbing state. So discounting complicates the picture, but not in a particularly interesting way, and I don't want to rest an argument against long-term goal-directed behavior on the presence of discounting.
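To illustrate the "infinities" point in footnote [2], here is a small sketch for the special case of a constant per-step reward. The function names are hypothetical. With a discount factor below 1 the sum converges to reward / (1 - gamma), while the undiscounted sum grows without bound as the number of steps increases.

```python
def discounted_sum_constant(reward: float, gamma: float, steps: int) -> float:
    """Partial sum of reward * gamma**t for t = 0 .. steps - 1."""
    return sum(reward * gamma**t for t in range(steps))


def discounted_limit(reward: float, gamma: float) -> float:
    """Closed-form limit reward / (1 - gamma), valid only for 0 <= gamma < 1."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("the infinite sum only converges for 0 <= gamma < 1")
    return reward / (1.0 - gamma)


if __name__ == "__main__":
    for steps in (10, 100, 1000):
        # Discounted partial sums approach the closed-form limit.
        print(steps, discounted_sum_constant(1.0, 0.9, steps), discounted_limit(1.0, 0.9))
        # With gamma = 1 the partial sum is just reward * steps and keeps growing.
        print(steps, discounted_sum_constant(1.0, 1.0, steps))
```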

The next post in this se­quence will be ‘AI safety with­out goal-di­rected be­hav­ior’ by Ro­hin Shah, on Sun­day 6th Jan­uary.