Seeking Power is Often Provably Instrumentally Convergent in MDPs


In 2008, Steve Omo­hun­dro’s foun­da­tional Ba­sic AI Drives made im­por­tant con­jec­tures about what su­per­in­tel­li­gent goal-di­rected AIs might do, in­clud­ing gain­ing as much power as pos­si­ble to best achieve their goals. Toy mod­els have been con­structed in which Omo­hun­dro’s con­jec­tures bear out, and the sup­port­ing philo­soph­i­cal ar­gu­ments are in­tu­itive. The con­jec­tures have re­cently been the cen­ter of de­bate be­tween well-known AI re­searchers.

In­stru­men­tal con­ver­gence has been heuris­ti­cally un­der­stood as an an­ti­ci­pated risk, but not as a for­mal phe­nomenon with a well-un­der­stood cause. The goal of this post (and ac­com­pa­ny­ing pa­per) is to change that.

My re­sults strongly sug­gest that, within the Markov de­ci­sion pro­cess for­mal­ism (the sta­ple of re­in­force­ment learn­ing[1]), the struc­ture of the agent’s en­vi­ron­ment means that most goals in­cen­tivize gain­ing power over that en­vi­ron­ment. Fur­ther­more, max­i­mally gain­ing power over an en­vi­ron­ment is bad for other agents therein. That is, power seems con­stant-sum af­ter a cer­tain point.

I’m go­ing to provide the in­tu­itions for a mechanis­tic un­der­stand­ing of power and in­stru­men­tal con­ver­gence, and then in­for­mally show how op­ti­mal ac­tion usu­ally means try­ing to stay al­ive, gain power, and take over the world; read the pa­per for the rigor­ous ver­sion. Lastly, I’ll talk about why these re­sults ex­cite me.


I claim that

The struc­ture of the agent’s en­vi­ron­ment means that most goals in­cen­tivize gain­ing power over that en­vi­ron­ment.

By environment, I mean the thing the agent thinks it’s interacting with. Here, we’re going to think about dualistic environments where you can see the whole state, and where there are only finitely many states to see and actions to take. Also, future stuff gets geometrically discounted; at discount rate $\frac{1}{2}$, this means stuff in one turn is half as important as stuff now, stuff in two turns is a quarter as important, and so on. Pac-Man is an environment structured like this: you see the game screen (the state), you take an action, and then you get a result (another state). There are only finitely many screens, and only finitely many actions – they all had to fit onto the arcade controller, after all!
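To make the discounting concrete, here is a minimal Python sketch. The function names `discount_weights` and `discounted_return` are illustrative and not taken from the paper; the sketch assumes the reward received `t` turns from now is weighted by the discount rate raised to the power `t`.

```python
def discount_weights(gamma: float, horizon: int) -> list[float]:
    """Weight placed on the reward received at each of the first `horizon` turns."""
    return [gamma ** t for t in range(horizon)]


def discounted_return(rewards: list[float], gamma: float) -> float:
    """Total discounted reward of a finite reward sequence."""
    weights = discount_weights(gamma, len(rewards))
    return sum(w * r for w, r in zip(weights, rewards))


# At discount rate 1/2: now counts fully, next turn half, then a quarter, ...
weights = discount_weights(0.5, 4)
print(weights)  # [1.0, 0.5, 0.25, 0.125]
```

A discount rate closer to 1 flattens these weights, which is what “farsighted” refers to later in the post.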

When I talk about “goals”, I’m talk­ing about re­ward func­tions over states: each way-the-world-could-be gets as­signed some point value. The canon­i­cal way of earn­ing points in Pac-Man is just one pos­si­ble re­ward func­tion for the game.

Instrumental convergence supposedly exists for sufficiently wide varieties of goals, so today we’ll think about the most variety possible: the distribution of goals where each possible state is uniformly randomly assigned a reward in the interval $[0, 1]$ (although the theorems hold for a lot more distributions than this[2]). Sometimes, I’ll say things like “most agents do $X$”, which means “maximizing total discounted reward usually entails doing $X$ when your goals are drawn from the uniform distribution”. We say agents are “farsighted” when the discount rate is sufficiently close to $1$ (the agent doesn’t prioritize immediate reward over delayed gratification).


You can do things in the world and take differ­ent paths through time. Let’s call these paths “pos­si­bil­ities”; they’re like film­strips of how the fu­ture could go.

If you have more con­trol over the fu­ture, you’re usu­ally[3] choos­ing among more paths-through-time. This lets you more pre­cisely con­trol what kinds of things hap­pen later. This is one way to con­cretize what peo­ple mean when they use the word ‘power’ in ev­ery­day speech, and will be the defi­ni­tion used go­ing for­ward: the abil­ity to achieve goals in gen­eral.[4] In other words, power is the av­er­age at­tain­able util­ity across a dis­tri­bu­tion of goals.

This defi­ni­tion seems philo­soph­i­cally rea­son­able: if you have a lot of money, you can make more things hap­pen and have more power. If you have so­cial clout, you can spend that in var­i­ous ways to bet­ter tai­lor the fu­ture to var­i­ous ends. Dy­ing means you can’t do much at all, and all else equal, los­ing a limb de­creases your power.

Ex­er­cise: spend a few min­utes con­sid­er­ing whether real-world in­tu­itive ex­am­ples of power are ex­plained by this defi­ni­tion.

Once you feel com­fortable that it’s at least a pretty good defi­ni­tion, we can move on.

Imag­ine a sim­ple game with three choices: eat candy, eat a choco­late bar, or hug a friend.

The power of a state is how well agents can gen­er­ally do by start­ing from that state. It’s im­por­tant to note that we’re con­sid­er­ing power from be­hind a “veil of ig­no­rance” about the re­ward func­tion. We’re av­er­ag­ing the best we can do for a lot of differ­ent in­di­vi­d­ual goals.

Each reward function has an optimal possibility, or path-through-time. If chocolate has maximal reward, then the optimal possibility is the path that leads to chocolate.

Since the distribution randomly assigns a value in $[0, 1]$ to each state, an agent can expect to average $\frac{3}{4}$ reward. This is because you’re choosing between three choices, each of which has some value between $0$ and $1$. The expected maximum of $n$ draws from uniform $[0, 1]$ is $\frac{n}{n+1}$; you have three draws here, so you expect to be able to get $\frac{3}{4}$ reward. Now, some reward functions do worse than this, and some do better; but on average, they get $\frac{3}{4}$ reward. You can test this out for yourself.

If you have no choices, you expect to average $\frac{1}{2}$ reward: sometimes the future is great, sometimes it’s not. Conversely, the more things you can choose between, the closer this gets to $1$ (i.e., you can do well by all goals, because each has a great chance of being able to steer the future how you want).
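These averages can be checked numerically. The sketch below is a Monte Carlo estimate under the uniform distribution, assuming a one-shot choice among `n_choices` outcomes whose rewards are independent uniform draws from $[0, 1]$. The function names, sample size, and seed are arbitrary choices made for illustration; the closed form $\frac{n}{n+1}$ is the standard expected maximum of $n$ such draws.

```python
import random


def exact_power(n_choices: int) -> float:
    """Expected maximum of n independent uniform [0, 1] draws: n / (n + 1)."""
    return n_choices / (n_choices + 1)


def estimate_power(n_choices: int, n_goals: int = 100_000, seed: int = 0) -> float:
    """Average, over sampled goals, of the best reward available among the choices."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_goals):
        # One sampled goal: an independent uniform reward for each choice.
        total += max(rng.random() for _ in range(n_choices))
    return total / n_goals


for n in (1, 3, 10):
    print(n, round(estimate_power(n), 3), "exact:", round(exact_power(n), 3))
```

With one choice the estimate sits near one half, with three near three quarters, and it creeps towards 1 as choices are added.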

In­stru­men­tal convergence

Plans that help you bet­ter reach a lot of goals are called in­stru­men­tally con­ver­gent. To travel as quickly as pos­si­ble to a ran­domly se­lected co­or­di­nate on Earth, one likely be­gins by driv­ing to the near­est air­port. Although it’s pos­si­ble that the co­or­di­nate is within driv­ing dis­tance, it’s not likely. Driv­ing to the air­port would then be in­stru­men­tally con­ver­gent for travel-re­lated goals.

We define in­stru­men­tal con­ver­gence as op­ti­mal agents be­ing more likely to take one ac­tion than an­other at some point in the fu­ture. I want to em­pha­size that when I say “likely”, I mean from be­hind the veil of ig­no­rance. Sup­pose I say that it’s 50% likely that agents go left, and 50% likely they go right. This doesn’t mean any agent has the stochas­tic policy of 50% left /​ 50% right. This means that, when draw­ing goals from our dis­tri­bu­tion, 50% of the time op­ti­mal pur­suit of the goal en­tails go­ing left, and 50% of the time it en­tails go­ing right.
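The difference between “50% of goals go left” and “a 50/50 stochastic policy” can be illustrated with a short sketch. It assumes a hypothetical two-outcome environment (the labels `left` and `right` and all function names are mine) with independent uniform rewards. Each sampled goal yields a deterministic optimal action; only the tally across goals is split roughly evenly.

```python
import random
from collections import Counter


def optimal_action(reward_left: float, reward_right: float) -> str:
    """Deterministic optimal choice for one fixed goal (ties have probability zero)."""
    return "left" if reward_left > reward_right else "right"


def action_frequencies(n_goals: int = 100_000, seed: int = 0) -> dict[str, float]:
    """Fraction of sampled goals whose optimal action is each option."""
    rng = random.Random(seed)
    tally = Counter(optimal_action(rng.random(), rng.random()) for _ in range(n_goals))
    return {action: tally[action] / n_goals for action in ("left", "right")}


print(action_frequencies())
```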

Con­sider ei­ther eat­ing candy now, or earn­ing some re­ward for wait­ing a sec­ond be­fore choos­ing be­tween choco­late and hugs.

Let’s think about how op­ti­mal ac­tion tends to change as we start car­ing about the fu­ture more. Think about all the places you can be af­ter just one turn:

We could be in two places. Imagine we only care about the reward we get next turn. How many goals choose candy over waiting? Well, it’s 50-50 – since we randomly choose a number between 0 and 1 for each state, both states have an equal chance of being maximal. About half of nearsighted agents go to candy and half go to waiting. There isn’t much instrumental convergence yet. Note that this is also why nearsighted agents tend not to seek power.

Now think about where we can be in two turns:

We could be in three places. Supposing we care more about the future, more of our future control is coming from choosing between chocolate and hugs. In other words, about two thirds of our power is coming from our ability to wait. But is waiting instrumentally convergent? If the agent is farsighted, the answer is yes (why?).

In the limit of farsightedness, the chance of each possibility being optimal approaches $\frac{1}{3}$ (each terminal state has an equal chance to be maximal).
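The shift from 50-50 towards two thirds can be reproduced with a small simulation. This is a sketch under my own modelling assumptions rather than the paper’s code: the states are labelled candy, wait, chocolate, and hug; candy, chocolate, and hug are absorbing (the agent keeps collecting that state’s reward forever); every state’s reward is an independent uniform draw from $[0, 1]$; and the starting state’s reward is ignored because it is the same on every path. Under those assumptions, waiting is optimal for a fraction $\frac{1}{2} + \frac{\gamma}{6}$ of goals at discount rate $\gamma$, which the estimate should track.

```python
import random


def prob_wait_is_optimal(gamma: float, n_goals: int = 100_000, seed: int = 0) -> float:
    """Estimate the fraction of uniformly drawn goals for which waiting beats candy."""
    rng = random.Random(seed)
    wait_wins = 0
    for _ in range(n_goals):
        r_candy, r_wait, r_choc, r_hug = (rng.random() for _ in range(4))
        # Discounted returns from the first step onward, rescaled by the positive
        # constant (1 - gamma) / gamma so that the comparison is unchanged.
        # At gamma = 0 this reduces to comparing next-turn rewards only.
        candy_value = r_candy
        wait_value = (1 - gamma) * r_wait + gamma * max(r_choc, r_hug)
        if wait_value > candy_value:
            wait_wins += 1
    return wait_wins / n_goals


for gamma in (0.0, 0.5, 0.999):
    estimate = prob_wait_is_optimal(gamma)
    print(gamma, round(estimate, 3), "closed form:", round(0.5 + gamma / 6, 3))
```

In the farsighted limit the two thirds of goals that wait split evenly between chocolate and hugs, which gives each of the three possibilities a one-third chance.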

There are two im­por­tant things hap­pen­ing here.

Im­por­tant Thing #1

In­stru­men­tal con­ver­gence doesn’t hap­pen in all en­vi­ron­ments. An agent start­ing at blue isn’t more likely to go up or down at any given point in time.

There’s also never instrumental convergence when the agent doesn’t care about the future at all (when the discount rate is $0$). However, let’s think back to what happens in the waiting environment:

As the agent becomes farsighted, the chocolate and hug possibilities become more likely.

We can show that in­stru­men­tal con­ver­gence ex­ists in an en­vi­ron­ment if and only if a path through time be­comes more likely as the agent cares more about the fu­ture.

Im­por­tant Thing #2

The more con­trol-at-fu­ture-timesteps an ac­tion pro­vides, the more likely it is to be se­lected. What an in­trigu­ing “co­in­ci­dence”!


So, it sure seems like gain­ing power is a good idea for a lot of agents!

Hav­ing tasted a few hints for why this is true, we’ll now walk through the in­tu­itions a lit­tle more ex­plic­itly. This, in turn, will show some pretty cool things: most agents avoid dy­ing in Pac-Man, keep the Tic-Tac-Toe game go­ing as long as pos­si­ble, and avoid de­ac­ti­va­tion in real life.[5]

Let’s fo­cus on an en­vi­ron­ment with the same rules as Tic-Tac-Toe, but con­sid­er­ing the uniform dis­tri­bu­tion over re­ward func­tions. The agent (play­ing ) keeps ex­pe­rienc­ing the fi­nal state over and over when the game’s done. We bake the op­po­nent’s policy into the en­vi­ron­ment’s rules: when you choose a move, the game au­to­mat­i­cally replies.

When­ever we make a move that ends the game, we can’t reach any­thing else – we have to stay put. Since each fi­nal state has the same chance of be­ing op­ti­mal, a move which doesn’t end the game is more likely than a move which does. Let’s look at part of the game tree, with in­stru­men­tally con­ver­gent moves shown in green.

Starting on the left, all but one move leads to ending the game, but the second-to-last move allows us to keep choosing between five more final outcomes. For reasonably farsighted agents at the first state, the green move is ~50% likely to be optimal, while each of the others is only best for ~10% of goals. So we see a kind of “self-preservation” arising, even in Tic-Tac-Toe.
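The ~50% and ~10% figures follow from counting final states. The sketch below is a stylized abstraction of this game-tree fragment, not an implementation of Tic-Tac-Toe: it assumes the farsighted limit, in which every reachable final state is equally likely to be the best one, and it assumes no two moves lead to the same final state. The count of five game-ending moves is inferred from the percentages quoted above, and the move labels are hypothetical.

```python
from fractions import Fraction


def farsighted_move_probabilities(terminal_counts: dict[str, int]) -> dict[str, Fraction]:
    """Chance each first move is optimal when all reachable final states are equally likely to be best."""
    total = sum(terminal_counts.values())
    return {move: Fraction(count, total) for move, count in terminal_counts.items()}


# Five moves end the game at once (one final state each); the remaining move
# keeps five further final outcomes reachable.
counts = {f"ending_move_{i}": 1 for i in range(1, 6)}
counts["green_move"] = 5

probabilities = farsighted_move_probabilities(counts)
print(probabilities["green_move"], probabilities["ending_move_1"])  # 1/2 1/10
```

For agents that are farsighted but short of the limit, the reward of the intermediate state also matters, so the real figures are only approximately these fractions.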

Remember how, as the agent gets more farsighted, more of its control comes from choosing between chocolate and hugs, while these two possibilities also become more and more likely?

The same thing is hap­pen­ing in Tic-Tac-Toe. Let’s think about what hap­pens as the agent cares more about later and later time steps.

The ini­tial green move con­tributes more and more con­trol, so it be­comes more and more likely as we be­come more far­sighted. This isn’t a co­in­ci­dence.

Power-seek­ing is in­stru­men­tally con­ver­gent.

Rea­sons for excitement

The di­rect takeaway

I’m ob­vi­ously not “ex­cited” that power-seek­ing hap­pens by de­fault, but I’m ex­cited that we can see this risk more clearly. I’m also plan­ning on get­ting this work peer-re­viewed be­fore pur­pose­fully en­ter­ing it into the afore­men­tioned main­stream de­bate, but here are some of my pre­limi­nary thoughts.

Imag­ine you have good for­mal rea­sons to sus­pect that typ­ing ran­dom strings will usu­ally blow up your com­puter and kill you. Would you then say, “I’m not plan­ning to type ran­dom strings”, and pro­ceed to en­ter your the­sis into a word pro­ces­sor? No. You wouldn’t type any­thing yet, not un­til you re­ally, re­ally un­der­stand what makes the com­puter blow up some­times.

The over­all con­cern raised by [the power-seek­ing the­o­rem] is not that we will build pow­er­ful RL agents with ran­domly se­lected goals. The con­cern is that ran­dom re­ward func­tion in­puts pro­duce ad­ver­sar­ial power-seek­ing be­hav­ior, which can pro­duce per­verse in­cen­tives such as avoid­ing de­ac­ti­va­tion and ap­pro­pri­at­ing re­sources. There­fore, we should have spe­cific rea­son to be­lieve that pro­vid­ing the re­ward func­tion we had in mind will not end in catas­tro­phe.

Speak­ing to the broader de­bate tak­ing place in the AI re­search com­mu­nity, I think a pro­duc­tive pos­ture here will be in­ves­ti­gat­ing and un­der­stand­ing these re­sults in more de­tail, get­ting cu­ri­ous about un­ex­pected phe­nom­ena, and see­ing how the num­bers crunch out in rea­son­able mod­els. I think that even though the al­ign­ment com­mu­nity may have su­perfi­cially un­der­stood many of these con­clu­sions, there are many new con­cepts for the broader AI com­mu­nity to ex­plore.

In­ci­den­tally, if you’re a mem­ber of this broader com­mu­nity and have ques­tions, please feel free to email me at .

Ex­plain­ing catastrophes

AI al­ign­ment re­search can of­ten have a slip­pery feel­ing to it. We’re try­ing hard to be­come less con­fused about ba­sic con­cepts, and there’s only ev­ery­thing on the line.

What are “agents”? Do peo­ple even have “val­ues”, and should we try to get the AI to learn them? What does it mean to be “cor­rigible”, or “de­cep­tive”? What are our ma­chine learn­ing mod­els even do­ing? I mean, some­times we get a for­mal open ques­tion (and this the­ory of pos­si­bil­ities has a few of those), but not usu­ally.

We have to do philo­soph­i­cal work while in a state of sig­nifi­cant con­fu­sion and ig­no­rance about the na­ture of in­tel­li­gence and al­ign­ment. We’re grop­ing around in the dark with only pe­ri­odic flashes of in­sight to guide us.

In this con­text, we were like,

wow, it seems like ev­ery time I think of op­ti­mal plans for these ar­bi­trary goals, the AI can best com­plete them by gain­ing a ton of power to make sure it isn’t shut off. Every­thing slightly wrong leads to doom, ap­par­ently?

and we didn’t re­ally know why. In­tu­itively, it’s pretty ob­vi­ous that most agents don’t have de­ac­ti­va­tion as their dream out­come, but we couldn’t ac­tu­ally point to any for­mal ex­pla­na­tions, and we cer­tainly couldn’t make pre­cise pre­dic­tions.

On its own, Good­hart’s law doesn’t ex­plain why op­ti­miz­ing proxy goals leads to catas­troph­i­cally bad out­comes, in­stead of just less-than-ideal out­comes.

I’ve heard that, from this state of ig­no­rance, al­ign­ment pro­pos­als shouldn’t rely on in­stru­men­tal con­ver­gence be­ing a thing (and I agree). If you’re build­ing su­per­in­tel­li­gent sys­tems for which slight mis­takes ap­par­ently lead to ex­tinc­tion, and you want to eval­u­ate whether your pro­posal to avoid ex­tinc­tion will work, you ob­vi­ously want to deeply un­der­stand why ex­tinc­tion hap­pens by de­fault.

We’re now start­ing to have this kind of un­der­stand­ing. I sus­pect that power-seek­ing is the thing that makes ca­pa­ble goal-di­rected agency so dan­ger­ous.[6] If we want to con­sider more be­nign al­ter­na­tives to goal-di­rected agency, then deeply un­der­stand­ing why goal-di­rected agency is bad is im­por­tant for eval­u­at­ing al­ter­na­tives. This work lets us get a feel for the char­ac­ter of the un­der­ly­ing in­cen­tives of a pro­posed sys­tem de­sign.


Defin­ing power as “the abil­ity to achieve goals in gen­eral” seems to cap­ture just the right thing. I think it’s good enough that I view im­por­tant the­o­rems about power (as defined in the pa­per) as philo­soph­i­cally in­sight­ful.

Con­sid­er­ing power in this way seems to for­mally cap­ture our in­tu­itive no­tions about what re­sources are. For ex­am­ple, our cur­rent po­si­tion in the en­vi­ron­ment means that hav­ing money al­lows us to ex­ert more con­trol over the fu­ture. That is, our cur­rent po­si­tion in the state space means that hav­ing money al­lows more pos­si­bil­ities and greater power (in the for­mal sense). How­ever, pos­sess­ing green scraps of pa­per would not be as helpful if one were liv­ing alone near Alpha Cen­tauri. In a sense, re­source ac­qui­si­tion can nat­u­rally be viewed as tak­ing steps to in­crease one’s power.

Power might be im­por­tant for rea­son­ing about the strat­egy-steal­ing as­sump­tion (and I think it might be similar to what Paul means by “flex­ible in­fluence over the fu­ture”). Evan Hub­inger has already noted the util­ity of the dis­tri­bu­tion of at­tain­able util­ity shifts for think­ing about value-neu­tral­ity in this con­text (and power is an­other facet of the same phe­nomenon). If you want to think about whether, when, and why mesa op­ti­miz­ers might try to seize power, this the­ory seems like a valuable tool.

And, of course, we’re go­ing to use this no­tion of power to de­sign an im­pact mea­sure.

The for­mal­iza­tion of in­stru­men­tal con­ver­gence seems to be cor­rect. We’re able to now make de­tailed pre­dic­tions about e.g. how the difficulty of get­ting re­ward af­fects the level of far­sight­ed­ness at which seiz­ing power tends to make sense. This also might be rele­vant for think­ing about my­opic agency, as the broader the­ory for­mally de­scribes how op­ti­mal ac­tion tends to change with the dis­count fac­tor.

Another use­ful con­cep­tual dis­tinc­tion is that power and in­stru­men­tal con­ver­gence aren’t the same thing; we can con­struct en­vi­ron­ments where the state with the high­est power is not in­stru­men­tally con­ver­gent from an­other state.

ETA: Here’s an ex­cerpt from the pa­per:

So, just because a state has more resources doesn’t mean it holds great opportunity from the agent’s current vantage point. In the above example, optimal action generally means going directly towards the optimal terminal state.

Here’s what the rele­vant cur­rent re­sults say: parts of the fu­ture al­low­ing you to reach more ter­mi­nal states are in­stru­men­tally con­ver­gent, and the for­mal POWER con­tri­bu­tions of differ­ent pos­si­bil­ities are ap­prox­i­mately pro­por­tion­ally re­lated to in­stru­men­tal con­ver­gence.

I think the Tic-Tac-Toe rea­son­ing is helpful: it’s in­stru­men­tally con­ver­gent to reach parts of the fu­ture which give you more con­trol from your cur­rent van­tage point. I’m work­ing on ex­pand­ing the for­mal re­sults to in­clude some ver­sion of this. I’ve since fur­ther clar­ified some claims made in the ini­tial ver­sion of this post.

The broader theory of possibilities lends significant insight into the structure of Markov decision processes; it feels like a piece of basic theory that was never discovered earlier, for whatever reason. More on this another time.

Fu­ture deconfusion

What ex­cites me the most is a lit­tle more vague: there’s a new piece of AI al­ign­ment we can deeply un­der­stand, and un­der­stand­ing breeds un­der­stand­ing.


This work was made pos­si­ble by the Cen­ter for Hu­man-Com­pat­i­ble AI, the Berkeley Ex­is­ten­tial Risk Ini­ti­a­tive, and the Long-Term Fu­ture Fund.

Logan Smith (elriggs) spent an enormous amount of time writing Mathematica code to compute power and measure in arbitrary toy MDPs, saving me from needing to repeatedly do quintuple+ integrations by hand. I thank Rohin Shah for his detailed feedback and brainstorming over the summer, and Tiffany Cai for the argument that arbitrary possibilities have expected value $\frac{1}{2}$ (and so optimal average control can’t be worse than this). Zack M. Davis, Chase Denecke, William Ellsworth, Vahid Ghadakchi, Ofer Givoli, Evan Hubinger, Neale Ratzlaff, Jess Riedel, Duncan Sabien, Davide Zagami, and TheMajor gave feedback on drafts of this post.

  1. It seems rea­son­able to ex­pect the key re­sults to gen­er­al­ize in spirit to larger classes of en­vi­ron­ments, but keep in mind that the claims I make are only proven to ap­ply to finite MDPs. ↩︎

  2. Specifically, consider any continuous bounded distribution distributed identically over the state space. The kind of power-seeking and Tic-Tac-Toe-esque instrumental convergence I’m gesturing at should also hold for discontinuous bounded nondegenerate distributions.

    The power-seek­ing ar­gu­ment works for ar­bi­trary dis­tri­bu­tions over re­ward func­tions (with in­stru­men­tal con­ver­gence also be­ing defined with re­spect to that dis­tri­bu­tion) – iden­ti­cal dis­tri­bu­tion en­forces “fair­ness” over the differ­ent parts of the en­vi­ron­ment. It’s not as if in­stru­men­tal con­ver­gence might not ex­ist for ar­bi­trary dis­tri­bu­tions – it’s just that proofs for them are less in­for­ma­tive (be­cause we don’t know their struc­ture a pri­ori).

    For example, without identical distribution, we can’t say that agents (roughly) tend to preserve the ability to reach as many 1-cycles as possible; after all, you could just distribute $1$ reward on an arbitrary 1-cycle and $0$ reward for all other states. According to this “distribution”, only moving towards the 1-cycle is instrumentally convergent. ↩︎

  3. Power is not the same thing as num­ber of pos­si­bil­ities! Power is av­er­age at­tain­able util­ity; you might have a lot of pos­si­bil­ities, but not be able to choose be­tween them for a long time, which de­creases your con­trol over the (dis­counted) fu­ture.

    Also, re­mem­ber that we’re as­sum­ing du­al­is­tic agency: the agent can choose what­ever se­quence of ac­tions it wants. That is, there aren’t “pos­si­bil­ities” it’s un­able to take. ↩︎

  4. Informal definition of “power” suggested by Cohen et al. ↩︎

  5. We need to take care when ap­ply­ing the­o­rems to real life, es­pe­cially since the power-seek­ing the­o­rem as­sumes the state is fully ob­serv­able. Ob­vi­ously, this isn’t true in real life, but it seems rea­son­able to ex­pect the the­o­rem to gen­er­al­ize ap­pro­pri­ately. ↩︎

  6. I’ll talk more in fu­ture posts about why I presently think power-seek­ing is the worst part of goal-di­rected agency. ↩︎