Generalizing the Power-Seeking Theorems

Previously: Seeking Power is Often Provably Instrumentally Convergent in MDPs.

Thanks to Rohin Shah, Michael Dennis, Josh Turner, and Evan Hubinger for comments.


It sure seems like gaining power over the environment is instrumentally convergent (optimal for a wide range of agent goals). You can turn this into math and prove things about it. Given some distribution over agent goals, we want to be able to formally describe how optimal action tends to flow through the future.

Does gaining money tend to be optimal? Avoiding shutdown? When? How do we know?

Optimal Farsighted Agents Tend to Seek Power proved that, when you distribute reward fairly and evenly across states (IID), it’s instrumentally convergent to gain access to lots of final states (which are absorbing, in that the agent keeps on experiencing the final state). The theorems apply when you don’t discount the future (you’re “infinitely farsighted”).

Most reward functions for the Pac-Man game incentivize not dying immediately, so that the agent can loop around higher-scoring configurations.
Many ways of scoring Tic-Tac-Toe game states incentivize not losing immediately, in order to choose the highest-scoring final configuration.
All states have self-loops, left hidden to reduce clutter.

In the example environment from AI: A Modern Approach (3e), the agent starts in one state and receives reward for reaching another. The optimal policy for this reward function avoids a third state, and one might suspect that avoiding that state is instrumentally convergent. However, a skeptic might provide a reward function for which navigating to that state is optimal, and then argue that “instrumental convergence” is subjective and that there is no reasonable basis for concluding that the state is generally avoided.

We can do better… for any way of independently and identically distributing reward over states, most reward functions have farsighted optimal policies which avoid that state. If we complicate the MDP with additional terminal states, this fraction further approaches 1.

If we suppose that the agent will be forced into that state unless it takes preventative action, then preventative policies are optimal for most farsighted agents – no matter how complex the preventative action. Taking the state to represent shutdown, we see that avoiding shutdown is instrumentally convergent in any MDP representing a real-world task and containing a shutdown state. We argue that this is a special case of a more general phenomenon: optimal farsighted agents tend to seek power.

~ Optimal Farsighted Agents Tend to Seek Power
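
To make the quoted claim concrete, here is a minimal sketch (mine, not from the paper) of the γ = 1 case: sample IID rewards over a handful of absorbing terminal states, and count how often the optimal farsighted choice is the action that keeps access to more of them. The state counts and the uniform reward distribution are arbitrary choices for illustration.

```python
import numpy as np

# Hypothetical setup (not from the paper): from the start state, action "narrow"
# gives access to 1 absorbing terminal state, action "wide" gives access to 4.
# An infinitely farsighted agent (gamma = 1) only cares about which terminal
# state it ends up looping in forever, so we just compare the best reachable
# terminal reward under each action.
rng = np.random.default_rng(0)
n_samples, n_narrow, n_wide = 100_000, 1, 4

# IID uniform[0, 1] reward over the 5 terminal states; one row per sampled reward function.
rewards = rng.random((n_samples, n_narrow + n_wide))
narrow_best = rewards[:, :n_narrow].max(axis=1)
wide_best = rewards[:, n_narrow:].max(axis=1)

print((wide_best > narrow_best).mean())  # ~0.8: "wide" is optimal for about 4/5 of reward functions
```

The estimate sits near 4/5 because the best of the five terminal states is equally likely to be any one of them, so it lies behind the “wide” action in proportion to how many final states that action reaches.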

While it’s good to understand the limiting case, what if the agent, you know, isn’t infinitely farsighted? That’s a pretty unrealistic assumption. Eventually, we want this theory to help us predict what happens after we deploy RL agents with high-performing policies in the real world.

Normal amounts of sightedness

But what if we care about the journey? What if γ < 1?

We can view Frank as traversing a Markov decision process, navigating between states with his actions:

Reward is IID, so the gold-heap state doesn’t have an intrinsically more generous reward distribution than the castle-and-dragon state.

It sure seems like Frank is more likely to start with the blue or green gems. Those give him way more choices along the way, after all. But the previous theorems only said “at γ = 0, he’s equally likely to pick each gem. At γ = 1, he’s equally likely to end up in each terminal state”.

Not helpful.

Let me tell you, finding the probability that one tangled web of choices is optimal over another web is generally a huge mess. You’re finding the measure of reward functions which satisfy some messy system of inequalities: one discounted-return comparison for every way the rest of the web could have been navigated instead.

And that’s in the simple tiny environments!
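
To make “measure of reward functions” concrete, here is a rough Monte Carlo sketch (mine, not from the paper): sample IID reward functions for a made-up gem environment in the spirit of Frank’s, compute the best discounted return behind each gem, and count how often each gem is the optimal first choice. Everything here (the state names, the uniform reward distribution, γ = 0.5) is an illustrative assumption.

```python
import numpy as np

# Invented toy environment: from "start", pick a gem; each gem leads to its own
# absorbing terminal states (the agent then self-loops there forever).
TERMINALS = {
    "blue":  ["b1", "b2", "b3"],   # the blue gem leaves the most final states open
    "green": ["g1", "g2"],
    "red":   ["r1"],               # the red gem leaves the fewest
}
STATES = ["start", "blue", "green", "red"] + [t for ts in TERMINALS.values() for t in ts]

def best_return(reward, gem, gamma):
    """Best discounted return from 'start' after committing to this gem:
    R(start) + gamma * R(gem) + gamma^2 * R(best terminal) / (1 - gamma)."""
    best_terminal = max(reward[t] for t in TERMINALS[gem])
    return reward["start"] + gamma * reward[gem] + gamma**2 * best_terminal / (1 - gamma)

def optimality_frequencies(gamma, n_samples=100_000, seed=0):
    """Estimate how often each gem is the optimal first choice under
    IID uniform[0, 1] reward (ties have measure zero here)."""
    rng = np.random.default_rng(seed)
    wins = {gem: 0 for gem in TERMINALS}
    for _ in range(n_samples):
        reward = {s: rng.random() for s in STATES}   # one IID reward function
        returns = {gem: best_return(reward, gem, gamma) for gem in TERMINALS}
        wins[max(returns, key=returns.get)] += 1
    return {gem: wins[gem] / n_samples for gem in TERMINALS}

print(optimality_frequencies(gamma=0.5))
# Roughly: blue > green > red, since more reachable final states means more
# chances for one of them to have drawn a high reward.
```

In this toy setup the estimates drift toward uniform as γ → 0 (only the gems’ own IID rewards matter) and toward the terminal-state shares as γ → 1 (only the best reachable final state matters), matching the endpoint claims the earlier theorems give.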

How do we reason about instrumental convergence – how do we find those sets of trajectories which are more likely to be optimal for a lot of reward functions?

We exploit symmetries.

There exists a graph isomorphism between this blue-gem-subgraph and the red-gem-graph, such that the isomorphism leaves Frank where he is.

The blue gem makes available all of the same options as the red gems, and then some. Since the blue gem gives you strictly more options, it’s strictly more likely to be optimal! When you toss back in the green gem, avoiding the red gems becomes yet more likely.

So, we can prove that for all γ, most agents don’t choose the red gems. Agents are more likely to pick blue than red. Easy.

Plus, this reasoning mirrors why we think instrumental convergence exists to begin with:

Sure, the goal could incentivize immediately initiating shutdown procedures. But if you stay active, you could still shut down later, plus there are all these other states the agent might be incentivized to reach.

This extends further. If the symmetry occurs twice over, then you can conclude the agent is at least twice as likely to do the instrumentally convergent thing.
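
Here is a rough sketch of the kind of symmetry being exploited: a brute-force check of whether the option graph behind one choice embeds into the option graph behind another via a map that fixes the agent’s current state (the isomorphism-onto-a-subgraph condition from the blue-vs-red comparison). The graphs and names are invented for illustration, and trying every permutation only makes sense for tiny graphs; this also only checks the graph-theoretic condition, not the non-domination bookkeeping from the appendix.

```python
from itertools import permutations

def embeds_into(small, big, fixed):
    """True if `small` is isomorphic to a subgraph of `big` via an injective map
    that sends `fixed` to itself. Graphs are dicts: node -> set of successor nodes,
    and `fixed` is assumed to be a node of both. Brute force; tiny graphs only."""
    small_nodes = [n for n in small if n != fixed]
    big_nodes = [n for n in big if n != fixed]
    if len(small_nodes) > len(big_nodes):
        return False
    for image in permutations(big_nodes, len(small_nodes)):
        phi = dict(zip(small_nodes, image))
        phi[fixed] = fixed
        # Every edge u -> v of `small` must map to an edge phi(u) -> phi(v) of `big`.
        if all(phi[v] in big[phi[u]] for u in small for v in small[u]):
            return True
    return False

# Invented option graphs rooted at Frank's current state, "start":
red_graph  = {"start": {"red"},  "red":  {"r1"}, "r1": set()}
blue_graph = {"start": {"blue"}, "blue": {"b1", "b2", "b3"},
              "b1": set(), "b2": set(), "b3": set()}

print(embeds_into(red_graph, blue_graph, fixed="start"))   # True: blue offers everything red does
print(embeds_into(blue_graph, red_graph, fixed="start"))   # False: and then some
```

When the embedding goes one way but not the other, the first choice’s options are strictly contained in the second’s, which is exactly the situation where the argument above says the second choice is strictly more likely to be optimal.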

Relaxation summary

My initial work made a lot of simplifying assumptions:

  • The agents are infinitely farsighted: they care about average reward over time, and don’t prioritize the present over the future.

    • Relaxed. See above.

  • The environment is deterministic.

    • Relaxed. The paper is already updated to handle stochastic environments. The new techniques in this post also generalize straightforwardly.

  • Reward is distributed IID over states, where each state’s reward distribution is bounded and continuous.

    • Relaxed. We can immediately toss out boundedness, as none of our reasoning about instrumental convergence relies on it. It just ensured certain unrelated equations were well-defined.

    • With a bit of work, I could probably toss out continuity in general (and instead require only non-degeneracy), but I haven’t done that yet.

    • If you can prove instrumental convergence under IID reward, and then you have another reward function distribution D which improves reward for instrumentally convergent trajectories while worsening reward for already-unlikely trajectories, then there’s also instrumental convergence under D.

      • For example, if you double reward in instrumentally convergent states and halve it in unlikely states, then you still have instrumental convergence.

  • The environment is Markov.

    • Relaxed. n-step Markovian environments are handled by conversion into isomorphic Markov environments.

  • The agent is optimal.

  • The environment is finite and fully observable.

The power-seeking theorems apply to: infinitely farsighted optimal policies in finite deterministic MDPs with respect to reward distributed independently, identically, continuously, and boundedly over states.

Conclusion

We now have a few formally correct strategies for showing instrumental convergence, or lack thereof.

  • In deterministic environments, there’s no instrumental convergence at γ = 0 for IID reward.

  • When 0 < γ < 1, you’re strictly more likely to navigate to parts of the future which give you strictly more options (in a graph-theoretic sense). Plus, these parts of the future give you strictly more power.

  • When γ = 1, it’s instrumentally convergent to access a wide range of terminal states.

    • This can be seen as a special case of having “strictly more options”, but you no longer require an isomorphism on the paths leading to the terminal states.

Appendix: Proofs

This work builds off of my initial paper on power-seeking; I’ll refer to that as [1].

Definition. Let f be a visitation distribution of state s, and let F be a set of visitation distributions containing f. P_D(f optimal in F, γ) denotes the measure of reward functions, under distribution D, for which f is optimal in F at discount rate γ.

Non-dominated visitation distributions have positive measure and “take” positive measure from every other non-dominated visitation distribution.

Lemma 1. If f is non-dominated, then P_D(f optimal in F, γ) > 0 for all γ. Furthermore, f takes positive measure from every other non-dominated visitation distribution, for all F containing f.

Proof. The first claim was proven in [1]. The second claim follows by observing that visitation distributions which are non-dominated with respect to the full set of visitation distributions are also non-dominated with respect to subsets (as taking subsets winnows the set of constraints). Then, use the fact that non-dominated visitation distributions always have positive measure (in particular, with respect to the smaller comparison set). QED.

Discount rates

Definition. The graph induced by a set F of visitation distributions consists of the states visited and actions taken by at least one of the policies generating the visitation distributions. This is also referred to as the F-graph.

Let F_blue be the visitation distributions through the blue gems, and F_red those through the red.

Theorem 2 [Strictly more meaningful options means strict instrumental convergence and strict power increase]. Let F_blue, F_red be subsets of non-dominated visitation distributions. If the F_red-graph is isomorphic to a subgraph of the F_blue-graph, such that the isomorphism fixes the agent’s current state, then the measure of reward functions for which F_red contains an optimal visitation distribution is no greater than the corresponding measure for F_blue, for all γ. If the subgraph of the F_blue-graph is strict, then so is the inequality.

Furthermore, the current state is at least as powerful in the F_blue-graph as in the F_red-graph. If the subgraph of the F_blue-graph is strict, then it is strictly more powerful in the F_blue-graph.

Proof. The claim follows from symmetry; measure must be invariant to state relabelling, because reward is IID. The strict inequality follows from lemma 1: adding another non-dominated visitation distribution must strictly increase the measure for F_blue and strictly decrease the measure for F_red.

Similarly, the first power claim follows from symmetry. Adding another non-dominated visitation distribution must strictly increase the power [1]. QED.

Reward distribution generalization

We derive a sufficient condition for instrumental convergence for (certain) non-IID reward function distributions.

Definition. Distribution X (with CDF F_X) dominates distribution Y (with CDF F_Y) when F_X(t) ≤ F_Y(t) for all t (when F_X minorizes F_Y). Similarly, distribution X (with CDF F_X) is dominated by distribution Y (with CDF F_Y) when F_X(t) ≥ F_Y(t) for all t (when F_X majorizes F_Y).

The following insight is simple: if you can prove instrumental convergence under IID reward, and then you have another reward function distribution D which improves reward for instrumentally convergent trajectories while worsening reward for already-unlikely trajectories, then there’s also instrumental convergence under D.

For example, if you double reward in instrumentally convergent states and halve it in unlikely states, then you still have instrumental convergence.

The logic goes:

Suppose avoiding shutdown was instrumentally convergent for this more generous IID distribution, but realistic distributions are far less likely to reward shutdown, and a few other trajectories are even more likely to be rewarded. Then it’s still instrumentally convergent to avoid shutdown for this more realistic task-based distribution we have in mind.
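
As a quick sanity check of the dominance condition on the “double vs. halve” example, here is a small sketch (my own, not from the paper) that compares empirical CDFs on a grid, where X dominating Y means F_X stays at or below F_Y everywhere. The uniform base distribution, sample count, and tolerance are arbitrary choices.

```python
import numpy as np

def empirical_cdf(samples, grid):
    """Fraction of samples <= each grid point."""
    samples = np.sort(samples)
    return np.searchsorted(samples, grid, side="right") / len(samples)

def dominates(samples_x, samples_y, grid):
    """Crude check of first-order stochastic dominance: X dominates Y
    when F_X(t) <= F_Y(t) for every t (here: every grid point, up to sampling noise)."""
    return bool(np.all(empirical_cdf(samples_x, grid) <= empirical_cdf(samples_y, grid) + 1e-3))

rng = np.random.default_rng(0)
base = rng.random(100_000)          # stand-in for an IID uniform[0, 1] state reward
grid = np.linspace(0.0, 2.0, 201)

print(dominates(2 * base, base, grid))    # True: doubled reward dominates the original
print(dominates(0.5 * base, base, grid))  # False: halved reward does not...
print(dominates(base, 0.5 * base, grid))  # True: ...it is dominated by it
```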

Theorem 3. Let F, F′ be sets of visitation distributions. Suppose that under reward function distribution D_IID, F is optimal with at least as much measure as F′ at discount rate γ.

If all f ∈ F have dominant return distributions at discount rate γ under distribution D compared to under D_IID, and all f′ ∈ F′ have dominated return distributions at discount rate γ under distribution D compared to under D_IID, then F is optimal with at least as much measure as F′ under D.

Proof. Consider the process of starting with the initial return distributions, and iteratively swapping them to their more generous counterparts. If any such swap strictly changes the measure under which some f ∈ F is optimal, it strictly increases the measure for F and strictly decreases the measure for F′ by lemma 1. Clearly such a swap cannot strictly decrease the measure for F.

Similar logic applies to the less generous return distributions for F′ under D. QED.