The strategy-stealing assumption

Suppose that 1% of the world’s resources are controlled by unaligned AI, and 99% of the world’s resources are controlled by humans. We might hope that at least 99% of the universe’s resources end up being used for stuff-humans-like (in expectation).

Jessica Taylor argued for this conclusion in Strategies for Coalitions in Unit-Sum Games: if the humans divide into 99 groups, each of which acquires influence as effectively as the unaligned AI, then by symmetry each group should end up with as much influence as the AI, i.e. the groups should collectively end up with 99% of the influence.
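The symmetry argument can be illustrated with a toy simulation (not from Taylor's paper; the payoff distribution here is an arbitrary assumption for illustration). If 99 human groups each copy the unaligned AI's strategy in a unit-sum game, every player has the same expected share, so the AI's expected share is 1/100:

```python
import random

def unit_sum_game(strategies, rng):
    """Toy unit-sum game: each player's influence is proportional to the
    payoff its strategy happens to draw; total influence always sums to 1."""
    scores = [s(rng) for s in strategies]
    total = sum(scores)
    return [s / total for s in scores]

def ai_strategy(rng):
    # Some stochastic effectiveness at acquiring influence (illustrative).
    return rng.uniform(0.5, 1.5)

# 99 human groups each "steal" (i.e. copy) the AI's strategy.
players = [ai_strategy] * 100  # index 0 is the unaligned AI

rng = random.Random(0)
trials = 10_000
avg_ai_share = sum(unit_sum_game(players, rng)[0] for _ in range(trials)) / trials

# By symmetry every player expects a 1/100 share, so the 99 human
# groups collectively expect 99% of the influence.
print(round(avg_ai_share, 3))  # ≈ 0.01
```

The point of the simulation is only the symmetry: nothing distinguishes the AI's seat from any human group's seat, so no distribution of strategy payoffs can give the AI more than 1/100 in expectation.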

This argument rests on what I’ll call the strategy-stealing assumption: for any strategy an unaligned AI could use to influence the long-run future, there is an analogous strategy that a similarly-sized group of humans can use in order to capture a similar amount of flexible influence over the future. By “flexible” I mean that humans can decide later what to do with that influence — which is important since humans don’t yet know what we want in the long run.

Why might the strategy-stealing assumption be true?

Today there are a bunch of humans, with different preferences and different kinds of influence. Crudely speaking, the long-term outcome seems to be determined by some combination of {which preferences have how much influence?} and {what is the space of realizable outcomes?}.

I expect this to become more true over time — I expect groups of agents with diverse preferences to eventually approach efficient outcomes, since otherwise there are changes that every agent would prefer (though this is not obvious, especially in light of bargaining failures). Then the question is just about which of these efficient outcomes we pick.

I think that our actions don’t affect the space of realizable outcomes, because long-term realizability is mostly determined by facts about distant stars that we can’t yet influence. The obvious exception is that if we colonize space faster, we will have access to more resources. But quantitatively this doesn’t seem like a big consideration, because astronomical events occur over millions of millennia while our decisions only change colonization timelines by decades.

So I think our decisions mostly affect long-term outcomes by changing the relative weights of different possible preferences (or by causing extinction).

Today, one of the main ways that preferences have weight is because agents with those preferences control resources and other forms of influence. Strategy-stealing seems most possible for this kind of plan — an aligned AI can exactly copy the strategy of an unaligned AI, except the money goes into the aligned AI’s bank account instead. The same seems true for most kinds of resource gathering.

There are lots of strategies that give influence to other people instead of helping me. For example, I might preferentially collaborate with people who share my values. But I can still steal these strategies, as long as my values are just as common as the values of the person I’m trying to steal from. So a majority can steal strategies from a minority, but not the other way around.

There can be plenty of strategies that don’t involve acquiring resources or flexible influence. For example, we could have a parliament with obscure rules in which I can make maneuvers that advantage one set of values or another in a way that can’t be stolen. Strategy-stealing may only be possible at the level of groups — you need to retain the option of setting up a different parliamentary system that doesn’t favor particular values. Even then, it’s unclear whether strategy-stealing is possible.

There isn’t a clean argument for strategy-stealing, but I think it seems plausible enough that it’s meaningful and productive to think of it as a plausible default, and to look at ways it can fail. (If you found enough ways it could fail, you might eventually stop thinking of it as a default.)

Eleven ways the strategy-stealing assumption could fail

In this section I’ll describe some of the failures that seem most important to me, with a focus on the ones that would interfere with the argument in the introduction.

1. AI alignment

If we can build smart AIs, but not aligned AIs, then humans can’t necessarily use AI to capture flexible influence. I think this is the most important way in which strategy-stealing is likely to fail. I’m not going to spend much time talking about it here because I’ve spent so much time elsewhere.

For example, if smart AIs inevitably want to fill the universe with paperclips, then “build a really smart AI” is a good strategy for someone who wants to fill the universe with paperclips, but it can’t be easily stolen by someone who wants anything else.

2. Value drift over generations

The values of 21st century humans are determined by some complicated mix of human nature and the modern environment. If I’m a 16th century noble who has really specific preferences about the future, it’s not really clear how I can act on those values. But if I’m a 16th century noble who thinks that future generations will inevitably be wiser and should get what they want, then I’m in luck: all I need to do is wait and make sure our civilization doesn’t do anything rash. And if I have some kind of crude intermediate preferences, then I might be able to push our culture in appropriate directions or encourage people with similar genetic dispositions to have more kids.

This is the most obvious and important way that strategy-stealing has failed historically. It’s not something I personally worry about too much, though.

The big reason I don’t worry is some combination of common-sense morality and decision theory: our values are the product of many generations each giving way to the next one, and so I’m pretty inclined to “pay it forward.” Put a different way, I think it’s relatively clear I should empathize with the next generation since I might well have been in their place (whereas I find it much less clear under what conditions I should empathize with AI). Or from yet another perspective, the same intuition that says I’m more right than previous generations makes me very open to the possibility that future generations are more right still. This question gets very complex, but my first-pass take is that I’m maybe an order of magnitude less worried about this than about other kinds of value drift.

The small reason I don’t worry is that I think this dynamic is probably going to be less important in the future (unless we actively want it to be important — which seems quite possible). I believe there is a good chance that within 60 years most decisions will be made by machines, and so the handover from one generation to the next will be optional.

That all said, I am somewhat worried about more “out of distribution” changes to the values of future generations, in scenarios where AI development is slower than I expect. For example, I think it’s possible that genetic engineering of humans will substantially change what we want, and that I should be less excited about that kind of drift. Or I can imagine the interaction between technology and culture causing similarly alien changes. These questions are even harder to think about than the basic question of “how much should I empathize with future generations?” which already seemed quite thorny, and I don’t really know what I’d conclude if I spent a long time thinking. But at any rate, these things are not at the top of my priority queue.

3. Other alignment problems

AIs and future generations aren’t the only optimizers around. For example, we can also build institutions that further their own agendas. We can then face a problem analogous to AI alignment — if it’s easier to build effective institutions with some kinds of values than others, then those values could be at a structural advantage. For example, we might inevitably end up with a society that optimizes generalizations of short-term metrics, if big groups of humans are much more effective when doing this. (I say “generalizations of short-term metrics” because an exclusive focus on short-term metrics is the kind of problem that can fix itself over the very long run.)

I think that institutions are currently considerably weaker than humans (in the sense that’s relevant to strategy-stealing) and this will probably remain true over the medium term. For example:

  • A company with 10,000 people might be much smarter than any individual human, but mostly that’s because of its alliance with its employees and shareholders — most of its influence is just used to accumulate more wages and dividends. Companies do things that seem antisocial not because they have come unmoored from any human’s values, but because plenty of influential humans want them to do that in order to make more money. (You could try to point to the “market” as an organization with its own preferences, but it’s even worse at defending itself than bureaucracies — it’s up to humans who benefit from the market to defend it.)

  • Bureaucracies can seem unmoored from any individual human desire. But their actual ability to defend themselves and acquire resources seems much weaker than that of other optimizers like humans or corporations.

Overall I’m less concerned about this than AI alignment, but I do think it is a real problem. I’m somewhat optimistic that the same general principles will be relevant both to aligning institutions and AIs. If AI alignment weren’t an issue, I’d be more concerned by problems like institutional alignment.

4. Human fragility

If AI systems are aligned with humans, they may want to keep humans alive. Not only do humans prefer being alive, humans may need to survive if they want to have the time and space to figure out what they really want and to tell their AI what to do. (I say “may” because at some point you might imagine e.g. putting some humans in cold storage, to be revived later.)

This could introduce an asymmetry: an AI that just cares about paperclips can get a leg up on humans by threatening to release an engineered plague, or trashing natural ecosystems that humans rely on. (Of course, this asymmetry may also go the other way — values implemented in machines are reliant on a bunch of complex infrastructure, which may be more or less of a liability than humanity’s reliance on ecosystems.)

Stepping back, I think the fundamental long-term problem here is that “do what this human wants” is only a simple description of human values if you actually have the human in hand, and so an agent with these values does have a big extra liability.

I do think that the extreme option of “storing” humans to revive them later is workable, though most people would be very unhappy with a world where that becomes necessary. (To be clear, I think it almost certainly won’t.) We’ll return to this under “short-term terminal preferences” below.

5. Persuasion as fragility

If an aligned AI defines its values with reference to “whatever Paul wants,” then someone doesn’t need to kill Paul to mess with the AI — they just need to change what Paul wants. If it’s very easy to manipulate humans, but we want to keep talking with each other and interacting with the world despite the risk, then this extra attack surface could become a huge liability.

This is easier to defend against — just stop talking with people except in extremely controlled environments where you can minimize the risk of manipulation — but again humans may not be willing to pay that cost.

The main reason this might be worse than point 4 is that humans may be relatively happy to physically isolate themselves from anything scary, but it would be much more costly for us to cut ourselves off from contact with other humans.

6. Asymmetric persuasion

Even if humans are the only optimizers around, it might be easier to persuade humans of some things than others. For example, you could imagine a world where it’s easier to convince humans to endorse a simple ideology like “maximize the complexity of the universe” than to convince humans to pursue some more complex and subtle values.

This means that people with easily-persuadable values can use persuasion as a strategy, and people with other values cannot copy it.

I think this is ultimately more important than fragility, because it is relevant before we have powerful AI systems. It has many similarities to “value drift over generations,” and I have some mixed feelings here as well — there are some kinds of argument and deliberation that I certainly do endorse, and to the extent that my current views are the product of significant amounts of non-endorsed deliberation I am more inclined to be empathetic to future people who are influenced by increasingly-sophisticated arguments.

But as I described in section 2, I think these connections can get weaker as technological progress moves us further out of distribution, and if you told me that e.g. it was possible to perform a brute force search and find an argument that could convince someone to maximize the complexity of the future, I wouldn’t conclude that it’s probably fine if they decided to do that.

(Credit to Wei Dai for emphasizing this failure mode.)

7. Value-sensitive bargaining

If a bunch of powerful agents collectively decide what to do with the universe, I think it probably won’t look like “they all control their own slice of the universe and make independent decisions about what to do.” There will likely be opportunities for trade, they may have meddling preferences (where I care what you do with your part of the universe), there may be a possibility of destructive conflict, or it may look completely different in an unanticipated way.

In many of these settings the outcome is influenced by a complicated bargaining game, and it’s unclear whether the majority can steal a minority’s strategy. For example, suppose that there are two values X and Y in the world, with 99% X-agents and 1% Y-agents. The Y-agents may be able to threaten to destroy the world unless there is an even split, and the X-agents have no way to copy such a strategy. (This could also occur over the short term.)
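The asymmetry in this example can be made concrete with a toy payoff calculation (the numbers are illustrative assumptions, not from the post). If the Y-agents’ threat is credible, the X-agents’ best response is to concede half the universe despite holding 99% of the initial influence — and the X-majority cannot “steal” the threat, since destroying the world costs them almost everything they care about:

```python
# Hypothetical payoffs for the X-vs-Y threat game: X-agents start with a
# 0.99 share, Y-agents with 0.01. Y threatens destruction unless X agrees
# to an even split.

def x_payoff(accept_even_split: bool, threat_is_credible: bool) -> float:
    if accept_even_split:
        return 0.5  # X concedes half the universe
    # X refuses: keeps its 0.99 share unless the threat is carried out.
    return 0.0 if threat_is_credible else 0.99

# Against a credible threat, conceding dominates refusing:
best = max([True, False], key=lambda a: x_payoff(a, threat_is_credible=True))
print(best, x_payoff(best, True))  # True 0.5 — X accepts the even split
```

The point is that the threat strategy’s value depends on who uses it: it gains the Y-agents 0.49 of the universe, while the analogous threat by X-agents could gain them at most 0.01, so strategy-stealing fails at the level of this bargaining game.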

I don’t have a strong view about the severity of this problem. I could imagine it being a big deal.

8. Recklessness

Some preferences might not care about whether the world is destroyed, and therefore have access to productive but risky strategies that more cautious agents cannot copy. The same could happen with other kinds of risks, like commitments that are game-theoretically useful but risk sacrificing some part of the universe or creating long-term negative outcomes.

I tend to think about this problem in the context of particular technologies that pose an extinction risk, but it’s worth keeping in mind that it can be compounded by the existence of more reckless agents.

Overall I think this isn’t a big deal, because it seems much easier to cause extinction by trying to kill everyone than as an accident. There are fewer people who are in fact trying to kill everyone, but I think not enough fewer to tip the balance. (This is a contingent fact about technology though; it could change in the future, and I could easily be wrong even today.)

9. Short-term unity and coordination

Some actors may have long-term values that are easier to talk about, represent formally, or reason about. Relative to humans, AIs may be especially likely to have such values. These actors could have an easier time coordinating, e.g. by pursuing some explicit compromise between their values (rather than being forced to find a governance mechanism for some resources produced by a joint venture).

This could leave us in a place where e.g. an unaligned AI controls 1% of resources, but the majority of resources are controlled by humans who want to acquire flexible resources. Then the unaligned AIs can form a coalition which achieves very high efficiencies, while the humans cannot form 99 other coalitions to compete.

This could theoretically be a problem without AI — e.g. a large group of humans with shared explicit values might be able to coordinate better and so leave normal humans at a disadvantage — though I think this is relatively unlikely as a major force in the world.

The seriousness of this problem is bounded by both the efficiency gains for a large coalition, and the quality of governance mechanisms for different actors who want to acquire flexible resources. I think we have OK solutions for coordination between people who want flexible influence, such that I don’t think this will be a big problem:

  • The humans can participate in lotteries to concentrate influence. Or you can gather resources to be used for a lottery in the future, while still allowing time for people to become wiser and then make bargains about what to do with the universe before they know who wins.

  • You can divide up the resources produced by a coalition equitably (and then negotiate about what to do with them).

  • You can modify other mechanisms by allowing votes that could e.g. overrule certain uses of resources. You could have more complex governance mechanisms, delegate different kinds of authority to different systems, rely on trusted parties, etc.

  • Many of these procedures work much better amongst groups of humans who expect to have relatively similar preferences, or who reasonably trust the other participants to do something basically cooperative and friendly (rather than e.g. demanding concessions so that they don’t do something terrible with their share of the universe or if they win the eventual lottery).

(Credit to Wei Dai for describing and emphasizing this failure mode.)

10. Weird stuff with simulations

I think civilizations like ours mostly have an impact via the common-sense channel where we ultimately colonize space. But there may be many civilizations like ours in simulations of various kinds, and influencing the results of those simulations could also be an important part of what we do. In that case, I don’t have any particular reason to think strategy-stealing breaks down, but I think stuff could be very weird and I have only a weak sense of how this influences optimal strategies.

Overall I don’t think much about this, since it doesn’t seem likely to be a large part of our influence and it doesn’t break strategy-stealing in an obvious way. But I think it’s worth having in mind.

11. Other preferences

People care about lots of stuff other than their influence over the long-term future. If 1% of the world is unaligned AI and 99% of the world is humans, but the AI spends all of its resources on influencing the future while the humans only spend one tenth of theirs, it wouldn’t be too surprising if the AI ended up with 10% of the influence rather than 1%. This can matter in lots of ways other than literal spending and saving: someone who only cared about the future might make different tradeoffs, might be willing to defend themselves at the cost of short-term value (see sections 4 and 5 above), might pursue more ruthless strategies for expansion, and so on.
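The 10% figure above follows from simple arithmetic, under the simplifying assumption (mine, not a claim from the post) that long-run influence ends up proportional to resources actually spent on influencing the future:

```python
# Assume long-run influence is proportional to resources spent on the
# long-term future (a simplifying assumption for illustration).
ai_resources, human_resources = 0.01, 0.99
ai_spend_rate, human_spend_rate = 1.0, 0.1  # fraction devoted to the future

ai_spend = ai_resources * ai_spend_rate            # 0.01
human_spend = human_resources * human_spend_rate   # 0.099
ai_influence = ai_spend / (ai_spend + human_spend)

print(round(ai_influence, 3))  # 0.092 — roughly 10%, not 1%
```

So a tenfold difference in willingness to spend turns a 1% resource share into roughly a 9–10% influence share, which is the one-time advantage discussed below.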

I think the simplest approximation is to restrict attention to the part of our preferences that is about the long term (I discussed this a bit in Why might the future be good?). To the extent that someone cares about the long term less than the average actor, they will represent a smaller fraction of this “long-term preferences” mixture. This may give unaligned AI systems a one-time advantage for influencing the long-term future (if they care more about it) but doesn’t change the basic dynamics of strategy-stealing. Even this advantage might be clawed back by a majority (e.g. by taxing savers).

There are a few places where this picture seems a little bit less crisp:

  • Rather than being able to spend resources on either the short or long term, sometimes you might have preferences about how you acquire resources in the short term; an agent without such scruples could potentially pull ahead. If these preferences are strong, this probably violates strategy-stealing unless the majority can agree to crush anyone unscrupulous.

  • For humans in particular, it may be hard to separate out “humans as a repository of values” from “humans as an object of preferences,” and this may make it harder for us to defend ourselves (as discussed in sections 4 and 5).

I mostly think these complexities won’t be a big deal quantitatively, because I think our short-term preferences will mostly be compatible with defense and resource acquisition. But I’m not confident about that.


I think strategy-stealing isn’t really true; but I think it’s a good enough approximation that we can basically act as if it’s true, and then think about the risk posed by possible failures of strategy-stealing.

I think this is especially important for thinking about AI alignment, because it lets us formalize the lowered goalposts I discussed here: we just want to ensure that AI is compatible with strategy-stealing. These lowered goalposts are an important part of why I think we can solve alignment.

In practice I think that a large coalition of humans isn’t reduced to strategy-stealing — a majority can simply stop a minority from doing something bad, rather than copying it. The possible failures in this post could potentially be addressed by either a technical solution or some kind of coordination.