The “Commitment Races” problem


[Epistemic status: Strong claims vaguely stated and weakly held. I expect that writing this and digesting feedback on it will lead to a much better version in the future.]

This post attempts to generalize and articulate a problem that people have been thinking about since at least 2016. [Edit: 2009 in fact!] In short, here is the problem:

Consequentialists can get caught in commitment races, in which they want to make commitments as soon as possible. When consequentialists make commitments too soon, disastrous outcomes can sometimes result. The situation we are in (building AGI and letting it self-modify) may be one of these times unless we think carefully about this problem and how to avoid it.

For this post I use “consequentialists” to mean agents that choose actions entirely on the basis of the expected consequences of those actions. For my purposes, this means they don’t care about historical facts such as whether the options and consequences available now are the result of malicious past behavior. (I am trying to avoid trivial definitions of consequentialism according to which everyone is a consequentialist because e.g. “obeying the moral law” is a consequence.) This definition is somewhat fuzzy and I look forward to searching for more precision some other day.

Consequentialists can get caught in commitment races, in which they want to make commitments as soon as possible

Consequentialists are bullies; a consequentialist will happily threaten someone insofar as they think the victim might capitulate and won’t retaliate.

Consequentialists are also cowards; they conform their behavior to the incentives set up by others, regardless of the history of those incentives. For example, they predictably give in to credible threats unless reputational effects weigh heavily enough in their minds to prevent this.

In most ordinary circumstances the stakes are sufficiently low that reputational effects dominate: Even a consequentialist agent won’t give up their lunch money to a schoolyard bully if they think it will invite much more bullying later. But in some cases the stakes are high enough, or the reputational effects low enough, for this not to matter.

So, amongst consequentialists, there is sometimes a huge advantage to “winning the commitment race.” If two consequentialists are playing a game of Chicken, the first one to throw out their steering wheel wins. If one consequentialist is in a position to seriously hurt another, it can extract concessions from the second by credibly threatening to do so—unless the would-be victim credibly commits to not giving in first! If two consequentialists are attempting to divide up a pie or select a game-theoretic equilibrium to play in, the one that can “move first” can get much more than the one that “moves second.” In general, because consequentialists are cowards and bullies, the consequentialist who makes commitments first will predictably be able to massively control the behavior of the consequentialist who makes commitments later. As the folk theorem shows, this can even be true in cases where games are iterated and reputational effects are significant.
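
To make the Chicken example concrete, here is a minimal sketch in Python (the payoff numbers are illustrative choices of mine, not from any particular source): once one player has deleted “swerve” from their option set, the other player’s best response is to swerve.

```python
# Chicken with illustrative payoffs (entries are (player 0 payoff, player 1 payoff)).
PAYOFFS = {
    ("swerve", "swerve"): (0, 0),
    ("swerve", "straight"): (-1, 2),
    ("straight", "swerve"): (2, -1),
    ("straight", "straight"): (-10, -10),  # head-on crash
}

def best_response(options, opponent_action, player):
    """Best remaining option for `player`, given the opponent's (known) action."""
    def payoff(action):
        pair = (action, opponent_action) if player == 0 else (opponent_action, action)
        return PAYOFFS[pair][player]
    return max(options, key=payoff)

# Player 0 "wins the commitment race" by deleting "swerve" from their options.
p0_action = "straight"
p1_action = best_response(["swerve", "straight"], p0_action, player=1)
print(p1_action, PAYOFFS[(p0_action, p1_action)])  # swerve (2, -1): the committer gets the good outcome
```

Before the commitment, neither pure equilibrium is privileged; after it, the committed player’s preferred outcome is the only one the other player can rationally choose.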

Note: “first” and “later” in the above don’t refer to clock time, though clock time is a helpful metaphor for imagining what is going on. Really, what’s going on is that agents learn about each other, each on their own subjective timeline, while also making choices (including the choice to commit to things); the choices a consequentialist makes at subjective time t are cravenly submissive to the commitments they’ve learned about by t.

Logical updatelessness and acausal bargaining combine to create a particularly important example of a dangerous commitment race. There are strong incentives for consequentialist agents to self-modify to become updateless as soon as possible, and going updateless is like making a bunch of commitments all at once. Since real agents can’t be logically omniscient, one needs to decide how much time to spend thinking about things like game theory and the outputs of various programs before making commitments. When we add acausal bargaining into the mix, things get even more intense. Scott Garrabrant, Wei Dai, and Abram Demski have described this problem already, so I won’t say more about it here. Basically, in this context, there are many other people observing your thoughts and making decisions on that basis, so bluffing is impossible and there is constant pressure to make commitments quickly, before thinking longer. (That’s my take on it, anyway.)

Anecdote: Playing a board game last week, my friend Lukas said (paraphrasing) “I commit to making you lose if you do that move.” In rationalist gaming circles this sort of thing is normal and fun. But I suspect his gambit would be considered unsportsmanlike—and possibly outright bullying—by most people around the world, and my compliance would be considered cowardly. (To be clear, I didn’t comply. Practice what you preach!)

When consequentialists make commitments too soon, disastrous outcomes can sometimes result. The situation we are in may be one of these times.

This situation is already ridiculous: There is something very silly about two supposedly rational agents racing to limit their own options before the other one limits theirs. But it gets worse.

Sometimes commitments can be made “at the same time”—i.e. in ignorance of each other—in such a way that they lock in an outcome that is disastrous for everyone. (Think of both players in Chicken throwing out their steering wheels simultaneously.)
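
Continuing the illustrative Chicken payoffs from above: if both players delete “swerve” before learning of each other’s commitment, the only joint outcome left is the crash, which is worse for both than anything they could have reached without committing. A tiny self-contained check:

```python
# Both players committed "at the same time", i.e. in ignorance of each other.
# Same illustrative payoffs as before; only the crash outcome is still reachable.
PAYOFFS = {("straight", "straight"): (-10, -10)}
p0_options = {"straight"}  # steering wheel already thrown out
p1_options = {"straight"}  # likewise, committed before learning about player 0
outcomes = [PAYOFFS[(a0, a1)] for a0 in p0_options for a1 in p1_options]
print(outcomes)  # [(-10, -10)]: locked into an outcome that is disastrous for everyone
```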

Here is a somewhat concrete example: Two consequentialist AGIs think for a little while about game theory and commitment races and then self-modify to resist and heavily punish anyone who bullies them. Alas, they had slightly different ideas about what counts as bullying and what counts as a reasonable request—perhaps one thinks that demanding more than the Nash Bargaining Solution is bullying, and the other thinks that demanding more than the Kalai-Smorodinsky Bargaining Solution is bullying—so many years later they meet each other, learn about each other, and end up locked into all-out war.
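
To see how easily two reasonable-sounding fairness standards can diverge, here is a minimal numerical sketch (the disagreement point (0, 0) and the frontier u2 = 1 - u1^2 are my own illustrative choices): on this asymmetric bargaining problem the Nash and Kalai-Smorodinsky solutions pick different points, so an agent committed to punishing demands beyond one standard will classify the other standard’s demand as bullying.

```python
# Illustrative bargaining problem: disagreement point (0, 0),
# Pareto frontier u2 = 1 - u1**2 for u1 in [0, 1]. Grid-search both solutions.
N = 100_000
frontier = [(i / N, 1 - (i / N) ** 2) for i in range(N + 1)]

# Nash Bargaining Solution: maximize the product of gains over the disagreement point.
nash = max(frontier, key=lambda p: p[0] * p[1])

# Kalai-Smorodinsky Solution: the frontier point where each player's gain is the
# same fraction of their best feasible gain (both maxima are 1 here, so u1 = u2).
u1_max = max(p[0] for p in frontier)
u2_max = max(p[1] for p in frontier)
ks = min(frontier, key=lambda p: abs(p[0] / u1_max - p[1] / u2_max))

print(f"Nash: u1 = {nash[0]:.3f}, u2 = {nash[1]:.3f}")  # roughly (0.577, 0.667)
print(f"KS:   u1 = {ks[0]:.3f}, u2 = {ks[1]:.3f}")      # roughly (0.618, 0.618)
```

Player 1’s KS share (about 0.618) exceeds their Nash share (about 0.577), and player 2’s Nash share (about 0.667) exceeds their KS share, so whichever point is demanded, one of the two punishment triggers fires.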

I’m not saying disastrous AGI commitments are the default outcome; I’m saying the stakes are high enough that we should put a lot more thought into preventing them than we have so far. It would really suck if we created a value-aligned AGI that ends up getting into all sorts of fights across the multiverse with other value systems. We’d wish we had built a paperclip maximizer instead.

Objection: “Surely they wouldn’t be so stupid as to make those commitments—even I could see that bad outcome coming. A better commitment would be...”

Reply: The problem is that consequentialist agents are motivated to make commitments as soon as possible, since that way they can influence the behavior of other consequentialist agents who may be learning about them. Of course, they will balance this motivation against the countervailing motive to learn more and think more before doing drastic things. The problem is that the first motivation will push them to make commitments much sooner than would otherwise be optimal. So they might not be as smart as us when they make their commitments, at least not in all the relevant ways. Even if our baby AGIs are wiser than us, they might still make mistakes that we haven’t anticipated yet. The situation is like the centipede game: Collectively, consequentialist agents benefit from learning more about the world and each other before committing to things. But because they are all bullies and cowards, they individually benefit from committing earlier, when they don’t know so much.
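
The centipede analogy can be made concrete with a short backward-induction sketch (the pot sizes below are my own illustrative numbers): even though both players would get far more by waiting, each mover’s individually best choice, given that the other will grab later, is to grab now.

```python
# Backward induction in a short, illustrative centipede game.
# At node i the mover can "take" (ending the game with payoffs pots[i] = (mover, other))
# or "pass" (the pot grows and the other player becomes the mover at node i + 1).
pots = [(3, 1), (6, 2), (12, 4), (24, 8), (48, 16), (96, 32)]

def solve(i):
    """Return (action, payoff to the mover at node i, payoff to the other player)."""
    take_mover, take_other = pots[i]
    if i == len(pots) - 1:
        return "take", take_mover, take_other
    # If the mover passes, roles swap at the next node.
    _, next_mover, next_other = solve(i + 1)
    pass_mover, pass_other = next_other, next_mover
    if take_mover >= pass_mover:
        return "take", take_mover, take_other
    return "pass", pass_mover, pass_other

print(solve(0))  # ('take', 3, 1): the game ends at once, far short of the (96, 32) available later
```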

Objection: “Threats, submission to threats, and costly fights are rather rare in human society today. Why not expect this to hold in the future, for AGI, as well?”

Reply: Several points:

1. Devastating commitments (e.g. “Grim Trigger”) are much more possible with AGI—just alter the code! Inigo Montoya is a fictional character and even he wasn’t able to summon lifelong commitment on a whim; it had to be triggered by the brutal murder of his father.

2. Credibility is also much easier to achieve, especially in an acausal context (see above).

3. Some AGI bullies may be harder to retaliate against than humans, lowering their disincentive to make threats.

4. AGI may not have sufficiently strong reputation effects in the sense relevant to consequentialists, partly because threats can be made more devastating (see above) and partly because they may not believe they exist in a population of other powerful agents who will bully them if they show weakness.

5. Finally, these terrible things (brutal threats, costly fights) do happen to some extent even among humans today—especially in situations of anarchy. Hopefully the AGIs we build will be less likely than humans to do these things.

Objection: “Any AGI that falls for this commit-now-before-the-others-do argument will also fall for many other silly do-X-now-before-it’s-too-late arguments, and thus will be incapable of hurting anyone.”

Reply: That would be nice, wouldn’t it? Let’s hope so, but not count on it. Indeed, perhaps we should look into whether there are other arguments of this form that we should worry about our AI falling for...

Anecdote: A friend of mine, when she was a toddler, would threaten her parents: “I’ll hold my breath until you give me the candy!” Imagine how badly things would have gone if she had been physically capable of making arbitrary credible commitments. Meanwhile, a few years ago when I first learned about the concept of updatelessness, I resolved to be updateless from that point onwards. I am now glad that I couldn’t actually commit to anything then.

Conclusion

Overall, I’m not certain that this is a big problem. But it feels to me that it might be, especially if acausal trade turns out to be a real thing. I would not be surprised if “solving bargaining” turns out to be even more important than value alignment, because the stakes are so high. I look forward to a better understanding of this problem.

Many thanks to Abram Demski, Wei Dai, John Wentworth, and Romeo Stevens for helpful conversations.