A distillation of my understanding of the commitment races problem.

Greaser Courage

It’s 1950 and you’re a greaser in a greaser gang. For fun, you and your rivals drive your heavy American cars (without seatbelts) at each other at speed, and see who swerves first. Whoever swerves first is a coward, a “chicken”; their opponent is courageous and the victor. Both drivers swerving is less humiliating than just you swerving. If neither driver swerves, both die.

The bolts on your steering wheel have been loosened in the chop shop, and so your steering wheel can be removed if you pull it out towards you.

If you remove your steering wheel and throw it prominently out your window, then your opponent will see this and realize that you are now incapable of swerving. They will then swerve, as they prefer humiliation to death. This wins you glory that you live to see.

But if you both simultaneously throw your steering wheels out the window, then neither of you will be able to swerve, and both of you will die.

Commitment Races

The two greasers above are thus in a situation where each can individually do better by throwing out their steering wheel quickly, but both fare worse if both adopt this strategy. Both-drivers-having-their-steering-wheels is a commons that each greaser can take from, but both do poorly if the commons is depleted through overuse. Greasers with unloosened steering wheels don’t share such a commons; it exists only because each greaser can commit ahead of time to not swerving. Because the greasers are both itching to commit first to not swerving, we say that greasers playing chicken with loosened steering wheels are in a commitment race with each other.
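The structure of the game can be sketched as a small payoff table. The numbers below are assumptions chosen only to respect the ordering in the story (winning beats both swerving, which beats swerving alone, which beats dying), and the names `PAYOFF`, `best_response`, and `outcome` are hypothetical.

```python
SWERVE, STRAIGHT = "swerve", "straight"

# PAYOFF[(my_move, their_move)] -> my payoff.
# Assumed illustrative numbers; only their ordering comes from the story.
PAYOFF = {
    (STRAIGHT, SWERVE): 3,        # I win glory
    (SWERVE, SWERVE): 1,          # both swerve: mild humiliation
    (SWERVE, STRAIGHT): 0,        # I alone am "chicken"
    (STRAIGHT, STRAIGHT): -100,   # crash: both die
}


def best_response(their_move):
    """My payoff-maximizing move once the opponent's move is known."""
    return max((SWERVE, STRAIGHT), key=lambda mine: PAYOFF[(mine, their_move)])


def outcome(i_threw_wheel, they_threw_wheel):
    """Payoffs (mine, theirs) when a thrown wheel forces STRAIGHT and a
    driver who still has a wheel best-responds to the visible commitment."""
    if i_threw_wheel and they_threw_wheel:
        mine, theirs = STRAIGHT, STRAIGHT
    elif i_threw_wheel:
        mine, theirs = STRAIGHT, best_response(STRAIGHT)
    elif they_threw_wheel:
        mine, theirs = best_response(STRAIGHT), STRAIGHT
    else:
        return None  # ordinary chicken: no commitment settles the outcome
    return PAYOFF[(mine, theirs)], PAYOFF[(theirs, mine)]


print(outcome(True, False))  # the lone committer wins
print(outcome(True, True))   # both commit: both crash
```

Under these assumed payoffs, committing alone gets the best outcome and committing together gets the worst, which is the race described above.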

Besides throwing out steering wheels, other kinds of actions allow agents to precommit, in view of their opponents, to how they will act later. If you can alter your source code so that you will definitely pass up a somewhat desirable contract if your ideal terms aren’t met, you’ll be offered better contracts than agents that can’t alter their source code.

Suppose Alice the human and Bot the human-intelligence AGI are trading with each other. Bot has edit access to his own source code; Alice does not have access to hers. Before beginning any trading negotiations with Alice, Bot makes sure to modify his own source code so that he won’t accept less than almost all of the value pie. Bot then shows this self-modification to Alice before going to trade. Alice will now offer Bot a trade where she gets almost nothing and Bot gets almost everything, because Alice still prefers a trade where she gets almost nothing to one where she gets literally nothing.
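Alice's reasoning can be sketched as a best-response calculation. This is a toy model under stated assumptions, not a claim about how the trade is actually structured: the pie is a hypothetical 10 indivisible units, Alice proposes Bot's share, Bot accepts exactly when the share meets his visible minimum, and a rejected offer leaves both with nothing. The names `PIE` and `alice_best_offer` are invented for illustration.

```python
PIE = 10  # hypothetical pie size, in whole units


def alice_best_offer(bot_min_share, pie=PIE):
    """Return (share offered to Bot, Alice's payoff) when Alice maximizes her
    own payoff against a Bot who accepts iff share >= bot_min_share."""

    def alice_payoff(share_to_bot):
        accepted = share_to_bot >= bot_min_share
        return pie - share_to_bot if accepted else 0

    best_share = max(range(pie + 1), key=alice_payoff)
    return best_share, alice_payoff(best_share)


# Assumed baseline: a Bot that would accept any positive share.
print(alice_best_offer(bot_min_share=1))
# A Bot that has self-modified to reject anything below 9 of 10 units.
print(alice_best_offer(bot_min_share=9))
```

In this sketch the self-modified Bot receives 9 units instead of 1, even though the only thing that changed is that he removed his own ability to accept smaller shares.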

Something perverse has happened here! Before self-modifying, Bot had a huge panoply of possible contracts that he could agree to or reject. After self-modifying, Bot had strictly fewer options available to him. Bot did better in life by throwing away options that he previously had! Alice and Bot entered into a trade relationship because they both understand the notion of positive-sum interactions; they’re both smart, sophisticated agents. Alice brought all those smarts with her into the trading room. Bot threw some of his options away and made his future self effectively dumber.

Usually, when we study rationality, we find that intelligence is good for finding and choosing the best option out of a big set of alternatives. Smart rationalists usually want more options, because a larger set is more likely to include an even better alternative, and they then want their reasoning to steer the final decision after sifting through those alternatives. Being smarter is ordinarily good for quickly and accurately sifting through larger option spaces. In a commitment race, being smarter and using that to sift through options is a losing move. Being smart in a commitment race is, unusually, a losing position: you win to the extent that you can make yourself dumb, fast.

Eliezer:
IMO, commitment races only occur between agents who will, in some sense, act like idiots, if presented with an apparently ‘committed’ agent. If somebody demands $6 from me in the Ultimatum game, threatening to leave us both with $0 unless I offer at least $6 to them… then I offer $6 with slightly less than ⁵⁄₆ probability, so they do no better than if they demanded $5, the amount I think is fair. They cannot evade that by trying to make some ‘commitment’ earlier than I do. I expect that, whatever is the correct and sane version of this reasoning, it generalizes across all the cases.
I am not locked into warfare with things that demand $6 instead of $5. I do not go around figuring out how to invert their utility function for purposes of threatening them back - ‘destroy all utility-function inverters (but do not invert their own utility functions)’ was my guessed commandment that would be taught to kids in dath ilan, because you don’t want reality to end up full of utility-function inverters.
From the beginning, I invented timeless decision theory because of being skeptical that two perfectly sane and rational hyperintelligent beings with common knowledge about each other would have no choice but mutual defection in the oneshot prisoner’s dilemma. I suspected they would be able to work out Something Else Which Is Not That, so I went looking for it myself. I suggest cultivating the same suspicion with respect to the imagination of commitment races between Ultimatum Game players, in which whoever manages to make some move logically first walks away with $9 and the other poor agent can only take $1 - especially if you end up reasoning that the computationally weaker agent should be the winner.

Eliezer’s argument above is that rational agents use precommitments to shape their outward-facing incentives profile so that all agents trading with them are incentivized to offer Shapley splits of the value pie. On Eliezer’s view, one should precommit such that any agent like Bot will get a lower EV by offering you almost nothing than by offering you the Shapley split.
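The arithmetic behind the "slightly less than ⁵⁄₆" figure can be sketched as follows. If a demand of $6 is met with probability p, the demander's expected value is 6p, and 6p < 5 exactly when p < ⁵⁄₆. The generalization to other demands (meet a demand d with probability just under fair/d) is an assumption extrapolated from the quoted example, and the function names and the `epsilon` margin are hypothetical.

```python
from fractions import Fraction


def acceptance_probability(demand, fair_share, epsilon=Fraction(1, 1000)):
    """Probability of giving in to `demand` so that an unfair demand earns
    the demander strictly less, in expectation, than the fair share would.
    Assumes epsilon is smaller than fair_share / demand."""
    demand, fair_share = Fraction(demand), Fraction(fair_share)
    if demand <= fair_share:
        return Fraction(1)  # fair or generous demands are always met
    return fair_share / demand - epsilon


def demander_expected_value(demand, fair_share, epsilon=Fraction(1, 1000)):
    """Expected payoff to the demander, who gets nothing when refused."""
    return Fraction(demand) * acceptance_probability(demand, fair_share, epsilon)


p = acceptance_probability(6, 5)
print(p < Fraction(5, 6))                  # True: slightly less than 5/6
print(demander_expected_value(6, 5) < 5)   # True: demanding $6 earns less than $5
print(demander_expected_value(5, 5) == 5)  # True: the fair demand is always met
```

Exact fractions are used so that the strict inequality is not blurred by floating-point rounding.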

Bot has read enough Eliezer on LessWrong to know how Eliezer will precommit. Knowing this, Bot does best by offering Eliezer Shapley splits, and by accepting Eliezer’s Shapley offers. Because Bot knows what Eliezer is like, Bot has no reason to self-modify before trading with Eliezer. Bot happily rips off Alice, but doesn’t rip off Eliezer.

But what if Bot doesn’t yet know much about Eliezer? If Bot reasons that self-modifying is worthwhile in expectation before he reasons about what Eliezer is like as a trader, then Bot will end up offering Eliezer almost nothing anyway. Bot would have moved first in ‘logical time,’ and so Eliezer’s precommitment footwork wouldn’t have logically subjunctively prevented Bot’s self-modification. Moreover, if an as-yet ignorant Bot has some premonition that learning more about Eliezer will make him encounter truths he’d rather not encounter, Bot can self-modify on the basis of that premonition, before risking reading up on Eliezer on LessWrong. Note that it would not be sufficient for Bot to self-modify to forget any inconvenient truths he had learned about Eliezer. Bot’s self-modifications really do have to precede Bot’s understanding of Eliezer, as Eliezer’s incentives profile will leave a forgot-and-then-self-modified Bot worse off.

Bot will only win in a commitment race with Eliezer if Bot self-modifies for the wrong reason, in advance of understanding why self-modification is valuable. Bot, if successful, acted on mere premonitions in his prior reasoning about self-modification. Bot got to, and could only get to, a winning state in the logical mental game against Eliezer “by accident.”

Bot will be more and more tempted to self-modify the more he learns about bargaining and decision theory. But Bot will be more and more averse to self-modification against Eliezer as he improves his Eliezer model. If Eliezer takes pains to be easy to model, then Bot will become averse to self-modifying against Eliezer sooner in Bot’s reflection. But because Eliezer would never precommit to probably turn down a rock with an un-Shapley offer painted on its front (because non-agents bearing fixed offers created ex nihilo cannot be deterred or made less likely through any precommitment), there’s always some state for Bot to stumble into in his path of reflection and self-modification where Bot comes out on top.^[1]

^[1] H/t James Faville for this idea.
