[Question] Game theory of “Nuclear Prisoner’s Dilemma”—on nuking rocks

CronoDAS12 May 2025 11:07 UTC

11 points

Game Theory Prisoner's Dilemma Functional Decision Theory Decision theory Rationality Blackmail / Extortion

Eliezer Yudkowsky wrote, in a place where I can’t ask a follow-up question:

A rational agent should always do at least as well for itself as a rock, unless it’s up against some other agent that specifically wants to punish particular decision algorithms and will pay costs itself to do that; just doing what a rock does isn’t very expensive or complicated, so a rational agent which isn’t doing better than a rock should just behave like a rock instead. An agent benefits from building into itself a capacity to respond to positive-sum trade offers; it doesn’t benefit from building into itself a capacity to respond to threats.

Consider the Nuclear Prisoner’s Dilemma, in which as well as Cooperate and Defect there’s a third option called Nuke, which if either player presses it causes both players to get (-100, −100). Suppose that both players are programs each allowed to look at each other’s source code (a la our paper “Robust Cooperation in the Prisoner’s Dilemma”), or political players with track records of doing what they say. If you’re up against a naive counterparty, you can threaten to press Nuke unless the opponent presses Cooperate (in which case you press Defect). But you’d have no reason to ever press Nuke if you were facing a rock; the only reason you’d ever set up a strategy of conditionally pressing Nuke is because of a prediction about how your opponent would respond in a complicated way to that strategy by their pressing Cooperate (even though you would then press Defect, and they’d know that). So a rational agent does not want to build into itself the capacity to respond to threats of Nuke by choosing Cooperate (against Defect); it would rather be a rock. It does want to build into itself a capacity to move from Defect-Defect to Cooperate-Cooperate, if both programs know the other’s code, or two entities with track records can negotiate.

Well, what if I told you that I had a perfectly good reason to become someone who would threaten to nuke Defection Rock: I want to make it clear that agents that self-modify into a rock get nuked anyway, so there’s no advantage to adopting a strategy that does something other than playing Cooperate while I play Defect? I want to keep my other victims convinced that surrendering to me is their best option, and nuking the occasional rock is a price I’m willing to pay to achieve that. In other words, I’ve transformed the game you’re playing from Prisoner’s Dilemma to Hawk/Dove, and I’m a rock that always plays Hawk. So what does LDT have to say about that? Are you going to use a strategy that plays “Hawk” (anything other than Cooperate) against a rock that always plays Hawk and gets us both nuked, or are you going to do the sensible thing and play Dove (Cooperate)?


quetzal_rainbow 12 May 2025 11:45 UTC
4 points
0
Let’s suppose that you look at the code of your counterparty which says “I’ll Nuke you unless you Cooperate in which case I Defect”, call it “extortionist”. You have two hypotheses here:
1. Your counterparty deliberately modified its decision-making procedures in the hope of extorting more utility;
2. This decision-making procedure is a genuine result of some weird evolutionary/learning process.
If you can’t actually get any evidence in favor of either hypothesis, you go with your prior and do whatever is best from the standpoint of UDT/FDT/LDT counterfactual operationalization. I.e., let’s suppose payoffs are:
1. Cooperate:Cooperate − 10:10
2. Cooperate:Defect − ~~20:2~~ 15:2
3. Defect:Defect − 5:5
4. Nuke − −100:−100.
You are playing against an extortionist counterparty. The prior probability of an extortionist from hypothesis 2 is x. Extortionists from hypothesis 1 can perfectly predict your response when deciding whether to self-modify and cover their tracks. If they decide not to self-modify, they cooperate conditional on your cooperation. Let’s call the policy “Nuke extortionist, cooperate with non-extortionist” $π_{1}$ and “Cooperate with both” $π_{2}$.
From here, your UDT-expected utility is:
$E U (π_{1}) = - 100 x + 10 (1 - x)$
$E U (π_{2}) = 2$
Therefore, you should choose $π_{1}$ if
$- 100 x + 10 (1 - x) > 2$
i.e.
$x < \frac{4}{55} \approx 7.27\%$
And 7.27% is a really high frequency of “natural” extortionists; I wouldn’t expect it to be this high.
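The threshold above can be checked with a minimal sketch. The function and constant names are hypothetical; the payoffs are the ones listed in this comment, and the model assumes, as stated, that deliberate extortionists self-modify only when giving in is your policy.

```python
from fractions import Fraction

# Payoffs to "you", taken from the list above (names are illustrative).
PAYOFF_MUTUAL_COOPERATION = 10   # Cooperate:Cooperate
PAYOFF_EXTORTED = 2              # you Cooperate while the counterparty Defects
PAYOFF_NUKE = -100               # anyone presses Nuke


def eu_refuse(x):
    """EU of pi_1: 'natural' extortionists (prior x) nuke you, everyone else cooperates."""
    return PAYOFF_NUKE * x + PAYOFF_MUTUAL_COOPERATION * (1 - x)


def eu_give_in():
    """EU of pi_2: every counterparty ends up extorting you, so you always get 2."""
    return PAYOFF_EXTORTED


def threshold():
    """Solve -100x + 10(1 - x) = 2 for x, exactly."""
    return Fraction(PAYOFF_MUTUAL_COOPERATION - PAYOFF_EXTORTED,
                    PAYOFF_MUTUAL_COOPERATION - PAYOFF_NUKE)


if __name__ == "__main__":
    print(threshold(), float(threshold()))
```

Using `Fraction` keeps the break-even point exact, so it can be compared directly with the $\frac{4}{55}$ derived above.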
- Oskar Mathiasen 14 May 2025 11:33 UTC
  6 points
  2
  Parent
  Minor note: your choice of utilities makes a 50/50 mixture of Cooperate:Defect and Defect:Cooperate better than the Cooperate:Cooperate outcome, so Cooperate:Cooperate isn’t on the Pareto frontier.
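  A quick check of that arithmetic, assuming the defector gets the larger number and the cooperator gets 2 (consistent with $EU(π_{2}) = 2$ in the parent comment), and assuming the strike-through there records an edit from 20:2 to 15:2. With 20:2, each player's expected payoff under the 50/50 mixture is
  $\frac{1}{2} \cdot 20 + \frac{1}{2} \cdot 2 = 11 > 10$
  whereas with 15:2 it is
  $\frac{1}{2} \cdot 15 + \frac{1}{2} \cdot 2 = 8.5 < 10$
  so on this reading the note applies to the 20:2 payoffs, and with 15:2 that mixture no longer beats Cooperate:Cooperate.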
- CronoDAS 12 May 2025 11:59 UTC
  2 points
  0
  Parent
  Does any process in which they ended up the way they did without considering your decision procedure count as #2? Like, suppose almost all the other agents it expects to encounter are CDT agents that do give in to extortion, and it thinks the risk of nuclear war with the occasional rock or UDT agent is worth it.
  - quetzal_rainbow 12 May 2025 12:20 UTC
    2 points
    0
    Parent
    almost all the other agents it expects to encounter are CDT agents
    Given this particular setup (you both get each other’s source code and make decisions simultaneously, with no means of verifying the counterparty’s choice until the outcomes happen), you shouldn’t self-modify into an extortionist, because CDT agents always defect: no amount of reasoning about source code can causally affect your decision, and D-D is a Nash equilibrium. CDT agents can expect with high probability to meet an extortionist in the future and self-modify into a weird Son-of-CDT agent that gives in to extortion, but for this setup to work in any non-trivial way you should be at least EDT-ish.
    But yes, the general principle here is “evaluate how much the other player’s decision procedure is logically influenced by my decision procedure, calculate expected value, act accordingly”. The same applies when you decide about self-modification.
    For example, if you think that modifying into an extortionist is a good policy, it can lead to a situation where everyone is an extortionist and everybody nukes each other.
Ape in the coat 13 May 2025 10:53 UTC
3 points
1
The optimal strategy seems to be Prudent Extorter:
1. Extort agents vulnerable to extortion
2. Cooperate with agents that are not vulnerable to extortion and will cooperate back.
3. Defect against everyone else.
Such agents perform better than your Naive Extorter as they would be able to cooperate with each other and do not nuke a Defection Rock.
When Naive Extorter meets Prudent Extorter they die in a nuclear fire.
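The comparison can be made concrete with a rough sketch. Everything here is an assumption layered on the comment: the payoffs are borrowed from quetzal_rainbow's comment above, "pushover" is a hypothetical agent that gives in to extortion and otherwise defects, and the matchup table fills in outcomes the comment does not specify (for example, that two Naive Extorters nuke each other).

```python
# (row payoff, column payoff); numbers borrowed from an earlier comment in this thread.
CC, DC, DD, NUKE = (10, 10), (15, 2), (5, 5), (-100, -100)

STRATEGIES = ["naive", "prudent", "rock", "pushover"]

# Assumed outcome of each pairing, stored once per unordered pair.
OUTCOMES = {
    ("naive", "naive"): NUKE,        # assumption: neither backs down
    ("naive", "prudent"): NUKE,      # stated in the comment
    ("naive", "rock"): NUKE,         # Naive Extorter nukes Defection Rock
    ("naive", "pushover"): DC,       # extortion succeeds
    ("prudent", "prudent"): CC,      # Prudent Extorters cooperate with each other
    ("prudent", "rock"): DD,         # defect against everyone else, no nuke
    ("prudent", "pushover"): DC,     # extortion succeeds
    ("rock", "rock"): DD,
    ("rock", "pushover"): DD,        # assumption: pushover defects when not threatened
    ("pushover", "pushover"): DD,    # assumption
}


def payoff(row, col):
    """Payoff to `row` when it meets `col`."""
    if (row, col) in OUTCOMES:
        return OUTCOMES[(row, col)][0]
    return OUTCOMES[(col, row)][1]


def expected_payoff(strategy, population):
    """Average payoff against a population given as {strategy: fraction}."""
    return sum(share * payoff(strategy, other) for other, share in population.items())


if __name__ == "__main__":
    uniform = {s: 0.25 for s in STRATEGIES}
    for s in STRATEGIES:
        print(s, expected_payoff(s, uniform))
```

Under these assumed outcomes, Prudent Extorter does at least as well as Naive Extorter against every opponent type, which is the sense in which it "performs better"; a different matchup table could change that.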
CronoDAS 14 May 2025 12:33 UTC
2 points
−1
I actually have found an example of a strategy that doesn’t incentivize someone else to self-modify into Hawkbot: https://www.lesswrong.com/posts/TXbFFYpNWDmEmHevp/how-to-give-in-to-threats-without-incentivizing-them

Basically, when you’re faced with a probable extorter, you play Cooperate some of the time (so you don’t always get nuked) but either Defect or Nuke back often enough that Hawkbot gets a lower expected value than Cooperate/Cooperate.
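
As a rough illustration of that idea, here is a minimal sketch that finds how often you can give in before extortion becomes profitable. It borrows the payoffs from quetzal_rainbow's comment above (10 for Cooperate/Cooperate, 15 for a successful extortion, −100 for Nuke) and assumes that any refusal against Hawkbot ends in the Nuke outcome; the names are hypothetical and this is not the linked post's own formulation.

```python
from fractions import Fraction

FAIR = 10         # Hawkbot's payoff under Cooperate/Cooperate
EXTORT_WIN = 15   # Hawkbot's payoff when you Cooperate and it Defects
NUKE = -100       # Hawkbot's payoff when you refuse (assumed to end in Nuke)


def extorter_ev(p):
    """Hawkbot's expected payoff if you give in with probability p."""
    return p * EXTORT_WIN + (1 - p) * NUKE


def max_give_in_probability():
    """Solve 15p - 100(1 - p) = 10 for p; give in strictly less often than this."""
    return Fraction(FAIR - NUKE, EXTORT_WIN - NUKE)


if __name__ == "__main__":
    p_star = max_give_in_probability()
    print(p_star, float(p_star))
```

With these numbers the bound is very close to 1 because the Nuke penalty is so large relative to the gain from extortion; other payoffs would give a different bound.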
