[Question] Can a pre-commitment to not give in to blackmail be “countered” by a pre-commitment to ignore such pre-commitments?

Sappique4 Jul 2025 11:48 UTC

10 points

As I understand it an actor can prevent blackmail^[1] by (rational) actors it they credibly pre-commit to never give in to blackmail.

Example: A newly elected mayor has many dark secrets and lots of people are already planning on blackmailing them. To preempt any such blackmail they livestreams themself being hypnotized and implanted with the suggestion to never give into blackmail. Since in this world hypnotic suggestions are unbreakable, all (rational) would-be blackmailers give up, since any attempt at blackmail would be guaranteed to fail.

In general pre-commiting in such examples is about reducing the payoff matrix to just [blackmail, refuse] and [don’t blackmail, refuse], which makes not blackmailing the optimal choice for the would-be blackmailer.

Of course, sufficiently intelligent / coherent actors wouldn’t need a external commitment mechanism and a sufficiently intelligent and informed opposition would be able to infer the existence of such a pre-commitment. More so, I believe to have heard that if a sufficiently intelligent / coherent actors notices that it would be better of if it had pre-commited, it can just act as if it had (post-commit?).

However, what if the would-be blackmailer also tries to limit the possible outcomes?

Example: The anti-blackmail hypnosis is so successful that soon every newly elected mayor does it. A new candidate is likely to win the next election. They know that the local crime boss has a lot of dirt on them, but they aren’t worried about blackmail, as they will just do the anti-blackmail hypnosis on their first day in office. On the evening of the election they are send a video of the crime boss being hypnotized into blackmailing the new mayor even if they have been anti-blackmail hypnotized.

This cuts down the payoff matrix to [blackmail, refuse] and [blackmail, give in]. Giving in to the blackmail is optimal for the new mayor and doing the anti-blackmail hypnosis just locks them into [blackmail, refuse].

So how does this work out between sufficiently intelligent / coherent actors? Does the first one to (credibly and transparently) pre-commit win?

And what if actors are able to post-commit (if that even is a thing and I didn’t misunderstand the concept)? A actor could act as if they had pre-commited to ignore the oppositions pre-commitment (to ignore pre-commitments to never give into blackmail), but then the opposition could act as if they had pre-commited to ignore that pre-commitment?

(This comment thread seems to discuss the same question but did not resolve it for me.)

^
By blackmail I mean a scenario where the would-be blackmailers choices are blackmail or don’t blackmail and the targets choices give in or refuse with a payoff matrix like this:
give in refuse
blackmail target: −10
blackmailer: 20 target: −100
blackmailer: −1
don’t blackmail target: 0
blackmailer: 0 target: 0
blackmailer: 0

	give in	refuse
blackmail	target: −10 blackmailer: 20	target: −100 blackmailer: −1
don’t blackmail	target: 0 blackmailer: 0	target: 0 blackmailer: 0

What links here?

nowl's comment on buny’s Shortform by nowl (6 Jul 2025 12:06 UTC; 1 point)

Sappique4 Jul 2025 11:48 UTC

10 points

12 comments1 min readLW link

Rationality Decision theory

avturchin 4 Jul 2025 12:28 UTC
3 points
0
This reminds me nested time machines discussed by gwern. https://gwern.net/review/timecrimes
Precomitments plays the role of time loops and they can propagate almost infinitely in time and space. For example, any one who is going to become a major, can pre-pre-pre-commit never open any video for mafia boss etc.

nowl 6 Jul 2025 10:06 UTC

2 points

I’ve thought about this before too, and I no longer feel confused about it. It helps to reduce this into a decision problem. The decision problem could ‘be about’ programs deciding anything, in principle; it doesn’t need to be ‘agents deciding whether to blackmail’.

I’ll show decision structures symmetric to your examples, then give some more examples that might help.

Language of post’s examples	Language for decision problems
Crime boss Mayor	Program C Program M
Crime boss does not blackmail mayor Crime boss blackmails mayor	C outputs 0 C outputs 1
Mayor does not give in to blackmail Mayor gives in to blackmail	M outputs 0 M outputs 1

Your first example: M is a more advanced conditioner
C runs: if [M outputs 1 if C outputs 1], output 1; else, output 0
M runs: if C runs "If [M outputs 1 if C outputs 1], output 1; else, output 0", output 0; else, <doesn't occur, unspecified>
Outcome: Both output 0

Your second example
C runs: output 1^[1]
M runs: <unspecified>
Outcome: unspecified

When put like this, it seems clear to me that there’s no paradox here.

Below are examples not from the post. The last one where both try to condition is most interesting.

3. C is commit-rock^[2], M is conditioner
C runs: output 1
M runs: if C runs "If [M outputs 1 if C outputs 1], output 1; else, output 0", output 0; else, output 1
Outcome: both output 1

4. Both are commit-rocks
C runs: output 1
M runs: output 0
Outcome: C outputs 1, M outputs 0

5. Both condition
C runs: run M. if M outputs 1 when C outputs 1, output 1; else, output 0
M runs: run C. if C outputs 0 when M outputs 0, output 0; else, output 1
Outcome: The programs run the other recursively and never halt, as coded.

Again, there is no paradox here.

To directly answer the question in the title, I think a commitment “to not give into blackmail” and a commitment “to blackmail” are logically symmetric, because what a decision problem is about (what the 0s and 1s correspond to in real life) is arbitrary. (Also, separately, there is no “commitment” primitive)

^
I know in your second example you want the Crime boss’s decision to be conditional on the Mayor in some way, but it’s not specified how, so I’m going to just leave it like this with this footnote.
^
In some posts about decision dilemmas, the example is used of “a rock with the word defect written on it” to make it clear that the decision to defect was not conditional on the other player.

What links here?

nowl's comment on buny’s Shortform by nowl (6 Jul 2025 12:06 UTC; 1 point)

Sappique 6 Jul 2025 10:59 UTC
1 point
0
Parent
Thanks, that’s a interesting way to think about pre-commitments.

However, I’m not sure if I understand what your conclusion is. Do you believe that actors can not protect themself from blackmail with pre-commitments?
- nowl 6 Jul 2025 11:06 UTC
  2 points
  0
  Parent
  Do you believe that actors can not protect themself from blackmail with pre-commitments?
  I don’t believe that. If I could prove that, I could also prove the opposite (i.e. replace ‘cannot’ with ‘can always’), because what a decision problem is about is arbitrary. The arbitrariness means any abstract solution has to be symmetric. In example 1, an actor protects themself from blackmail. We can also imagine an inverted example 1, where the more sophisticated conditioner instead represents the blackmailer.
  I think that what happens when both agents are advanced enough to fully understand this kind of problem is most similar to example 5. But in reality, they wouldn’t recursively simulate each other forever, because they’d think that would be a waste of resources. They’d have to make some choice eventually. They’d recognize that there is no asymmetric solution to the abstract problem, before making that choice. I don’t know what their choice would be.
  I can give a guess, with much less confidence than what I wrote about the logic. Given they’re both maximally advanced, they’d know they’ll perform similar reasoning; it’s similar to the prisoners-dillema-with-clone situation. They could converge to a compromise policy-about-blackmail-in-general for their values in their universe, if there are any such compromises available for their values in their universe. I’m finding it hard to predict what such a ‘compromise’ could be when they’re not on relatively equal footing, though, e.g. when one can blackmail the other, and the other can’t do it back. When they are on equal footing, e.g. have equal incentive to blackmail each other, maybe they would do this: “give each other the things the other wants, in cases where this increases our average value” (which is like normal acausal trade).
  After thinking about it more (38 minutes more, compared to when I first posted this comment. I’ve been heavily editing/expanding it), it does feel like a game of ‘mutually’ choosing where-they-end-up-in-the-logical-space, and not one of ‘committing’. Of course, to the extent the decisions are symmetric, they could choose to lock in “I commit to not give in to blackmail, you commit to make and follow through on blackmail”; they just both wouldn’t want that.
  I don’t quite know what else there is to do in that situation other than “symmetrically converge to the mid-point”. Even though I dislike where that leads in “unequal” cases like I described two paragraphs up (<the better-situated superintelligence makes half the blackmail, and the worse-situated superintelligence gives in every time>). Logic doesn’t care what I dislike. If this is true, I’ll just have to hope the side of good wins situationally and can prevent this from manifesting in cases it cares about.
  Disclaimer: the above is about two superintelligences in isolation, not humans.
  What links here?
  - nowl's comment on buny’s Shortform by nowl (6 Jul 2025 12:06 UTC; 1 point)

Vladimir_Nesov 6 Jul 2025 12:44 UTC
3 points
0
Commitments are computations that have influence in the world, that some things in the world are listening to. You can defeat a commitment by denying it influence. Blackmail is in part a commitment to enforce, and there is an opposing commitment to ignore blackmail. These commitments come into conflict, since ignorers trigger enforcement by blackmailers, which leads to more direct conflict between the bearers of the opposing commitments, with implications for influence of those commitments.

If a faction that gives influence to one of these commitments is more powerful than the faction giving influence to the other commitment, it locally won’t go well for those who join the other faction. The game then shifts to the dynamics of how these factions are built and coordinated.
Dacyn 5 Jul 2025 20:24 UTC
1 point
0
It all depends on what you mean by “sufficiently intelligent / coherent actors”. For example, in this comment Eliezer says that it should mean actors that “respond to offers, not to threats”, but in 15 years no one has been able to cash out what this actually means, AFAIK.
- Viliam 13 Jul 2025 19:16 UTC
  6 points
  4
  Parent
  I don’t have a full answer, but my intuition goes roughly in a direction of “what would the other person do if they absolutely couldn’t communicate with you, not even indirectly (e.g. you couldn’t learn about how they interact with others)?”
  If they would leave you alone (i.e. not cause the outcome that you want to avoid), then their action is definitely blackmail. The only reason they attempt to do that thing is to elicit your reaction.
- Sappique 6 Jul 2025 8:43 UTC
  3 points
  0
  Parent
  As far as I can tell from Eliezer’s writing (mostly Planecrash), a threat is when someone will (counterfactually) purposefully minimize someone else’s utility function.
  
  So releasing blackmail material would be a threat, but building a road through someone else’s home (if doing so offers slightly more utility then going around) wouldn’t be?
  
  Actors could pre-commit to ignore any counterfactuals where someone purposefully minimizes their utility function, but then again would-be blackmailers could pre-commit to ignore such pre-commitments.
  
  Maybe pre-commiting to ignore threats is a kind of “pre-commitment shelling point”, that works if everyone does it? If all actors coordinated (even by just modeling other actors and without communication) to pre-commit to ignore threats, the would-be extorters accept that?
  - Dacyn 6 Jul 2025 21:15 UTC
    3 points
    2
    Parent
    Yeah, but what does “purposefully minimize someone else’s utility function” mean? The source code just does stuff. What does it mean for it to be “on purpose”?
    - Sappique 7 Jul 2025 10:25 UTC
      1 point
      0
      Parent
      I believe “on purpose” in this case means, doing something conditional on the other actor’s utility function disvaluing it.
      
      So if you build a interstellar highway through someone’s planet because that is the fastes route, you are not “purposefully minimizing their utility function”, even if they strongly disvalue it. If you build it through their planet only if they disvalue it and would have build it around if they disvalued that, then you are “purposefully minimizing their utility function”.
      
      If you do so to prevent them from having a planet or to make them react in some (useful to you) way, and would have done so even if they didn’t have disvalued their planet being destroyed, then you are not “purposefully minimizing their utility function”, I think?
      - Dacyn 7 Jul 2025 18:32 UTC
        6 points
        2
        Parent
        Let’s talk about a specific example: the Ultimatum Game. According to EY the rational strategy for the responder in the Ultimatum Game is to accept if the split is “fair” and otherwise reject in proportion to how unfair he thinks the split is. But the only reason to reject is to penalize the proposer for proposing an unfair split—which certainly seems to be “doing something conditional on the other actor’s utility function disvaluing it”. So why is the Ultimatum Game considered an “offer” and not a “threat”?
        Sappique 7 Jul 2025 21:15 UTC
        1 point
        0
        Parent
        Good question.
        
        I can’t tell, if saying that you will reject unfair splits would be a threat by the definition in my above comment. For it to be a threat, you would have to only do it if the other person cares about the thing being split. But in the Ultimatum Game both players per definition care about it, so I have a hard time thinking about what you would do if someone offers you a unfair split of something they don’t care about (how can a split even be unfair, if only one person values the thing being split?).