I’m a bit confused how this is a problem.
Either there is an agent that stands to benefit from my acceding to a threat, or there is not. If an agent “sufficiently” turns itself into a rock for a single interaction, but reaps the benefit as an agent, it’s a full-fledged agent. Same if it sends a minion: the relevant agent is the one who sent the rock, not the rock itself. And if we have uncertainty about the situation, that’s part of the game.
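To make that first branch concrete, here is a toy sketch in Python (the payoff numbers and function name are my own illustration, not anything from the setup above) of the standard point that a threatener only stands to benefit if your policy is to pay, so committing in advance to never pay removes its reason to threaten at all:

```python
# Toy threat game under policy-level (LDT-style) reasoning.
# Payoffs are illustrative numbers, written as (threatener, target).
PAYOFFS = {
    ("no_threat", None):    (0, 0),      # nobody does anything
    ("threaten", "pay"):    (5, -5),     # target caves and pays the demand
    ("threaten", "refuse"): (-2, -10),   # threat gets carried out; both lose
}

def threatener_best_response(target_policy):
    """Issue the threat only if it beats staying quiet, given that the
    target has already committed to `target_policy`."""
    if PAYOFFS[("threaten", target_policy)][0] > PAYOFFS[("no_threat", None)][0]:
        return "threaten"
    return "no_threat"

for policy in ("pay", "refuse"):
    action = threatener_best_response(policy)
    outcome = PAYOFFS[(action, policy if action == "threaten" else None)]
    print(f"target policy={policy!r}: threatener plays {action!r}, payoffs {outcome}")

# target policy='pay':    threatener plays 'threaten', payoffs (5, -5)
# target policy='refuse': threatener plays 'no_threat', payoffs (0, 0)
```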
If the question is whether other players can deceive you about the nature of the game or the probabilities, then sure, that is a possibility, but it is not really a question about LDT. It’s just a question about whether we should expand every decision into a recursive web of uncertainties about all other possible agents, and I suspect the conclusion would be that smarter agents can likely fool you, and that you shouldn’t let others with misaligned incentives manipulate your information environment, especially if they have more optimization power than you do. But as we all should know, once we make misaligned superintelligent systems, we stop being meaningful players anyway.
In this world, maybe you want to suppose the agent’s terminal value is to cause me to pay some fixed cost, and that it permanently disables itself to that end. But that makes it either a minion sent by something else, or a natural feature of a Murphy-like universe where you started out screwed, in which case you should treat the natural environment as an adversary. Again, that’s not our situation, at least until ASI shows up.
cc: @Mikhail Samin—does that seem right to you?
Seems right!