I think it might be as simple as not making threats against agents with compatible values.
In all of Yudkowsky’s fiction, the distinction between threats (and other unilateral actions that remove consent from another party) and deterrence comes down to incompatible values.
The baby-eating aliens are denied access to a significant portion of the universe (a unilateral harm to them) over irreconcilable value differences. Harry Potter transfigures Voldemort away semi-permanently and non-consensually because of irreconcilable value differences. Carissa and friends deny many of the gods their desired utility over value conflict.
Planecrash fleshes out the metamorality with its presumed external simulators, who only enumerate the worlds that satisfy enough of their values; the negative utilitarians arguably hold the strongest acausal “threat” by being the most selective about which worlds they enumerate.
Cooperation happens where there is at least some overlap in values, and therefore some gains from trade to be made. If no mutual gains from trade are possible, the rational action is to defect, at a per-agent cost of up to the absolute value of the negative utility of letting the opposing agent achieve their own utility. Not quite a threat, but a reality of irreconcilable values.
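A toy sketch of that trade condition (my own illustration, not a formalism from the stories; the payoff numbers and the `best_trade` helper are made up for the example):

```python
# Toy model (my own framing, not Yudkowsky's): two agents each assign a
# utility to every joint outcome. Trade is rational only if some outcome
# beats both agents' go-it-alone baselines; otherwise each agent will
# rationally pay up to the disutility of the other's success to block it.

def best_trade(outcomes, baseline_a, baseline_b):
    """Return the largest combined surplus over both baselines among
    outcomes that improve on both agents' baselines, or None if no
    outcome offers mutual gains from trade."""
    gains = [
        (ua - baseline_a) + (ub - baseline_b)
        for ua, ub in outcomes
        if ua > baseline_a and ub > baseline_b
    ]
    return max(gains, default=None)

# Overlapping values: some outcome benefits both, so cooperation pays.
print(best_trade([(3, 1), (1, 4)], baseline_a=0, baseline_b=0))  # 5

# Irreconcilable values: every gain for one is a loss for the other,
# so there is nothing to trade and defection is the rational default.
print(best_trade([(3, -3), (-2, 2)], baseline_a=0, baseline_b=0))  # None
```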
Yudkowsky put a lot of focus on never giving in to threats, and that was one part I never understood. He said dath ilan would destroy the universe before giving in to aliens who say “give us $5 or we’ll destroy the universe”. But humans give in to far worse threats than that all the time, all over the place, especially in positions of power, and yet if we all went MAD (mutual assured destruction) every time, there would be no humanity left.
On a recent re-read, I think I understand it a bit better.
It’s true that individual humans can’t realistically avoid giving in to threats, or even avoid accidentally threatening others, but institutions can commit to refusing threats as a legible position, e.g. “we will not negotiate with terrorists”.
If an irrational entity has the ability to unilaterally destroy the universe, the universe is probably doomed anyway, so it makes more sense to follow through on precommitments, in the real world and in counterfactuals, in order to coordinate with the agents that actually are rational.
I think the key is that if we all went MAD legibly at the same time, things would work out a lot better. And refusing to give in to threats doesn’t necessarily mean destruction; it can be as simple as collectively refusing to pay ransomware attackers, even though refusing is currently more expensive, in the expectation that it will eventually be cheaper once attacks stop paying off.
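A toy expected-cost comparison of that ransomware logic (the per-incident costs and the 50%-per-year decline in attacks under refusal are numbers I'm assuming purely for illustration):

```python
# Toy model (illustrative assumptions, not data): paying a ransom is
# cheaper per incident, but paying keeps the attack rate high, while a
# legible collective refusal shrinks the attackers' business over time.

def total_cost(pay: bool, years: int = 10) -> float:
    ransom, recovery = 1.0, 5.0      # assume refusing costs 5x more per incident
    attack_rate = 1.0                # attacks per year, initially
    total = 0.0
    for _ in range(years):
        total += attack_rate * (ransom if pay else recovery)
        # Attackers follow the money: paying sustains the business model,
        # refusing halves the attack rate each year (assumed).
        attack_rate *= 1.0 if pay else 0.5
    return total

print(total_cost(pay=True))   # 10.0 -- cheap every year, forever
print(total_cost(pay=False))  # ~9.99 -- expensive up front, cheaper in total
```

Under these made-up numbers refusal only just breaks even at year ten, but past that point the gap widens indefinitely, since paying keeps funding the attacks.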