Yeah, I wanted to hear your actual thoughts first, but I considered going into four possible objections:
1. If there’s no way to build a “wall”, perhaps you can still ensure a multipolar outcome via the threat of mutually assured destruction.
2. If MAD isn’t quite an option, perhaps you can still ensure a multipolar outcome via “mutually assured severe damage”: perhaps both sides would take quite a beating in the conflict, such that they’ll prefer to negotiate a truce rather than actually attack each other.
3. If an AGI wanted to avoid destruction, perhaps it could just flee into space at some appreciable fraction of the speed of light.
4. In principle, it should be possible to set up MAD, or set up a tripwire that destroys whichever AGI tries to aggress first. E.g., just design the two AGIs yourself, and have a deep enough understanding of their brains that you can stably make them self-destruct as soon as their brain even starts thinking of ways to attack the other AGI (or to self-modify to evade the tripwire, etc.). And since this is possible in principle, perhaps we can achieve a “good enough” version of this in practice.
I don’t think MAD is an option. “MAD” in the case of humans really means “Mutually Assured Heavy Loss Of Life Plus Lots Of Infrastructure Damage”. MAD in real life doesn’t assume that a specific elected official will die in the conflict, much less that all humans will die.
For MAD to work with AGI systems, you’d need to ensure that both AGIs are actually destroyed in arbitrary conflicts, which seems effectively impossible. (Both sides can just launch back-ups of themselves into space.)
With humans, you can bank on the US Government (treated as an agent) having a sentimental attachment to its citizens, such that it doesn’t want to trade away tons of lives for power. Also, a bruised and bloodied US Government that just survived an all-out nuclear exchange with Russia would legitimately have to worry about other countries rallying against it in its weakened, bombed-out state.
You can’t similarly bank on arbitrary AGIs having a sentimental attachment to anything on Earth (such that they can be held hostage by threats of damage to Earth), nor can you bank on arbitrary AGIs being crippled by conflicts they survive.
Option 2 seems more plausible, but still not very plausible. The amount of resources you can lose in a war on the scale of the Earth is just very small compared to the amount of resources at stake in the conflict. Values handshakes seem more plausible if two mature superintelligences meet in space, after already conquering large parts of the universe; then an all-out war might threaten enough of the universe’s resources to make both parties wary of conflict.
I don’t know how plausible option 3 is, but it seems like a fail condition regardless: spending the rest of your life fleeing from a colonization wave as fast as possible, with no time to gather resources or expand into your own thriving intergalactic civilization, means giving up nearly all of the future’s value and surrendering the cosmic endowment.
Option 4 seems extremely difficult to pull off, and very strange to even attempt. If you have that much insight into your AGI’s cognition, you’ve presumably solved the alignment problem already and can stop worrying about all these complicated schemes. And long before one AGI could achieve such guarantees about another AGI (much less both achieve those guarantees about each other, somehow simultaneously?!), it would be able to proliferate nanotech to destroy any threats (that haven’t fled at near-light-speed, at least).
B has a clear incentive not to pick a fight it is highly uncertain it can win.
I don’t expect enough uncertainty for this. If the two sides in a dispute aren’t uncertain about who would win, then the stronger side will unilaterally choose to fight (though the weaker side obviously wouldn’t).
Agree that option-1 (literal destruction) is implausible.
Option 2 is much more likely, primarily because who wins the contest is (in my model) sufficiently uncertain that in expectation war constitutes large value destruction even for the winner. In other words, suppose choosing “war” gives a [30% probability of losing 99% of my utility over the next billion years, and a 70% probability of winning but still losing 20% of my utility to the costs of the conflict], whereas choosing peace gives a [100% chance of achieving 60% of my utility] (assuming some positive-sum overlap between the respective objective functions). Then war’s expected utility is about 0.56 versus 0.60 for peace, and the agents choose peace.
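As a quick sanity check on this arithmetic (a sketch; the 20% winner’s-cost figure is an illustrative assumption on my part — note that if the winner pays no cost at all, war’s expected utility comes out to 0.703 and beats the 0.60 of peace, so peace only dominates once the winner’s own war damage exceeds roughly 15%):

```python
# Sketch of the expected-utility comparison above. All numbers are
# illustrative, including the winner's residual war cost.

def war_ev(p_lose=0.3, loser_keeps=0.01, winner_cost=0.20):
    """Expected fraction of utility retained by choosing war:
    with prob p_lose you keep loser_keeps; otherwise you win
    but still pay winner_cost out of your utility."""
    return p_lose * loser_keeps + (1 - p_lose) * (1 - winner_cost)

PEACE_EV = 0.60  # guaranteed share from a positive-sum peace deal

print(f"war EV:   {war_ev():.3f}")   # ~0.563 -> peace preferred
print(f"peace EV: {PEACE_EV:.3f}")

# If the winner pays nothing, war dominates; solve for the break-even
# winner cost: war_ev = PEACE_EV  =>  cost ~= 0.147.
breakeven = (war_ev(winner_cost=0.0) - PEACE_EV) / 0.7
print(f"break-even winner cost: {breakeven:.3f}")
```

So under these toy numbers the peace-over-war conclusion hinges on the winner itself expecting to lose at least ~15% of its utility to the conflict.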
But this does depend on the existence of meaningful uncertainty even post-FOOM. What is your reasoning for why uncertainty would be so unlikely?
Even in board games like Go (with a far more constrained strategy space than reality), it is computationally infeasible to consider all possible future opponent strategies, so against a near-peer adversary, action-values still carry high uncertainty. Do you just think that “game theory that allows an AGI to compute general-equilibrium solutions and certify dominant strategies for multi-agent games as complex as AGI war” is computationally tractable for an Earth-bound AGI?
If that’s a crux, I wonder if we can find some hardness proofs of different games and see what it looks like on simpler environments.
EDIT: consider even the super-simple risk that B tries to destroy A, but A manages to send a couple of near-light-speed probes into the galaxy (or nearby galaxies) just to inform any currently-hiding AGIs about B’s historical conduct, untrustworthiness, and refusal to live-and-let-live. If an alien AGI C ever encounters such a probe, it would update towards non-cooperation enough to permanently worsen B–C relations should they ever meet. In this sense, the permanent loss from war becomes certain so long as the AGI assigns an ongoing nonzero probability to encountering alien superintelligences.