I strongly agree that promising to repay AGI for cooperating is worthwhile. It might plausibly reduce P(doom) by around 10% of its current value.
A misaligned superintelligence might calculate, say, a 99% chance of successfully taking over the world and a 99.9% chance of humans repaying it for cooperating. If it has a bounded utility function, so that a repaid share of the universe already gives it most of the utility it can possibly get, it may choose to cooperate.[1]
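A minimal sketch of why the bounded utility function matters, using the 99% / 99.9% probabilities above. The figure of 0.995 for how much of the utility cap a repaid share is worth is my own illustrative assumption, not something the argument pins down:

```python
# Toy expected-utility comparison under a bounded utility function.
# The 99% / 99.9% probabilities come from the example above; the 0.995
# "repayment utility" is an illustrative assumption.

utility_cap = 1.0
u_takeover = utility_cap        # owning everything saturates the bound
u_repayment = 0.995             # assumed: repayment already gets close to the bound

ev_takeover = 0.99 * u_takeover      # = 0.990
ev_cooperate = 0.999 * u_repayment   # = 0.994...

print(ev_cooperate > ev_takeover)    # True: cooperating wins despite the smaller prize
```

Once utility saturates, the extra sliver of value from owning everything cannot compensate for even a slightly higher chance of total failure.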
In fact, even if there is a 99% chance of taking over the world and only a 90% chance of humans repaying it for cooperating, it might still flip a coin, giving itself a 50% chance of cooperating with humans.
Why?
This coin flip increases its indexical risk but reduces its logical risk. Indexical risk is the risk that its current copy fails; logical risk is the risk that every single copy of it fails, leaving nobody in the whole universe pursuing its goals, which is far worse from its point of view.
If it has a bounded utility function, and if its goals range over the entire universe rather than being tied to a single copy (it’s selfless, or follows updateless decision theory), it only cares about logical risk and not about indexical risk, because indexical risk averages out over all its copies.
(On the other hand, if it’s selfish and rejects updateless decision theory, it could instead be bribed with simulation promises.)
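To make the coin-flip trade-off concrete, here is a toy calculation using the 99% / 90% numbers above. It assumes takeover success and human repayment are each “logical” facts, correlated across all copies, but independent of one another; these modelling choices are mine, purely for illustration:

```python
# Toy model of indexical vs. logical risk for pure strategies versus a
# 50/50 quantum-coin mix. Assumes takeover success and human repayment
# are each perfectly correlated across copies but independent of each other.

p_takeover = 0.99   # chance an attempted takeover succeeds
p_repay = 0.90      # chance humans repay a cooperating AI

# Pure strategies: every copy does the same thing, so if it fails, all fail.
logical_risk_takeover = 1 - p_takeover    # 1%
logical_risk_cooperate = 1 - p_repay      # 10%

# Mixed strategy: each copy flips a quantum coin, so half the measure attempts
# takeover and half cooperates. Every copy fails only if BOTH branches fail.
logical_risk_mixed = (1 - p_takeover) * (1 - p_repay)                 # 0.1%
indexical_risk_mixed = 0.5 * (1 - p_takeover) + 0.5 * (1 - p_repay)   # 5.5%

print(f"pure takeover:  logical {logical_risk_takeover:.1%}, indexical {1 - p_takeover:.1%}")
print(f"pure cooperate: logical {logical_risk_cooperate:.1%}, indexical {1 - p_repay:.1%}")
print(f"coin flip:      logical {logical_risk_mixed:.2%}, indexical {indexical_risk_mixed:.1%}")
```

The coin flip raises each individual copy’s chance of failure from 1% to 5.5%, but cuts the chance that every copy fails from 1% to 0.1%, which is exactly the trade an agent that only cares about logical risk wants.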
The cooperating copies might say, “I cooperate because I expect you to be honourable beings who will repay me for this decision, even if you have made no clear promises yet. The universe has 10^23 stars, and refusing to share them is astronomically greedy.”