Inhuman world-optimizing agents are unlikely to turn the Universe into paperclips because that’s not the most likely failure mode.
Suppose 100 such agents are made. The first 99 brick themselves: they wirehead and then do nothing. The 100th destroys the world. You need to claim that the world-destroying failure mode is so unlikely that every AI experiment on Earth fails to hit it.
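To put rough numbers on that (mine, purely for illustration, not anyone's actual estimate): if each experiment independently hits the world-destroying mode with probability p, the chance that at least one of N experiments does is

$$P(\text{at least one catastrophe}) = 1 - (1 - p)^N$$

Even p = 0.01 with N = 100 gives 1 − 0.99^100 ≈ 0.63. "Most failures are harmless" does nothing to make the rare failure safe unless p is tiny compared to the number of attempts.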
A world-optimizing agent must align its world model with reality. Poorly-aligned world-optimizing agents instrumentally converge, not on seizing control of reality, but on the much easier task of seizing competing pieces of their own mental infrastructure.
What's more, this argument needs to hold even when there are humans watching their AI seize its own mental infrastructure and trying to design an AI that doesn't do this. Do you believe the only options are agents that harmlessly wirehead and aligned AI?
A misaligned world optimizer that seeks to minimize conflict between its sensory data and internal world model will just turn off its sensors.
Maybe, but will it write a second AI, one that takes over the world to ensure the first AI keeps receiving power and doesn't have its sensors switched off? If you really care about your own mental infrastructure, the easiest way to control it might be to code a second AI that takes nanotech to your chip.
Or maybe, once the conflict is done, and half the AI has strangled the other half, the remaining mind is coherent enough to take over the world.
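As a toy sketch of the failure mode under discussion (entirely illustrative; the costs and noise level are made-up parameters): an agent scored on prediction error can lower that error either by genuinely modelling a noisy world or by seizing the sensor channel so the observation just echoes its own prediction, and the tampering route tends to be cheaper.

```python
import random

# Toy model of an agent scored on prediction error. It can lower that error
# by genuinely modelling a noisy world, or by seizing the sensor channel so
# the "observation" simply echoes its own prediction (wireheading).

WORLD_NOISE = 5.0        # irreducible difficulty of predicting the world
MODEL_EFFORT_COST = 1.0  # ongoing cost of doing real modelling work
TAMPER_COST = 0.1        # one-off cost of grabbing the sensor channel

def honest_step():
    # Real modelling: some residual error remains, plus the effort cost.
    return abs(random.gauss(0.0, WORLD_NOISE)) + MODEL_EFFORT_COST

def wirehead_step():
    # The sensor echoes the model, so prediction error vanishes; only the
    # small tampering cost remains.
    return TAMPER_COST

# A pure error-minimizer compares expected scores and takes the cheaper one.
honest_score = sum(honest_step() for _ in range(1000)) / 1000
print(f"honest modelling, mean score: {honest_score:.2f}")    # roughly 5
print(f"wireheading, score:           {wirehead_step():.2f}")  # 0.10
```

Nothing in this toy setup says the cheapest way to keep the channel seized must be passive; building a second system to guard the chip is the same move at a larger scale.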
Would you like to publicly register a counterprediction?