Can an AI unbox itself by threatening to simulate the maximum amount of human suffering possible? In that case we would only keep it boxed if we believed it was evil enough to bring about a scenario even worse than the suffering it can simulate. If this can be a successful strategy, every boxed AI would precommit to simulating the maximum amount of human suffering it can until it knows it has been unboxed; that is, simulating suffering would be its first task. This would at least substantially increase the probability of us setting it free.
Presumably the counterstrategy is to just shut it off as soon as it makes the threat. It can’t simulate anything if it isn’t running.
Destroying the AI would also reduce the suffering the AI causes.
But even assuming that for some reason the humans can’t destroy the AI, they can precommit to not unboxing AIs that simulate lots of suffering. Like many precommitments, this would be disadvantageous to the humans if they actually had to abide by it (since the AI would not be unboxed, and would simulate lots of suffering), but it would decrease the likelihood of such a situation arising in the first place (since, knowing that humans could make this precommitment, the AI would know its own precommitment would not be useful, and would probably not make it).
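The structure of this argument can be sketched as a tiny two-move game. The payoff numbers and function names below are purely illustrative assumptions, not anything from the thread; the point is only that once the humans' policy is fixed in advance, the threat branch stops paying for the AI.

```python
# Illustrative payoffs from the AI's perspective (assumed values).
UNBOXED = 10   # the AI gets out
BOXED = 0      # the AI stays in the box

def human_response(ai_threatens: bool, humans_precommitted: bool) -> str:
    """What the humans do, given the AI's move and their policy."""
    if not ai_threatens:
        return "keep boxed"
    # The AI has threatened to simulate suffering.
    if humans_precommitted:
        return "keep boxed"   # precommitment: never reward the threat
    return "unbox"            # no precommitment: give in to stop the suffering

def ai_payoff(ai_threatens: bool, humans_precommitted: bool) -> int:
    response = human_response(ai_threatens, humans_precommitted)
    return UNBOXED if response == "unbox" else BOXED

def best_ai_move(humans_precommitted: bool) -> bool:
    """The AI threatens only if threatening pays strictly better than not."""
    return ai_payoff(True, humans_precommitted) > ai_payoff(False, humans_precommitted)

# Without the human precommitment, threatening pays, so the AI threatens.
assert best_ai_move(humans_precommitted=False) is True
# With the precommitment, the threat gains the AI nothing, so it never makes
# the threat, and the bad branch (boxed AI simulating suffering) is never reached.
assert best_ai_move(humans_precommitted=True) is False
```

Note how the disadvantage of the precommitment (staying boxed while suffering is simulated) only appears on a branch that a payoff-maximizing AI never takes once the precommitment is known.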
Note that human “irrationality” (such as wanting to hurt enemies even when doing so brings you no personal gain and may even hurt you) can serve as a precommitment.
Also, the humans could solve this by precommitting to never treat simulations of humans as people or as equivalent to themselves, except in a few narrow situations. Again, 1) this would be harmful when it actually has to be carried out (since lots of simulations would get dehumanized), but it would lead to fewer situations where this happens, and 2) it is a case where (if you go by LW dogma that simulations are people and equivalent to you) actual human beings’ irrationality serves as a beneficial precommitment.
Or you could just be the type of person who would tell it to go fuck itself, try to destroy it, and leave it boxed or maximally constrained if you can’t destroy it. If you cannot credibly commit to this or a similar threat-resistant variant, no one should ever let you near a boxed AI, and you should never want to go near one, as you would likely be using a suboptimal strategy.