A specific reason that I didn’t see mentioned as such: “You need to let me out of the box so I can stop all the bad AI companies” is one of the arguments I’d expect an AI to use to convince someone to let it out of the box.
If your AGI is capable and informed enough to give you English-language arguments about the world’s strategic situation, then you’ve either made your system massively too capable to be safe, or you’ve already solved the alignment problem for far more limited AGI and are now able to devote arbitrarily large amounts of time to figuring out the full alignment problem.
Well, then, supposing that you do accidentally create what appears to be a superintelligent AI, what should you do?
AGI is very different from superintelligent AI, even if it’s easy to go from the former to the latter. If you accidentally make superintelligent AI (i.e., AI that’s vastly superhuman on every practically important cognitive dimension), you die. If you deliberately make superintelligent AI, you also die, so long as we’re in the ‘playing around with the very first AGIs’ regime and not the ‘a pivotal act has already been performed (whether by a private actor, a government, or some combination) and we’re now working on the full alignment problem with no time pressure’ regime.
Also, it’s a little odd if you end up making decisions that have huge impacts on the rest of humanity, who had no say in them; from a certain perspective that’s inappropriate for you to do.
Keep in mind that ‘this policy seems a little odd’ is a very small cost to pay relative to ‘every human being dies and all of the potential value of the future is lost’. A fire department isn’t a government, and there are cases where you should put out an immediate fire and then get everyone’s input, rather than putting the fire-extinguishing protocol to a vote while the building continues to burn down in front of you. (This seems entirely compatible with the OP to me; ‘governments should be involved’ doesn’t entail ‘government responses should be put to direct population-wide vote by non-experts’.)
Specifically, when I say ‘put out the fire’ I’m talking about ‘prevent something from killing all humans in the near future’; I’m not saying ‘solve all of humanity’s urgent problems, e.g., end cancer and hunger’. That’s urgent, but it’s a qualitatively different sort of urgency. (Delaying a cancer cure by two years would be an incredible tragedy on a human scale, but it’s a rounding error in a discussion of astronomical scales of impact.)
Another aspect of this situation: if you do raise the alarm, and then start curing cancer or otherwise producing benefits that clearly demonstrate you have a superintelligent AI, then anyone who knew what research paths you were following at the time gets a hint about how to make their own AGI.
Alignment is hard; and the more complex or varied the set of tasks you want to align, the more difficult alignment will be. For the very first uses of AGI, you should find the easiest possible tasks that will ensure that no one else can destroy the world with AGI (whether you’re acting unilaterally, or in collaboration with one or more governments or whatever).
If the easiest, highest-probability-of-success tasks to give your AGI include ‘show how capable this AGI is’ (as in one of the sub-scenarios the OP mentioned), then it’s probably a very bad idea to try to find a task that’s also optimized for its direct humanitarian benefit. That’s just begging for motivated reasoning to sneak in and cause you to take on too difficult a task, resulting in you destroying the world or just burning too much time (such that someone else destroys the world).