I really don’t understand the AGI in a box part of your arguments: as long as you want your AGI to actually do something (it can be anything, be it you asked for a proof of a mathematical problem or whatever else), its output will have to go through a human anyway, which is basically the moment when your AGI escapes. It does not matter what kind of box you put around your AGI because you always have to open it for the AGI to do what you want it to do.
I’ve found it helpful to think in terms of output channels rather than boxing. When you design an AI, you can choose freely how many output channels you give it; this is a spectrum. More output channels means less security but more capability. A few relevant places on the spectrum:
No output channels at all. If you want to use the AI anyway, you have to do so with interpretability tools; this is Microscope AI.
A binary output channel. This is an Oracle AI.
A text box. GPT-3 has that.
A robot body
While I don’t have sources right now, as far as I know this problem has been analyzed at a decent level of depth, and no one has found a point on the spectrum that is really impressive in terms of both capability and safety. A binary output channel is arguably already unsafe, and a text box definitely is. Microscope AI may be safe (though you could certainly debate that), but we are about ∞ far away from having interpretability tools that would make it competitive.