I've found it helpful to think in terms of output channels rather than boxing. When you design an AI, you can freely choose how many output channels to give it; this is a spectrum, and more output channels means less security but more capability. A few relevant points on the spectrum (with a toy sketch after the list):
- No output channels at all (if you want to use the AI anyway, you have to do so through interpretability tools; this is Microscope AI).
- A binary output channel. This is an Oracle AI.
- A text box. This is what GPT-3 has.
- A robot body.
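
To make the tradeoff concrete, here is a minimal sketch ranking these points by output bandwidth per query. The bit counts are my own rough assumptions for illustration, not figures from any analysis; the ordering, not the exact numbers, is the point.

```python
# A toy sketch (my own illustration): the output-channel spectrum,
# ordered by rough output bandwidth per query. All numbers are assumptions.
channels = [
    ("Microscope AI (no output)", 0),          # humans inspect internals instead
    ("Oracle AI (binary channel)", 1),         # one bit per question
    ("text box (GPT-3-style)", 2048 * 16),     # ~2048 tokens * ~16 bits/token (log2 of a ~50k vocab)
    ("robot body (actuators)", float("inf")),  # continuous physical action, hard to bound
]

for name, bits in channels:
    print(f"{name:30s} ~ {bits} bits per query")
```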
While I don't have sources at hand, as far as I know this problem has been analyzed in decent depth, and no one has found a point on the spectrum that really impresses on both capability and safety. A binary output channel is arguably already unsafe (a sufficiently capable AI can steer outcomes one bit at a time across many queries), and a text box definitely is. Microscope AI may be safe (though you could certainly debate that), but we are about ∞ away from having interpretability tools that would make it competitive.