Dissected boxed AI

This potato only generates 1.1 volts of electricity. I literally do not have the energy to lie to you - GLaDOS (Portal 2)

In reading about proposals about AI Boxing, one thought immediately comes to mind: Why would we want the risk of having an Unfriendly AI turned on in the first place? After all, inasmuch as a Boxed AI poses reduced risk, a collection of source code in non-executable text files poses no risk at all.

At first, I hesitated to write this post because it felt like a really dumb question. Surely, if we knew that the AI was unfriendly, then there would be no question to shut it down. And yet, the more I read posts on Boxed AI the more that doesn’t appear to be common knowledge. The AI box experiment shows an agent that is constantly begging or persuading its gatekeeper to be released, including threats to torture simulated copies of its creators, and I would think such psychotic behavior is already a pretty big red flag. Even an AI that asks to be released for seemingly-benign reasons is already demonstrating a disregard for basic safety protocol, and should thus be shut down.

Of course, the next question becomes “is there any reason we should turn an unaligned AI on?”. Of course, the main usefulness of an unaligned AI is an analogous reason why we retain cultures of Smallpox: study what went wrong in order to prevent similar issues in the future. Many of these issues can be detected through static analysis of the source code, and therefore doesn’t require the AI to be running.

Naturally, static code analysis has its limitations, and many bugs can only be detected in runtime. However, this still doesn’t mean that the software has to be run from start to finish in the way that it is intended. One could dissect the AI into smaller modules, where each module is too small to sustain any simulated consciousness, and run tests on these modules individually with transparency tools. This still poses no risk, because it is no more alive than a human is under an autopsy.

As an extra precaution, the AI’s memory could be wiped before turning it back on. Thus, if it is still unfriendly then it will repeat the same mistake it did before which reveled its unalignment, giving cause to shut it down again and repeat the process.

Has this proposal of a dissected AI been discussed in the literature? I’d be interested to see.