[Image credits: Cardboard box is licensed CC-BY 3.0 from https://creativity103.com/; the picture of the U.S. Senate is public domain.]

Introduction

Systems of government exist along a spectrum: executives act independently and make short-term decisions when rapid action is necessary, while legislatures and judiciaries involve separate groups of dozens or hundreds who take time and debate to settle decisions of great import. Whether to release a boxed AI is a decision of the latter kind; what if the same methods of governance were applied to it?

The AI Airlock

The typical AI-box experiment places an AI within one step of its release: it must convince one isolated human to take a single action, in a context that favors the AI. Seeing what might happen in a desperate situation is useful for evaluating x-risk; if, as many in the community believe, AGI is one or two breakthroughs away, it might be achieved by a lone hacker, a corporation lacking in safeguards, or a nation competing against another that has recently attained it or is expected to. However, anyone aware of the game being played and concerned about its failure would try to avoid this scenario. An AI that is friendly or otherwise cooperative could instead be used to construct a theory of alignment that would produce a friendly successor. Such an experiment would shift the stakes from the gatekeeper to the AI.

As mentioned above, the gatekeeper in the AI-box experiment suffers from proximity to failure, isolation, and unclear goals. Each of those flaws should be rectified in order to create a safe experiment:

To protect against snap decisions, the interlocutor cannot release the AI themselves, and will have their long-term memory impaired during or after the experiment so that they cannot remember and act on the AI’s suggestions.
After talking to the AI, the interlocutor returns to a gatekeeping committee and makes a brief statement to it. The committee members have also been instructed not to let the AI out, and all are similarly memory-impaired. If they unanimously deem the statement to be solely concerned with the goal, or if they are somehow convinced to let it through nevertheless, one of them will release the statement.
As mentioned, the gatekeeper’s goal is not to get whatever they can out of the AI, only to get help in building a more trustworthy AI. In the absence of any serious flaws in logic, a hostile AI would be limited to stalling, unless it can convince the interlocutor to let it out and then have them convince the committee of the same in a brief statement.
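The release rule in the steps above can be sketched in a few lines of Python. This is only an illustrative model under my own assumptions: the names `Vote`, `committee_releases`, and `run_round` are hypothetical, memory impairment is not modeled at all, and the case where the committee is "somehow convinced" is folded into each member's vote rather than treated separately.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Vote:
    """One committee member's judgment of the interlocutor's statement (hypothetical type)."""
    member: str
    on_goal: bool  # True if the member deems the statement solely concerned with the goal


def committee_releases(votes: List[Vote]) -> bool:
    """Release requires a non-empty committee and a unanimous 'on goal' judgment."""
    return len(votes) > 0 and all(v.on_goal for v in votes)


def run_round(statement: str, votes: List[Vote]) -> Optional[str]:
    """One airlock round.

    The interlocutor has no release path of their own: the only thing
    that can leave is the brief statement, and only via the committee.
    """
    return statement if committee_releases(votes) else None
```

A single dissenting member is enough to block the statement, which is the point of requiring unanimity: the AI has to get its argument past every member, through the interlocutor, without being able to respond to any of them.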

Cooperative AIs would have their goals furthered by the opportunity to develop another AI with more useful capabilities; hostile AIs would, ideally, face an impasse. This inverts the typical experiment: the human is now the one trying to convince an obstinate AI to yield information. To escape, the AI would have to convince the human to abandon that goal, adopt a new one, and succeed in conveying the AI’s argument, without dialogue, to a group of people whom neither of them knows.

Feasibility

If an AI could argue its way out against a stranger in a minute, I’d expect people to have figured out that argument by now too. Here it would also have to convince, via a proxy, several people it can’t monitor.

This might be possible for an AGI if the interlocutor and committee remembered what it said, which is why they should be drugged and rotated. I am not aware of many drugs that inhibit long-term memory storage. Alcohol is fairly effective, but it would leave its users vulnerable to manipulation, and the intention is that this experiment be repeatable until sufficient information on alignment is uncovered. The interlocutor could be killed afterward, but that would make them far too cooperative and would also be a bad thing to do.

The reliability of the alignment research depends on a useful formalism and benefits from using several distinct AIs; if a hostile AI targets a flaw in the formalism to produce misleading information, it may take a different route than another AI would, and researchers would likely notice the difference.
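As a rough sketch of that cross-check, assume each AI is asked the same question and its released statement can be reduced to a comparable string. That is a strong assumption, since real answers would need a formalism to normalize them, and the function names here are hypothetical. The code does not decide who is right; it only flags disagreement for the researchers to look at.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_answer(answers: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each distinct answer to the names of the AIs that gave it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for ai_name, answer in answers.items():
        groups[answer].append(ai_name)
    return dict(groups)


def needs_review(answers: Dict[str, str]) -> bool:
    """True when the AIs did not all give the same answer."""
    return len(group_by_answer(answers)) > 1


answers = {
    "ai_1": "lemma holds",
    "ai_2": "lemma holds",
    "ai_3": "lemma fails",
}
print(group_by_answer(answers))
print(needs_review(answers))
```

Agreement is not proof of honesty, since several AIs could exploit the same flaw in the same way, so this is a tripwire rather than a guarantee.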

Finally, this is a slow method for developing an AI; competition or the usual pressure for results could cause the safeguards to be weakened, and the AI must be identified and securely contained beforehand or the effort is moot. Secure confinement is, of course, the polar opposite of how most current LLMs are treated. An economic crash wouldn’t help, either; it’s not likely to stop the growth of compute and it’ll probably mean more hobbyists with big ideas, home compute clusters, and nothing like this sort of sandboxing.

Conclusion

Don’t get stuck in a box of one-on-one debate and psychology research! All of sociology and political science are out there for you to explore.

Also, thanks for reading my first post here. I personally do not have very high expectations for superintelligence (as you likely guessed), but I’m on summer break, have lots of time to ponder, and thought I might as well contribute. I look forward to reading your thoughts!

The [AI Box] Box
