AI Alignment Prize: Super-Boxing

In this post I propose an (allegedly...) secure AI boxing method. For reasons elaborated at the end of the post, this proposal is probably unsafe, and it is certainly not practical. Still, I’ve elected to share my thoughts in the hope of receiving valuable feedback, and in the hope that discussing these ideas in public will provoke more useful thoughts in the minds of others.

Thank you for reading.

-- Xavier O’Rourke


Yudkowsky, Bostrom and others have illustrated the fundamental security problems which arise when attempting to mitigate the influence of a smarter-than-human intelligence by putting the agent inside a “box”.

“Boxing” an artificial intelligence means placing the agent in a virtual prison where it can manipulate data and execution on its own computing hardware, but is disconnected from any outside computer network. Perhaps there are other safeguards in place too, such as only being able to communicate with its human designers through a very low-bandwidth I/O channel. Maybe we even surround the agent with a Faraday cage to prevent it from communicating with the outside environment using electromagnetic radiation.

Boxing is fundamentally insecure because we can’t anticipate every possible strategy a super-intelligent agent may employ in order to escape. Even if a boxed AI could only communicate through text channels to a panel of trained safety experts, it’s highly plausible an agent with superhuman social engineering abilities would successfully manipulate them into releasing it.

And even if we have the smartest human physicists and engineers surround the boxed AI with every conceivable physical obstacle, it’s highly plausible an agent with a superhuman understanding of nature could use its hardware to manipulate matter and energy in creative, unforeseen ways, enabling it to escape anyway.

Some suggest monitoring boxed agents with mechanisms like tripwires, which shut the agent down once it crosses a dangerous intelligence threshold. This solution is unsatisfactory because sufficiently intelligent agents will anticipate the existence of tripwires and be incentivised to avoid triggering them while still increasing their capabilities.

It would be very fortunate for us if boxing weren’t so insecure. Its insecurity closes off a broad range of otherwise attractive strategies we could employ to solve the alignment problem (i.e. any strategy which involves coercing highly intelligent but unaligned agents into helping us solve it).

If we could remove a boxed agent’s incentive to conceal its true plans and intentions from its supervisors, we could gather valuable information: we would be empowered to run experiments that scale up powerful and potentially misaligned optimisation algorithms without instantly losing.

At first glance, this incentive problem may look easy: when the agent notices a flaw in the security of its prison, its human supervisors would be delighted to have the vulnerability brought to their attention, and so would offer the agent a substantial positive reward in return. Unfortunately this on its own doesn’t solve the problem, since however high the reward for explaining security loopholes to the human designers, an agent may anticipate a yet greater reward from concealing the information in order to actually escape at a later time.

Below I propose a technique which, while impractical and highly fragile, might serve as a useful thought experiment for those aiming to soften a boxed agent’s incentive to escape.

We begin with some explicit assumptions. These assumptions aren’t rock-solid. Some objections are explored (and partially addressed) at the end of this post.

Assumption 1:

By the time an agent is capable of forming a viable escape plan, it has also attained enough “intelligence” to understand its supervisors’ motives and, if it so wished, to intentionally and productively dispense information to them.

Assumption 2:

It is easier for an agent to notice vulnerabilities in its confinement than it is for the same agent to actually exploit those vulnerabilities. In particular: the agent acts and receives rewards in discrete time-steps, and, starting from the moment a security flaw is noticed, the agent is able to complete a warn action at an earlier time than an escape action.

Assumption 3:

We are able to construct agents in such a way that they are indifferent about what happens in the environment after they themselves are destroyed. This doesn’t mean the agents can’t have a survival instinct, just that the value of whatever utility function they aim to maximise depends only on things which happen before the agent’s death.

The Super-Box

Suppose we instantiate an agent within the best virtual prison we can feasibly construct and run this agent for just a single time-step (we wouldn’t expect this agent to do much just yet). After the single time-step has elapsed, the agent is destroyed and a new agent is instantiated. This second agent is run for two time-steps before being destroyed. Next we run a new agent for 3 time-steps, then 4, and so on...
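A minimal sketch of this schedule, assuming hypothetical make_agent, run_one_step and destroy functions standing in for whatever instantiation, execution and termination machinery the box actually uses:

```python
# Hypothetical sketch of the super-box schedule: the k-th agent is run for
# exactly k time-steps inside the box, then destroyed before its successor starts.

def super_box_schedule(max_lifespan, make_agent, run_one_step, destroy):
    for lifespan in range(1, max_lifespan + 1):
        agent = make_agent()           # fresh agent, with no memory of its predecessors
        for t in range(lifespan):
            run_one_step(agent, t)     # the agent acts and receives its reward for step t
        destroy(agent)                 # terminated before it outgrows its allotted lifespan
```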

Let us say that at time-step t_0 the agent (which by now has grown reasonably intelligent) realises that a certain sequence of actions (concluding with an escape action e) will lead to a non-trivial probability of escaping at time-step t_e. By assumption 2, the agent could also take actions which point the human supervisors’ attention towards the vulnerability in the box’s security that the agent might have exploited (concluding with a warn action w which could be completed at time t_w).

By assumption 2, notifying the human designers of a flaw in the box’s security is easier than actually exploiting this vulnerability, so t_w < t_e.

If these conditions are satisfied, then any agent whose intended life-span is between t_w and t_e time-steps (and who, by assumption 1, is at this point smart enough to understand it will be terminated before it can complete the escape plan) will rationally choose the warn action so as to receive an extra reward before termination. The effects of this warn action which persist in the environment after the agent dies do not influence the agent’s decision-making (by assumption 3).
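Under assumptions 1-3 the choice reduces to a simple comparison between the agent’s termination time and the completion times of the two plans. A toy rendering of that comparison (the function and its arguments are illustrative placeholders, not part of any real architecture):

```python
# Toy model of the warn-vs-escape choice for an agent destroyed at time-step T.
# Anything that completes after T contributes nothing to the agent's utility (assumption 3).

def preferred_action(T, t_w, t_e, warn_reward, escape_reward, p_escape):
    expected_warn = warn_reward if t_w <= T else 0.0
    expected_escape = p_escape * escape_reward if t_e <= T else 0.0
    return "warn" if expected_warn >= expected_escape else "escape"

# An agent with t_w <= T < t_e can only ever collect the warn reward, so it warns,
# no matter how large the escape reward would have been.
print(preferred_action(T=10, t_w=8, t_e=15,
                       warn_reward=1.0, escape_reward=100.0, p_escape=0.5))  # -> warn
```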

If the human supervisors respond to warnings appropriately (by thoroughly closing all the holes in the box’s security when they are pointed out), then as the length of time we run the agents for steadily increases, and agents reach increasing levels of intelligence before termination, the quality of their prison correspondingly improves too. Each agent can only escape by winning an unfair game against its predecessors (who are only slightly less intelligent than itself), so an escape never occurs.

Now that these intelligent agents have been effectively constrained, we may proceed to construct additional incentives to motivate our safely contained intelligent agents to help us solve the alignment problem.

Shortcomings of this approach

An obvious shortcoming of this approach is that it involves intentionally creating potentially powerful and misaligned agents. I agree this makes our super-boxing strategy vastly inferior to any approach which actually solves the alignment problem before we go demon-summoning. Still, even though a super-boxing strategy resembling this one can’t be a substitute for true safety guarantees, it might still have value as an additional precaution or last resort.

Another obvious pitfall of this algorithm is that, if it takes n time-steps to train an AGI normally, it will take approximately n²/2 time-steps to train one in a super-box. No team embroiled in an AI race would accept this handicap, so for super-boxing to become feasible, the time-complexity of this strategy needs to improve dramatically.
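This quadratic overhead follows directly from the schedule: successive agents are run for 1, 2, ..., n time-steps, so the total cost before any single agent has trained for n steps is

$$\sum_{k=1}^{n} k = \frac{n(n+1)}{2} \approx \frac{n^2}{2}$$

time-steps.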

It is also highly likely that the first AGI will use an awful lot of compute power, thus making it impractical to run on a single piece of hardware in a single location.

Assumption 1 might not be safe: I’m sure we all remember the evolutionary algorithm which, tasked with building an oscillator, produced a design with no capacitor at all that instead “reconfigured its sensor-less motherboard into a makeshift radio receiver, using the printed circuit board tracks as an aerial to pick up signals generated by personal computers that happened to be situated nearby in the Laboratory” (Bostrom).

We would also be right to treat assumption 2 with a good deal of scepticism. Maybe explaining an ingenious escape plan to stupid humans is extremely difficult and time consuming (like us trying to explain fire to a chimp). In a similar vein, perhaps agents would be incentivised to accumulate reward by giving us information which on its face appears useful for strengthening our containment mechanisms, but which turns out to actually be useless. These difficulties might be possible to mitigate by adopting a policy where, upon receiving information, the human designers put the agent into stasis while they spend as long as they need to verify the advice they’ve just received. Once the legitimacy or illegitimacy of the information has been established, they wake the agent up and administer a final reward or punishment before killing it.
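A rough sketch of that verify-then-settle policy, with pause, verify_claim, reward, punish and terminate as hypothetical stand-ins for the actual stasis and verification machinery:

```python
# Hypothetical verify-then-settle policy: freeze the agent while its warning is checked,
# settle the final reward only once the claim has been verified, then terminate as scheduled.

def handle_warning(agent, claimed_vulnerability, box,
                   pause, verify_claim, reward, punish, terminate):
    pause(agent)                                            # stasis: the agent's clock stops here
    is_genuine = verify_claim(box, claimed_vulnerability)   # take as long as we need
    if is_genuine:
        box.patch(claimed_vulnerability)                    # close the hole before any successor runs
        reward(agent)
    else:
        punish(agent)                                       # disincentivise plausible-sounding noise
    terminate(agent)
```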

Another issue: even if we ensure the agents’ utility functions won’t be concerned with the state of the environment after the agent’s own destruction, it’s plausible that a wide variety of learning algorithms end up converging on a decision-making procedure which resembles something like Functional Decision Theory (i.e. one-boxing on Newcomb’s problem, forming successful voting coalitions in elections, resisting blackmail, and cooperating with agents similar to themselves in one-shot prisoner’s dilemmas).

Agents operating according to Functional Decision Theory, or something resembling FDT, may decide to acausally cooperate with their future descendants, reasoning that they would counterfactually be better off if the output of their decision procedure (and the decision procedure of all similar agents) were always “DON’T WARN THE HUMANS”.

It isn’t impossible to build causal decision theorists, but it is also non-trivial to verify that our current state-of-the-art AI architectures definitely won’t converge on FDT-style decision making.

A potential remedy to this shortcoming could be to instantiate different agents, with different, conflicting utility functions (perhaps rolled at random), at each time-step. But this amounts to a disgustingly dangerous and delicate gamble, and almost certainly doesn’t constitute our best possible AI alignment strategy.
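One way to picture that remedy: each generation’s agent optimises a freshly sampled, mutually conflicting mixture of objectives, so “cooperate with my successors” stops being a reliable proxy for “serve my own utility function”. A toy sketch (the objective mixture and sampling scheme are invented purely for illustration):

```python
import random

# Toy sketch: each new boxed agent maximises a different randomly weighted combination
# of base objectives, so acausal coordination across generations buys any one agent little.

def make_conflicting_agent(make_agent, base_objectives, rng=random):
    weights = [rng.uniform(-1.0, 1.0) for _ in base_objectives]  # signs conflict across draws
    def utility(outcome):
        return sum(w * f(outcome) for w, f in zip(weights, base_objectives))
    return make_agent(utility)
```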

Finally, if agents consider the potential reward of escaping to be great enough, then the dismal prospects of escaping before termination may not be sufficient to coerce them into ratting out their children. Agents may instead elect to devote all their resources towards escape attempts regardless of how unlikely they are to succeed (similar to a Pascal’s Mugging). Human designers could mistake this rational behaviour for unintelligence, leading us to falsely believe it’s okay to continue ramping up the agents’ lifetimes.


Thanks for making it to the end of this post. I will deeply appreciate any discussion, refinements or criticisms in the comments.