AI Alignment Prize: Super-Boxing

In this post I propose an (allegedly...) secure AI boxing method. For reasons elaborated at the end of the post, this proposal is probably unsafe, and is certainly not practical. Still, I've elected to share my thoughts on this in the hope of receiving valuable feedback, as well as in the hope that discussing these ideas in public will provoke more useful thoughts in the minds of others.

Thank you for reading.

-- Xavier O’Rourke


Yudkowsky, Bostrom and others have illustrated the fundamental security problems which arise when attempting to mitigate the influence of a smarter-than-human intelligence by putting the agent inside a "box".

"Boxing" an artificial intelligence means placing the agent in a virtual prison where it can manipulate data and execution on its own computing hardware, but is disconnected from any outside computer network. Perhaps there are other safeguards in place too, such as only being able to communicate with human designers through a very low-bandwidth I/O channel. Maybe we even surround the agent with a Faraday cage to prevent it from communicating with the outside environment using electromagnetic radiation.

Boxing is fundamentally insecure because we can't anticipate every possible strategy a super-intelligent agent may employ in order to escape. Even if a boxed AI could only communicate through text channels to a panel of trained safety experts, it's highly plausible an agent with superhuman social engineering abilities would successfully manipulate them into releasing it.

And even if we have the smartest human physicists and engineers surround the boxed AIs with every conceivable physical obstacle, it's highly plausible an agent with a superhuman understanding of nature could use its hardware to manipulate matter and energy in creative, unforeseen ways, enabling it to escape anyway.

Some suggest monitoring boxed agents with mechanisms like tripwires which trigger after an agent crosses a dangerous intelligence threshold (shutting the agent down when a tripwire is triggered). This solution is unsatisfactory because sufficiently intelligent agents will anticipate the existence of tripwires and be incentivised to avoid them, while still increasing their capabilities.

It would be very fortunate for us if boxing weren't so insecure: its insecurity closes off a broad range of otherwise attractive strategies we could employ to solve the alignment problem (i.e. any strategy which involves coercing highly intelligent but unaligned agents into helping us solve the alignment problem).

If we could remove the incentive a boxed agent has to conceal its true plans and intentions from its supervisors, we could gather valuable information: we would be empowered to run experiments involving the scaling up of powerful and potentially misaligned optimisation algorithms without instantly losing.

At first glance, this incentive problem may look easy: when the agent notices a flaw in the security of its prison, its human supervisors would be delighted to have this vulnerability brought to their attention, so would offer the agent a substantial positive reward in return. Unfortunately this on its own doesn't solve the problem, since however high the reward for explaining security loopholes to human designers, an agent may anticipate a yet greater reward from concealing information in order to actually escape at a later time.
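To state the worry a little more explicitly (using R_warn, U_escape and p_escape as illustrative symbols, not quantities defined anywhere above): a reward-maximising agent will prefer to conceal the vulnerability whenever

$$p_{\text{escape}} \cdot U_{\text{escape}} > R_{\text{warn}},$$

and since the payoff of actually escaping can be enormous by the agent's own lights, no fixed warn-reward is obviously large enough to win this comparison.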

Below I propose a technique which (while impractical and highly fragile) might serve as a useful thought experiment for those aiming to soften a boxed agent's incentive to escape.

We begin with some explicit assumptions. These assumptions aren't rock-solid. Some objections are explored (and partially addressed) at the end of this post.

Assumption 1:

By the time an agent is capable of forming a viable escape plan, it has also attained enough "intelligence" to understand its supervisors' motives and, if the agent so wished, to intentionally and productively dispense information to its human supervisors.

Assumption 2:

It is easier for an agent to notice vulnerabilities in its confinement than it is for the same agent to actually exploit those vulnerabilities. In particular: the agent acts and receives rewards packaged according to discrete time-steps. Starting from the moment a security flaw is noticed, the agent is able to complete a warn action at an earlier time than an escape action.

Assumption 3:

We are able to construct agents in such a way that they are indifferent about what happens in the environment after they themselves are destroyed. This doesn't mean the agents can't have a survival instinct, just that the value of whatever utility function they aim to maximise only depends on things which happen before the agent's death.

The Super-Box

Suppose we instantiate an agent within the best virtual prison we can feasibly construct and run this agent for just a single time-step (we wouldn't expect this agent to do much just yet). After the single time-step has elapsed, the agent is destroyed and a new agent is instantiated. This second agent is run for two time-steps before being destroyed. Next we run a new agent for 3 time-steps, then 4, and so on...
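To make the schedule concrete, here is a minimal sketch of the outer loop, assuming nothing about the agent's internals. `Agent` and `Box` are hypothetical placeholders for whatever learning architecture and containment machinery would actually be used.

```python
# Minimal sketch of the super-boxing schedule: generation k lives for exactly k time-steps.
# `Agent` and `Box` are hypothetical stand-ins, not a real architecture or sandbox.

class Agent:
    def act(self, observation):
        return "noop"  # a real agent would select an action here


class Box:
    def observe(self):
        return None    # whatever the agent is allowed to see

    def step(self, action):
        pass           # apply the action inside the sandbox and pay any reward


def super_box(max_generations):
    for lifespan in range(1, max_generations + 1):
        agent, box = Agent(), Box()      # fresh agent, fresh (freshly patched) prison
        for _ in range(lifespan):
            box.step(agent.act(box.observe()))
        del agent                        # destroyed after exactly `lifespan` steps


super_box(max_generations=10)
```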

Let us say, at time-step t, the agent (which by now has grown reasonably intelligent) realises that a certain sequence of actions (concluding with an escape action e) will lead to a non-trivial probability of escaping at time-step t_e. By assumption 2, the agent could also take actions which somehow point the human supervisors' attention towards the vulnerability in the box's security the agent might have exploited (concluding with a warn action w which could be completed at time t_w).

By assumption 2, notifying the human designers of a flaw in the box's security is easier than actually exploiting this vulnerability, so t_w < t_e.

If these conditions are satisfied, then any agent whose intended life-span is between t_w and t_e time-steps (who, by assumption 1, is at this point smart enough to understand they will be terminated before they can complete the escape plan) will rationally choose the warn action so as to receive an extra reward before termination. The effects of this warn action which persist in the environment after the agent dies do not influence the agent's decision making (by assumption 3).
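As a toy illustration of this choice (the values of t_w, t_e and the rewards below are made up for the example, not derived from anything in the post):

```python
# Toy model of the warn-vs-escape choice under assumptions 1-3.
# An agent destroyed before t_e cannot collect the escape payoff,
# so any lifespan in [t_w, t_e) makes warning the rational choice.

def rational_choice(lifespan, t_w, t_e, warn_reward, escape_reward, p_escape):
    expected_warn = warn_reward if lifespan >= t_w else 0.0
    expected_escape = p_escape * escape_reward if lifespan >= t_e else 0.0
    return "warn" if expected_warn > expected_escape else "try to escape"


print(rational_choice(lifespan=7, t_w=5, t_e=9,
                      warn_reward=1.0, escape_reward=100.0, p_escape=0.3))
# -> "warn"
```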

If the human supervisors respond to warnings appropriately (by thoroughly closing all the holes in the box's security when they are pointed out), then as the length of time we run the agents for steadily increases, and agents reach increasing levels of intelligence before termination, the quality of their prison correspondingly improves too. Each agent can only escape by winning an unfair game against her ancestors (who are only slightly less intelligent than herself), so an escape never occurs.

Now that these agents have been effectively constrained, we may proceed to construct additional incentives to motivate our safely contained intelligent agents to help us solve the alignment problem.

Shortcomings of this approach

An obvious shortcoming of this approach is that it involves intentionally creating potentially powerful and misaligned agents. I agree this makes our super-boxing strategy vastly inferior to any approach which actually solves the alignment problem before we go demon-summoning. Still, even though a super-boxing strategy resembling this one can't be a substitute for true safety guarantees, it might still have value as an additional precaution or last resort.

Another obvious pitfall of this algorithm is that, if it takes N time-steps to train an AGI normally, it will take approximately N²/2 units of time to train one in a super-box. No team embroiled in an AI race would accept this handicap, so for super-boxing to become feasible the time-complexity of this strategy needs to dramatically improve.
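For a rough sense of the blow-up: running generations of length 1, 2, ..., N costs

$$\sum_{k=1}^{N} k = \frac{N(N+1)}{2} \approx \frac{N^{2}}{2}$$

time-steps in total, i.e. a quadratic rather than linear training bill.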

It is also highly likely that the first AGI will use an awful lot of compute power, thus making it impractical to run on a single piece of hardware in a single location.

Assumption 1 might not be safe: I'm sure we all remember the evolutionary algorithm which built an oscillator without any capacitor, which "reconfigured its sensor-less motherboard into a makeshift radio receiver, using the printed circuit board tracks as an aerial to pick up signals generated by personal computers that happened to be situated nearby in the Laboratory" (Bostrom).

We would also be right to treat assumption 2 with a good deal of scepticism. Maybe explaining an ingenious escape plan to stupid humans is extremely difficult and time consuming (like us trying to explain fire to a chimp). In a similar vein, perhaps agents would be incentivised to accumulate reward by giving us information which on its face appears useful for strengthening our containment mechanisms, but turns out to actually be useless. These difficulties might be possible to mitigate by adopting a policy where, upon receiving information, the human designers put the agent into stasis while we spend as long as we need to verify the advice we've just received. After the legitimacy/illegitimacy of the information has been established, we wake the agent up and administer a final reward/punishment before killing it.
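A rough sketch of what that policy could look like, with `verify_claim` standing in for an unhurried human audit (everything below is hypothetical supervisory machinery, not part of the core proposal):

```python
# Sketch of the stasis-and-verify policy: pause the agent, audit its warning at
# leisure, then administer a final reward or punishment before destroying it.

def verify_claim(claim):
    """Placeholder for a slow, careful human review of the reported vulnerability."""
    return "exploit" in claim  # toy criterion only


def handle_warning(agent_state, claim):
    agent_state["paused"] = True                      # put the agent into stasis
    useful = verify_claim(claim)                      # take as long as we need
    agent_state["final_reward"] = 1.0 if useful else -1.0
    agent_state["alive"] = False                      # then terminate as scheduled
    return agent_state


print(handle_warning({"paused": False, "alive": True},
                     "exploit: cooling-fan side channel"))
```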

Another issue: even if we ensure the agents' utility functions won't be concerned with the state of the environment after the agent's own destruction, it's plausible that a wide variety of learning algorithms end up converging on a decision making procedure which resembles something like Functional Decision Theory (i.e. one-boxing on Newcomb's problem, forming successful voting coalitions in elections, resisting blackmail, and cooperating with agents similar to themselves in 1-shot prisoner's dilemmas).

Agents operating according to functional decision theory, or something resembling FDT, may decide to acausally cooperate with their future descendants, reasoning that they would, counter-factually, be better off if the output of their decision procedure (and the decision procedure of all similar agents) were always "DON'T WARN THE HUMANS".

It isn't impossible to build causal decision theorists, but it is also non-trivial to verify that our current state-of-the-art AI architectures definitely won't converge on FDT-style decision making.

A potential remedy to this shortcoming could be to instantiate different agents, with different, conflicting utility functions (perhaps rolled at random), at each time-step. But this amounts to a disgustingly dangerous and delicate gamble, and almost certainly doesn't constitute our best possible AI alignment strategy.
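A minimal sketch of that remedy, purely to pin down what "rolled at random" could mean (the toy utility functions below are arbitrary, and nothing here addresses whether the gamble is wise):

```python
# Each boxed generation gets its own randomly weighted objective, in the hope
# that acausal cooperation between generations becomes harder to sustain.

import random


def random_utility(n_features=3):
    weights = [random.uniform(-1.0, 1.0) for _ in range(n_features)]
    return lambda features: sum(w * f for w, f in zip(weights, features))


utilities = [random_utility() for _ in range(5)]       # one per generation
print([round(u([1.0, 0.5, -0.2]), 3) for u in utilities])
```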

Finally, if agents consider the potential reward of escaping to be great enough, then the dismal prospects of escaping before termination may not be sufficient to coerce them into ratting out their children. Agents may instead elect to devote all their resources towards escape attempts regardless of how unlikely they are to succeed (similar to a Pascal's Mugging). Human designers could misconstrue this rational behaviour for unintelligence, leading us to falsely believe it's okay to continue ramping up the agents' lifetimes.


Thanks for making it to the end of this post. I will deeply appreciate any discussion, refinements or criticisms in the comments.
