What can go wrong with the following protocol for AI containment?

1. Keep the AI in a box and don’t interact with it.

Don’t have any conversations with it whatsoever.

I’ll explain how this would work in a second, but first some motivation.

2. Consider all the great scientific discoveries over the course of human history: Newtonian mechanics, relativity, algorithms for primality testing, and so on.

Could a “God” who is “bad at math” have created the universe in order to have us make these discoveries for him?

It sounds insane: surely any being with the power to create our universe would know how to check whether a number is prime.

3. But perhaps it isn’t insane. Consider that at this point in our history, we

a) do not know how to factor numbers efficiently.

b) can create rudimentary simulated worlds (e.g., World of Warcraft or your favorite MMO).

4. Here is how the scheme could work in more detail.

Imagine your typical World of Warcraft server, but each orc and human is controlled by a trained neural network of roughly the same complexity as the average human.

The simulated world does not have enough natural food for orc or human populations to thrive. The life of man and orc is, in the words of Hobbes, “poor, nasty, brutish, and short.”

But it does contain an ocean which washes up pebbles on the shore. Pebbles come in two types, red and black. The black pebbles have labels, randomly assigned from the set “0”, “1”, “2”, “3”, “4”, “5”, “6”, “7”, “8”, “9”, “x”, “=”. The red pebbles have strings of the same symbols written on them.

When you arrange the following sequence of pebbles

2x3=6

(all in black up to and including the “=”, and the last “6” is red)

they crack open, and inside is enough food to feed a village for a year.

On the other hand, 2x3=5 has no effect.

The longer the prime factorization, the more food is in the pebbles.

No other way to open the pebbles exists.

Once you have arranged 2x3=6, the next time you arrange it with new pebbles gives you 99% as much food; the next time after that, 99% of 99%, and so on.
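
For concreteness, here is a minimal sketch of what the pebble rule might look like in code. Every name and constant in it (PebbleOracle, base_food, the length-based payout) is an illustrative assumption about one way to implement the rule described above, not a specification of the actual simulation.

```python
# Illustrative sketch of the pebble rule: black pebbles spell a factorization
# such as "2x3=", a single red pebble carries the product, and correct
# arrangements release food. All names and constants are invented for illustration.
from collections import defaultdict


def parse_factors(black_symbols):
    """Read black pebbles like ['2', 'x', '3', '='] into the factor list [2, 3]."""
    expr = "".join(black_symbols)
    if not expr.endswith("="):
        return None
    parts = expr[:-1].split("x")
    if not parts or not all(p.isdigit() for p in parts):
        return None
    return [int(p) for p in parts]


class PebbleOracle:
    def __init__(self, base_food=1000.0, decay=0.99):
        self.base_food = base_food            # payout scale for a minimal factorization
        self.decay = decay                    # each repeat pays 99% of the previous payout
        self.times_opened = defaultdict(int)  # how often each product has been claimed

    def food_for(self, black_symbols, red_string):
        factors = parse_factors(black_symbols)
        if factors is None or not red_string.isdigit():
            return 0.0
        product = 1
        for f in factors:
            product *= f
        if product != int(red_string):
            return 0.0                        # e.g. "2x3=5" has no effect
        # Longer factorizations yield more food; repeats decay geometrically.
        length = sum(len(str(f)) for f in factors) + len(factors) - 1  # length of "2x3"
        reward = self.base_food * length * (self.decay ** self.times_opened[red_string])
        self.times_opened[red_string] += 1
        return reward


oracle = PebbleOracle()
print(oracle.food_for(list("2x3="), "6"))  # full reward
print(oracle.food_for(list("2x3="), "6"))  # 99% of the previous reward
print(oracle.food_for(list("2x3="), "5"))  # 0.0: wrong product, no effect
```

The sketch only checks that the product is right; one could additionally require each black factor to be prime, which matches the “prime factorization” framing above.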

5. Societies in this simulated world will devote a lot of effort to figuring out the prime factorizations of the red numbers which wash up on the shore.

We program things so that, over time, the numbers on the red pebbles which wash up on the shore get longer and longer.

Since computers are so much faster than humans, and the orcs/humans in this world are about as smart as we are, we might let this society run for a few million simulated human-years before examining the data dump from it. If a polynomial-time factoring algorithm exists, they will likely have found it.

6. Given more difficult problems we want solved, we can imagine other ways of embedding them into the physics of the simulated world. Consider, for example, how the Schrödinger equation is embedded into the fabric of our universe.

7. My question is: what can go wrong?

I can think of a couple of potential issues and am wondering whether I missed anything.

8. One problem is, of course, that someone might misuse the technology, running this experiment while breaking the protocol and telling the simulated people facts about our world. This is a generic problem with any protocol for any problem (bad things happen if you don’t follow it).

A related problem is that we might have activists who advocate for the rights of simulated man/orc; such activists might interrupt one of the experiments halfway through and break the protocol.

We might lessen the risk of this by, say, making each experiment take less than one second of real time. Given that the top supercomputer in the world currently runs at on the order of 10^16 floating-point operations per second, and projecting that into the future, one second might equal millions of years of simulated time.
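
To make the “millions of years per second” figure concrete, here is a back-of-envelope calculation. Every input (agent count, per-agent simulation cost) is a placeholder assumption rather than an estimate I am defending.

```python
# Back-of-envelope check of "one real second = millions of simulated years".
# Every number below is a placeholder assumption, not a claim about real
# hardware or about the true cost of simulating a mind.
SECONDS_PER_YEAR = 3.15e7


def required_flops(sim_years_per_real_second, num_agents, flops_per_agent_second):
    """Sustained FLOPS a machine would need to run the simulation that fast."""
    simulated_seconds = sim_years_per_real_second * SECONDS_PER_YEAR
    return simulated_seconds * num_agents * flops_per_agent_second


# Hypothetical inputs: one million simulated years per real second, ten thousand
# agents, 10^12 floating-point operations per agent per simulated second.
print(f"{required_flops(1e6, 1e4, 1e12):.2e}")  # on the order of 3e29 flops
# Against the ~10^16 flops quoted above for today's top machine, the
# one-second-per-experiment idea clearly depends on substantial hardware growth.
```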

This is actually a good idea also for the following reason: the simulated people might realize that they are in a simulated world.

They might spend a lot of effort begging us to give them immortality or more pebbles. We might give in if we heard them. Best to make it impossible for us to hear their pleas.

They will eventually cease trying to contact us once they realize no one is listening.

Ethically, I think one could justify all this. It is hard to argue, for example, that we (real human beings) have been harmed by being brought into existence in a universe without a God who is listening; almost all of us would prefer to be alive rather than not. The same would go for them: surely their simulated existence, imperfect as it is, is not worse than not having been brought into the world in the first place?

They might become discouraged once they receive no answers from us, but I think this is not inevitable. Suppose, for example, you come to believe the real world has been created for the purpose of solving math problems; most of us would be shaken by this but likely would go on to live our lives as before.

Anyway, if we wanted to avoid all this we might consider more subtle ways of embedding problems into the rules of their world.

9. Perhaps the most serious problem would be a bug in our code for the simulation.

In the worst case, they might be able to come up with some kind of exploit that lets them see anything stored on the computer running the simulation, in particular the source code of their universe. One might imagine they would even be able to modify said source code.

One possible way to address this would be to write the simulation with the aid of some kind of program-checking software that ensures things like this cannot happen. Already, people are playing around with tools that enable you to write code which comes with a guarantee that the program will not enter certain states.
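
For what it is worth, here is a toy sketch of the weaker, runtime version of that idea: checking an invariant on every simulation step and halting if it is ever violated. The actual proposal is stronger, proving before the run (with formal verification tools) that the invariant can never be violated; all names and bounds below are made up for illustration.

```python
# Toy runtime version of "the program never enters certain states": check an
# invariant after every tick and halt if it fails. Formal program verification
# would prove this before the simulation ever runs; this only shows the shape
# of the invariant. All names and bounds are illustrative.
from dataclasses import dataclass


@dataclass
class WorldState:
    tick: int
    agent_count: int
    pebble_energy: float  # total "food" currently locked inside unopened pebbles


def invariant(state: WorldState) -> bool:
    """Example physical-law invariants: no negative energy, bounded population."""
    return state.pebble_energy >= 0.0 and 0 <= state.agent_count <= 10_000_000


def step(state: WorldState) -> WorldState:
    # ... one tick of simulated physics would go here ...
    new_state = WorldState(state.tick + 1, state.agent_count, state.pebble_energy)
    if not invariant(new_state):
        raise RuntimeError("physical law violated; halt and inspect, never continue")
    return new_state


state = step(WorldState(tick=0, agent_count=1_000, pebble_energy=5_000.0))
```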

Still, even if a bug were to occur, it seems it would be quite difficult for them to take over the world, though perhaps not impossible. I don’t think they could obtain knowledge of the hardware they are running on, only the software (people who know computer architecture, correct me if I’m wrong?). And they don’t know the physics of our world. If the computer which runs the simulation is not connected to any network (say it is stored underground in a soundproof vault), I’m at a loss to imagine how they would take over the world.

Edit: note that the last paragraph was a parenthetical remark; the real solution is to use program checking to ensure that no violation of the physical laws of their world happens.

10. Why would we do any of this rather than, say, run mind uploads at many times the speed of ordinary humans? Because an uploaded mind knows quite a bit about human psychology and the physics of our world, and could potentially use this knowledge in ways harmful to us.

11. One can view this scheme as a kind of genetic algorithm (in a sense, it is a genetic algorithm).

12. Finally: many people have written about friendly AI, and I’ve not been able to read everything (or even most) of what has been written. Apologies if something like this has been discussed at length before; if it has, I’d appreciate a pointer to the relevant discussions.

The most relevant thing I’ve seen here is That Alien Message, which is really close to item 9, but it still imagines interaction with the AI in the box, giving the AI an opportunity to convince us to let it use the internet.

Update: Let me say something about why the simulated beings will not be able to figure out the physical laws of our world (though they might have a range of plausible guesses about it).

Imagine you live in a world governed by Newtonian mechanics: every experiment you do is perfectly, 100% explained by Newton’s laws. You come to believe you live in a simulation intended to force you to solve second-order differential equations.

What can you deduce about the physics of the real world?

Answer: nothing. Newton’s laws are the only information you have. And that is just not enough information.

Perhaps the real world is governed by Newton’s laws just like yours. Or maybe it has general relativity (you are smart enough to figure out that is a possibility). Or maybe it is governed by F = mx‴. Electromagnetism might exist or it might not. Ditto for quantum mechanics. The real world could be discrete or continuous. How big could g, G, and Planck’s constant be in the real world? Is there another force like gravity which acts on roughly the same scale? In fact, the possibilities are endless.

You just don’t know.

Now suppose you are actually a superintelligence. You are really, really smart. What can you infer about the laws of physics which govern the real world?

The answer, I believe, is the same as above.

A superintelligence may be smart, but its abilities are not without limit. It cannot solve a problem for which there is not enough information.

Consider that World of Warcraft feels the same to an orc regardless of whether it runs on a PC, a Mac, or a quantum computer. Just experiencing a world with Newton’s laws will not reveal any information about the hardware your universe is running on.

I do acknowledge, however, that the superintelligence might come up with a large list of possible guesses about what our physics could be.

Update 2: If it sounds to you like the AIs might take over our world with this scheme, consider the following scenario.

Suppose that we terminate the simulation after a million years. We have a particular second-order differential equation in mind before starting the simulation. After termination, every record of the simulated world is destroyed, except for any instance in which someone has written a solution to this specific differential equation on a piece of paper. This is done automatically by the code. An automatic checker differentiates each candidate solution and verifies that it actually solves the equation. If it does, the code shows us the solution; otherwise, it deletes it along with the rest of the record of the world.

The output is either nothing, or a solution to the differential equation we wanted solved.

(Of course, this is a toy scenario; more realistically, we would want a solution to a more difficult scientific problem.)
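
As a sketch of what that automatic filter might look like, the code below keeps only candidate solutions to a fixed toy equation (y'' + y = 0), verifies each one numerically, and outputs the first verified solution or nothing at all. The particular checker, tolerances, and equation are assumptions chosen for illustration, not the verification machinery one would actually use.

```python
# Sketch of the output filter: everything from the run is discarded except
# candidate solutions to one pre-chosen equation, here the toy ODE y'' + y = 0.
# The checker differentiates each candidate numerically and tests the residual.
# Tolerances, sample points, and the equation itself are illustrative choices.
import math


def residual(candidate, t, h=1e-4):
    """Approximate |y''(t) + y(t)| for the candidate function via central differences."""
    second_derivative = (candidate(t + h) - 2.0 * candidate(t) + candidate(t - h)) / (h * h)
    return abs(second_derivative + candidate(t))


def is_solution(candidate, tol=1e-4):
    return all(residual(candidate, t) < tol for t in (0.0, 0.5, 1.0, 2.0, 3.0))


def filter_output(candidates):
    """candidates: the only records kept from the run; everything else is already gone."""
    for candidate in candidates:
        if is_solution(candidate):
            return candidate  # release only a verified solution
    return None               # otherwise the experiment outputs nothing


# filter_output([math.exp, math.sin]) returns math.sin, since sin'' + sin = 0,
# while exp is rejected because exp'' + exp = 2*exp is nowhere near zero.
print(filter_output([math.exp, math.sin]))
```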

Assuming no bugs in the code, can you give an account of how a superintelligence could take over the real world in this scenario?