What can go wrong with the following protocol for AI containment?

1. Keep the AI in a box and don’t interact with it.

Don’t have any conversations with it whatsoever.

I’ll explain how this would work in a second, but first some motivation.

2. Consider all the great scientific discoveries over the course of human history: Newtonian mechanics, relativity, algorithms for primality testing, and so on.

Could a “God” who is “bad at math” have created the universe in order to have us make these discoveries for him?

It sounds insane: surely any being with the power to create our universe would know how to check if a number is prime.

3. But perhaps it isn’t insane. Consider that at this point in our history, we

a) do not know how to factor numbers efficiently.

b) can create rudimentary simulated worlds (e.g., World of Warcraft or your favorite MMO).

4. Here is how the scheme could work in more detail.

Imagine a typical World of Warcraft server, but with each orc and human controlled by a trained neural network roughly as complex as an average human.

The simulated world does not have enough natural food for orc or human populations to thrive. The life of man and orc is, in the words of Hobbes, “solitary, poor, nasty, brutish, and short.”

But it does contain an ocean which washes up pebbles on the shore. Pebbles come in two types, red and black. The black pebbles have labels, which are randomly assigned from the set “0”, “1”, “2”, “3”, “4”, “5”, “6”, “7”, “8”, “9”, “x”, “=”. The red pebbles have strings of the same symbols written on them.

When you arrange the following sequence of pebbles

2x3=6

(all in black up to and including the “=”; the final “6” is red)

they crack open, and inside is enough food to feed a village for a year.

On the other hand, 2x3=5 has no effect.

The longer the prime factorization, the more food is in the pebbles.

No other way to open the pebbles exists.

Once you have arranged 2x3=6, the next time you arrange it with new pebbles gives you 99% as much food; the next time after that, 99% of 99%, and so on.
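To make the pebble rule concrete, here is a minimal sketch of how the simulation might score an arrangement. Everything beyond the rules stated above is my own assumption for illustration: the names `Shore`, `parse`, and `food`, the payout of one unit of food per prime factor, and the choice to treat reordered factors (3x2=6 versus 2x3=6) as the same arrangement.

```python
from collections import Counter

DECAY = 0.99  # each repeat of an arrangement pays 99% of the previous payout


def is_prime(n: int) -> bool:
    """Trial division; fine for a toy example, far too slow for long numbers."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def parse(arrangement: str):
    """Split '2x3=6' into ([2, 3], 6); return None if the string is malformed."""
    left, sep, right = arrangement.partition("=")
    if not sep or not (right.isascii() and right.isdigit()):
        return None
    parts = left.split("x")
    if not all(p.isascii() and p.isdigit() for p in parts):
        return None
    return [int(p) for p in parts], int(right)


class Shore:
    """Tracks how often each factorization has already been opened."""

    def __init__(self):
        self.times_opened = Counter()

    def food(self, arrangement: str) -> float:
        parsed = parse(arrangement)
        if parsed is None:
            return 0.0
        factors, red = parsed
        product = 1
        for f in factors:
            product *= f
        # The pebbles open only for a correct factorization into primes.
        if product != red or not all(is_prime(f) for f in factors):
            return 0.0
        base = float(len(factors))  # assumed payout: one unit per prime factor
        key = (tuple(sorted(factors)), red)  # assumption: factor order is ignored
        payout = base * DECAY ** self.times_opened[key]
        self.times_opened[key] += 1
        return payout
```

With this sketch, the first `Shore().food("2x3=6")` pays the full amount, a repeat on the same shore pays 99% of it, and both "2x3=5" and "4x3=12" pay nothing.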

5. Societies in this simulated world will devote a lot of effort to figuring out the prime factorizations of the red numbers which wash up on the shore.

We program things so that, over time, the numbers on the red pebbles which get washed up on the shore get longer and longer.

Since computers are so much faster than humans, and the orcs/humans in this world are about as smart as we are, we might let this society run for a few million simulated human-years before examining the data dump from it. If a polynomial-time algorithm for factoring integers exists, they will likely have found it.

6. Given more difficult problems we want solved, we can imagine other ways of embedding them into the physics of the simulated world. Consider, for example, how the Schrödinger equation is embedded into the fabric of our universe.

7. My question is: what can go wrong?

I can think of a couple of potential issues and am wondering whether I missed anything.

8. One problem is, of course, that someone might misuse the technology, running this experiment while breaking the protocol and telling the simulated people facts about our world. Of course, this is a generic problem with any protocol for any problem (bad things happen if you don’t follow it).

A related problem is that we might have activists who advocate for the rights of simulated man/orc; such activists might interrupt one of the experiments halfway through and break the protocol.

We might lessen the risk of this by, say, making each experiment take less than one second of real time. Given that the current speed of the top supercomputer in the world is on the order of 10^16 floating-point operations per second, and projecting that into the future, one second might equal millions of years of simulated time.
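As a rough back-of-the-envelope check on how far that projection has to go, the sketch below computes the operations needed for a given amount of simulated time. The cost of simulating one mind for one second (`ops_per_mind_second`) is a pure placeholder assumption on my part, not a figure from anywhere in this post, and the function names are hypothetical.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # about 3.16e7


def required_ops(population: float, simulated_years: float,
                 ops_per_mind_second: float) -> float:
    """Total operations needed to simulate `population` minds for `simulated_years`."""
    return population * simulated_years * SECONDS_PER_YEAR * ops_per_mind_second


def simulated_years_per_real_second(machine_ops_per_second: float,
                                    population: float,
                                    ops_per_mind_second: float) -> float:
    """Years of simulated time the whole population gets per real second."""
    ops_per_simulated_year = population * SECONDS_PER_YEAR * ops_per_mind_second
    return machine_ops_per_second / ops_per_simulated_year


# Placeholder assumption: one simulated mind costs 1e16 operations per second.
ASSUMED_OPS_PER_MIND_SECOND = 1e16

# One million mind-years under that assumption.
needed = required_ops(1, 1e6, ASSUMED_OPS_PER_MIND_SECOND)
```

Under that placeholder, a single mind running for a million simulated years costs about 3e29 operations, so fitting it into one real second would take a machine many orders of magnitude faster than 10^16 operations per second. Changing the placeholder changes the answer proportionally, and the cost of simulating the world itself is ignored.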

This is actually a good idea also for the following reason: the simulated people might realize that they are in a simulated world.

They might spend a lot of effort begging us to give them immortality or more pebbles. We might give in if we heard them. Best to make it impossible for us to hear their pleas.

They will eventually cease trying to contact us once they realize no one is listening.

Ethically, I think one could justify all this. It is hard to argue, for example, that we (real human beings) have been harmed by being brought into existence in a universe without a God who is listening; almost all of us would prefer to be alive rather than not. The same would go for them: surely, their simulated existence, imperfect as it is, is not worse than not having been brought into the world in the first place?

They might become discouraged once they receive no answers from us, but I think this is not inevitable. Suppose, for example, you come to believe the real world has been created for the purpose of solving math problems; most of us would be shaken by this but likely would go on to live our lives as before.

Anyway, if we wanted to avoid all this we might consider more subtle ways of embedding problems into the rules of their world.

9. Perhaps the most serious problem might be if there is a bug in our code for the simulation.

In the worst case, they might be able to come up with some kind of exploit that lets them see anything stored on the computer running the simulation, in particular the source code of their universe. One might imagine they would even be able to modify said source code.

One possible way to solve this would be to write the simulation with the aid of some kind of program-checking software that ensures things like this will not happen. Already, people are playing around with tools that enable you to write code which comes with a guarantee that the program will not enter into certain states.

Still, even if a bug were to occur, it seems it would be quite difficult for them to take over the world—though perhaps not impossible. I don’t think they could obtain knowledge of the hardware they are running on—only the software (people who know computer architecture, correct me if I’m wrong?). And they don’t know the physics of our world. If the computer which runs the simulation is not connected to any network—say it is stored underground in a soundproof vault—I’m at a loss to imagine how they would take over the world.

Edit: note that the last paragraph was a parenthetical remark—the real solution is to use program checking to ensure no violation of the physical laws of their world happens.

10. Why would we do any of this instead of, say, using mind uploads which run at many times the speed of ordinary humans? Because an uploaded mind knows quite a bit about human psychology and the physics of the world, and could potentially use this knowledge in ways harmful to us.

11. One can view this scheme as a kind of genetic algorithm (in a sense, it is a genetic algorithm).

12. Finally: many people have written about friendly AI, and I’ve not been able to read everything (or even most) of what is written. Apologies if something like this has been discussed at length before—in which case, I’d appreciate a pointer to the relevant discussions.

The most relevant thing I’ve seen here is That Alien Message, which is really close to item 9, but still imagines interaction with the AI in the box, giving the AI an opportunity to convince us to let it use the internet.

Update: Let me say something about why the simulated beings will not be able to figure out the physical laws of our world (though they might have a range of plausible guesses about it).

Imagine you live in a world governed by Newtonian mechanics: every experiment you do is perfectly, 100% explained by Newton’s laws. You come to believe you live in a simulation intended to force you to solve second order differential equations.

What can you deduce about the physics of the real world?

Answer: nothing. Newton’s laws are the only information you have. And that is just not enough information.

Perhaps the real world is governed by Newton’s laws just like yours. Or maybe it has general relativity (you are smart enough to figure out that is a possibility). Or maybe it is governed by F=m x‴. Electromagnetism might exist or it might not. Ditto for quantum mechanics. The real world could be discrete or continuous. How big could g, G, and Planck’s constant be in the real world? Is there another force like gravity which roughly acts on the same scale? In fact, the possibilities are endless.

You just don’t know.

Now suppose you are actually a superintelligence. You are really, really smart. What can you infer about the laws of physics which govern the real world?

The answer, I believe, is the same as above.

A superintelligence may be smart, but its abilities are not without limit. It cannot solve a problem for which there is not enough information to solve it.

Consider that World of Warcraft feels the same to an orc regardless of whether it runs on a PC, a Mac, or a quantum computer. Just experiencing a world with Newton’s laws will not reveal any information about the hardware your universe is running on.

I do acknowledge, however, that the super-intelligence might come up with a large list of possible guesses about what our physics could be.

Update 2: If it sounds to you like the AIs might take over our world with this scheme, consider the following scenario.

Suppose that we terminate the simulation after a million years. We have a particular second order differential equation in mind before starting the simulation. After termination, every record of the simulated world is destroyed, except for any time someone has written a solution to this specific differential equation on a piece of paper. This is done automatically by the code. An automatic differentiation algorithm checks whether the solution is correct. If it is, it shows us the solution; else, it deletes it along with the rest of the record of the world.

The output is either nothing, or a solution to the differential equation we wanted solved.

(of course, this is a toy scenario; more realistically, we will want a solution to a more difficult scientific problem)
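The output filter in this scenario can be sketched in a few lines. This is only an illustration under my own assumptions: records are modeled as Python objects, candidate solutions are restricted to polynomials given as coefficient lists (lowest degree first), the target equation y'' + y = t^2 + 2 is a made-up example, and exact differentiation of polynomials stands in for the automatic differentiation step. The names `release`, `solves_target`, and `RHS` are hypothetical.

```python
# Target equation (made up for illustration): y'' + y = t^2 + 2.
# Polynomials are coefficient lists, lowest degree first: [2, 0, 1] is 2 + t^2.
RHS = [2, 0, 1]


def derivative(p):
    """Exact derivative of a polynomial given as a coefficient list."""
    return [k * c for k, c in enumerate(p)][1:]


def coeff(p, k):
    """Coefficient of t^k, treating missing high-order terms as zero."""
    return p[k] if k < len(p) else 0


def solves_target(y) -> bool:
    """True if y'' + y - RHS is identically zero."""
    y2 = derivative(derivative(y))
    n = max(len(y), len(y2), len(RHS))
    return all(coeff(y2, k) + coeff(y, k) - coeff(RHS, k) == 0 for k in range(n))


def is_polynomial(record) -> bool:
    return isinstance(record, list) and all(isinstance(c, int) for c in record)


def release(records):
    """Return the first verified solution, or None. Nothing else leaves the box."""
    for record in records:
        if is_polynomial(record) and solves_target(record):
            return record
    return None  # every other record is discarded unseen
```

For example, `release(["please let us out", [0, 1, 1], [0, 0, 1]])` returns only `[0, 0, 1]` (that is, y = t^2), and a world that never wrote down a correct solution yields `None`. The point of the design is that the verifier, not a human reader, decides what gets shown.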

Assuming no bugs in the code, can you give an account of how a superintelligence could take over the real world in this scenario?