Safely and usefully spectating on AIs optimizing over toy worlds

Consider an AI that is trying to achieve a certain result in a toy world running on a computer. Compare two models of what the AI is and what it's trying to do. On the first model, the AI is a physical program on a computer, trying to cause the physical computer that the toy world runs on to enter a certain state. On the second model, the AI is an abstract computational process trying to achieve certain results in another abstract computational process (the toy world) that it interfaces with.

On the first model, if the AI is clever enough, it might figure out how to manipulate the outside world, for instance by hacking into other computers to gain more computing power. On the second model, the outside world is irrelevant to the AI's interests: changing what runs on particular physical computers in the real world has no effect on the idealized computational process the AI is optimizing over, so the AI has no incentive to optimize over our world.
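To make the second model concrete, here is a minimal sketch in Python. All the names (`ToyWorld`, `score_state`, `optimize`) are hypothetical illustrations rather than any existing system; the point is only that the objective function is defined purely over the abstract state of the toy-world simulation, so nothing about the physical machine the optimizer runs on, or the wider world, appears anywhere in what it is trying to improve.

```python
# Sketch of the second model: the optimizer's objective refers only to the
# abstract state of a toy-world simulation. Hypothetical names throughout.
from dataclasses import dataclass, field
import random


@dataclass
class ToyWorld:
    """An idealized, self-contained computational process."""
    state: dict = field(default_factory=lambda: {"x": 0.0})

    def step(self, action: float) -> None:
        # Toy dynamics: actions nudge the state.
        self.state["x"] += action


def score_state(state: dict) -> float:
    """Objective defined only in terms of the toy world's abstract state."""
    return -abs(state["x"] - 10.0)  # want x close to 10


def optimize(world: ToyWorld, steps: int = 100) -> float:
    """Greedy random search over toy-world actions; nothing outside the
    ToyWorld object ever enters the objective."""
    best = score_state(world.state)
    for _ in range(steps):
        action = random.uniform(-1.0, 1.0)
        candidate = ToyWorld(state=dict(world.state))
        candidate.step(action)
        if score_state(candidate.state) > best:
            world.step(action)
            best = score_state(world.state)
    return best
```

By its own lights, a physically implemented version of this optimizer gains nothing from affecting anything outside the `ToyWorld` object it is handed.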

AIs for which the second model is more accurate seem generally safer than AIs for which the first model is more accurate. So trying to encourage AI development to follow the second model could help delay the development of dangerous AGI.

AIs following the second model are limited in some ways. For instance, they could not be used to figure out how to prevent the development of other dangerous AGI, since that requires reasoning about what happens in the real world.

But such AIs could still be quite useful for many things, such as engineering. To use an AI that optimizes over a toy world to design things that are useful in the real world, we could give the toy world physics and materials similar enough to ours that designs which work well in the toy world can be expected to work well in the real world too. We then take the designs the AI builds in the toy world and replicate them in the real world. If they don't work, we try to find the discrepancy between real-world physics and toy-world physics that accounts for the failure, fix the discrepancy, and try again.
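As a concrete illustration of this loop (a toy numerical example with made-up names, not a real design pipeline), the sketch below "designs" a projectile launch speed in a toy world whose gravity constant starts out slightly wrong, detects the failure when the design is replicated under the real constant, infers the discrepancy, patches the toy physics, and tries again.

```python
# Illustrative design/replicate/repair loop. The "design task" is deliberately
# trivial: choose a launch speed so a 45-degree projectile lands at a target
# distance. Nothing here is a real design AI.

TARGET_RANGE = 50.0   # metres
REAL_GRAVITY = 9.81   # stands in for real-world physics


def projectile_range(speed: float, gravity: float) -> float:
    """Range of a 45-degree launch under the given gravity."""
    return speed ** 2 / gravity


def design_in_toy_world(toy_gravity: float) -> float:
    """The toy-world optimizer: solve for the speed that hits the target
    under the toy world's (possibly wrong) physics."""
    return (TARGET_RANGE * toy_gravity) ** 0.5


def replicate_in_real_world(speed: float) -> float:
    """Build the design in the real world and measure the outcome."""
    return projectile_range(speed, REAL_GRAVITY)


def design_until_it_works(toy_gravity: float, tolerance: float = 0.01) -> float:
    while True:
        speed = design_in_toy_world(toy_gravity)
        measured = replicate_in_real_world(speed)
        if abs(measured - TARGET_RANGE) < tolerance:
            return speed
        # The design failed: infer the physics discrepancy that accounts
        # for it, fix the toy world, and try again.
        toy_gravity = speed ** 2 / measured


print(design_until_it_works(toy_gravity=9.0))
```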

One way this strategy could go catastrophically wrong is if the design the AI comes up with contains an agent. If the AI designs this agent to figure out what sort of world it is in, rather than hard-coding it to believe it is in the toy world the original AI cares about, then in the toy world the agent would work out the toy world's physics and do something sensible, making it look like the design should work when we simulate it. But when we replicate the design in the real world, the agent that gets built along with it notices that it is in a big world containing humans that can be manipulated, computers that can be hacked, and so on, and does those things instead of acting as expected.

This problem could be addressed by designing the AI so that it does not come up with solutions that involve creating agents, or by figuring out how to reliably detect agents in designs, so that we know to reject them. An alternative approach would be to design the AI so that it can create agents, but only agents that share the property of directing their optimization only towards the toy world and only building agents with that same property.
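One way to picture the "closed under the property" variant is as a recursive acceptance check over a design: every agent in it must optimize only over the toy world, and every agent it could build must pass the same check. The sketch below is purely illustrative; the hard part is the `optimizes_only_over_toy_world` judgment itself, which is assumed here as a boolean field rather than implemented.

```python
# Recursive acceptance check for the "only builds agents with the same
# property" condition. Design, Agent, and the boolean field standing in for
# the property judgment are all hypothetical.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Agent:
    optimizes_only_over_toy_world: bool
    buildable_agents: List["Agent"] = field(default_factory=list)


@dataclass
class Design:
    agents: List[Agent] = field(default_factory=list)


def agent_is_acceptable(agent: Agent) -> bool:
    """An agent is acceptable if it only optimizes over the toy world and
    every agent it could build is acceptable in the same sense."""
    return agent.optimizes_only_over_toy_world and all(
        agent_is_acceptable(child) for child in agent.buildable_agents
    )


def design_is_acceptable(design: Design) -> bool:
    """Reject any design containing an agent that breaks the property."""
    return all(agent_is_acceptable(agent) for agent in design.agents)
```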

Designing better AIs is itself an engineering task that an AI along these lines could be used for. Since in this case the explicit purpose is to create agents, it requires a solution to the problem of agents acting in unintended ways that does not consist of simply never creating agents. Instead, we would want to formalize the notion of optimizing only over toy models, so that it can be used as a design constraint on the AIs we ask our AI to design. If we can do this, it would give us a possible route to a controlled intelligence explosion, in which the AI designs a more capable successor because that is the task it has been assigned, rather than for instrumental reasons, and humans can inspect the result and decide whether or not to run it.
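A workflow-level sketch of that last idea might look something like the following, with `verify_constraint` standing in for a formal check of the toy-world-only property and the approval callback standing in for human inspection; both are hypothetical placeholders rather than solved problems.

```python
# Sketch of one controlled improvement step: the current AI designs a
# successor because that is its assigned task, the formalized constraint is
# checked, and a human decides whether the result ever runs. All names are
# hypothetical stand-ins.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class CandidateAI:
    description: str
    optimizes_only_over_toy_worlds: bool  # assumed to be formally certified


def verify_constraint(candidate: CandidateAI) -> bool:
    """Placeholder for machine-checking the formalized design constraint."""
    return candidate.optimizes_only_over_toy_worlds


def controlled_improvement_step(
    design_successor: Callable[[], CandidateAI],
    human_approves: Callable[[CandidateAI], bool],
) -> Optional[CandidateAI]:
    """Return the successor only if it passes the constraint check and a
    human reviewer decides it should be run; otherwise return None."""
    candidate = design_successor()
    if not verify_constraint(candidate):
        return None
    if not human_approves(candidate):
        return None
    return candidate
```

The successor only ever runs after both the formal check and the human decision, which is the sense in which the improvement process stays controlled.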