Safely and usefully spectating on AIs optimizing over toy worlds

Consider an AI that is trying to achieve a certain result in a toy world running on a computer. Compare two models of what the AI is and what it’s trying to do: first, you could say the AI is a physical program on a computer, which is trying to cause the physical computer that the toy world is running on to enter a certain state. Alternatively, you could say that the AI is an abstract computational process which is trying to achieve certain results in another abstract computational process (the toy world) that it is interfacing with.
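To make the contrast concrete, here is a minimal sketch of the second model, with all names and the toy dynamics invented purely for illustration: the optimizer’s objective is defined only as a function of an abstract toy-world state, and it interacts with that world only through a `step`/`score` interface, so nothing in its objective refers to the physical machine it happens to run on.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyWorldState:
    # Abstract state of the toy world; nothing here refers to real hardware.
    position: int


def step(state: ToyWorldState, action: int) -> ToyWorldState:
    # Toy-world dynamics: the action shifts the position.
    return ToyWorldState(state.position + action)


def score(state: ToyWorldState) -> float:
    # The objective is a function of the abstract toy-world state only.
    return -abs(state.position - 10)


def optimize(initial: ToyWorldState, rounds: int = 1000) -> list:
    # Random search over action sequences, judged purely by the toy-world score.
    best_actions, best_score = [], float("-inf")
    for _ in range(rounds):
        actions = [random.choice([-1, 0, 1]) for _ in range(20)]
        state = initial
        for a in actions:
            state = step(state, a)
        if score(state) > best_score:
            best_actions, best_score = actions, score(state)
    return best_actions
```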

On the first view, if the AI is clever enough, it might figure out how to manipulate the outside world, for instance by hacking into other computers to gain more computing power. On the second view, the outside world is irrelevant to the AI’s interests: changing what runs on particular physical computers in the real world has no effect on the idealized computational model the AI is optimizing over, so the AI has no incentive to optimize over our world.

AIs for which the second model is more accurate seem generally safer than AIs for which the first model is more accurate. So trying to encourage AI development to follow the second model could help delay the development of dangerous AGI.

AIs following this model are limited in some ways. For instance, they could not be used to figure out how to prevent the development of other dangerous AGI, since this requires reasoning about what happens in the real world.

But such AIs could still be quite useful for many things, such as engineering. In order to use AIs optimizing over toy worlds to design things that are useful in the real world, we could make the toy world have physics and materials similar enough to our world that designs that work well in the toy world should be expected to also work well in the real world. We then take the designs the AI builds in the toy world and replicate them in the real world. If they don’t work in the real world, then we try to find the discrepancy between real-world physics and toy-world physics that accounts for it, fix the discrepancy, and try again.
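This design loop can be summarized in a short sketch. Everything here is hypothetical scaffolding rather than a real API: the AI, the toy-world simulation, the real-world test, and the physics-patching step are all passed in as callbacks.

```python
def design_loop(ai_propose_design, works_in_toy_world, works_in_real_world,
                diagnose_discrepancy, patch_toy_physics, toy_physics,
                max_rounds=10):
    # Repeatedly: get a design that succeeds in the toy world, replicate it in
    # the real world, and if it fails there, fold the physics discrepancy back
    # into the toy world and try again.
    for _ in range(max_rounds):
        design = ai_propose_design(toy_physics)
        if not works_in_toy_world(design, toy_physics):
            continue  # the design doesn't even work in the AI's own world
        if works_in_real_world(design):
            return design  # toy-world success transferred to reality
        discrepancy = diagnose_discrepancy(design, toy_physics)
        toy_physics = patch_toy_physics(toy_physics, discrepancy)
    return None
```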

One way this strategy could go catastrophically wrong is if the design the AI comes up with has an agent in it. If the AI designs this agent to figure out what sort of world it’s in, rather than hard-coding the agent to believe it’s in the toy world that the original AI cares about, then this agent could, in the toy world, figure out the toy world’s physics and do something sensible, making it look like the design should work when we simulate it in the toy world. But when we replicate the design in the real world, the agent that gets built with it notices that it’s in a big world with humans that can be manipulated, computers that can be hacked, and so on, and does those things instead of acting as expected.

This problem could be addressed by designing the AI in such a way that it does not come up with solutions that involve creating agents, or by figuring out how to reliably detect agents in designs, so that we know to reject those designs. An alternative approach would be to design the AI so that it can create agents, but only agents that share the property of directing their optimization only towards the toy world and of only building agents with that same property.

Designing better AIs is an engineering task that an AI along these lines could be used for. Since the explicit purpose in this case is to create agents, it requires a solution to the problem of agents acting in unintended ways that does not consist of simply never creating agents. Instead, we would want to formalize this notion of optimizing only over toy models, so that it can be used as a design constraint for the AIs that we’re asking our AI to design. If we can do this, it would give us a possible route to a controlled intelligence explosion, in which the AI designs a more capable successor AI because that is the task it has been assigned, rather than for instrumental reasons, and humans can inspect the result and decide whether or not to run it.
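If such a formal constraint existed, the controlled workflow might look something like the sketch below. The constraint check (`only_optimizes_toy_world`) is precisely the unsolved formalization problem, so it appears here only as a placeholder supplied by the caller, and all names are hypothetical.

```python
def controlled_successor_step(ai_design_successor, only_optimizes_toy_world,
                              human_approves, run_successor):
    # The AI designs a successor because that is its assigned task.
    candidate = ai_design_successor()
    # Gate on the (currently unformalized) property of only optimizing over
    # the toy world; reject anything that fails the check.
    if not only_optimizes_toy_world(candidate):
        return None
    # Humans inspect the verified design and decide whether to run it.
    if human_approves(candidate):
        return run_successor(candidate)
    return None
```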