Monk Treehouse: some problems defining simulation

When does one program simulate another? When does one program approximately simulate another?

In some contexts, these questions have concrete answers. In other contexts they feel about the same as the problem of functionalist triviality. When simulation comes up in contexts relevant to agent foundations, it tends to feel like the latter. Here, I outline some arguments which I’ve used in the past to show that a given concept of simulation doesn’t quite accomplish what it sets out to do.

For the sake of concreteness, I will give two problems which we can imagine solving with an appropriate definition of simulation.

First, consider the problem of taking counterfactuals in order to evaluate consequences of our actions. We may wish to seek copies of our own algorithm in the environment in order to counterfact on them (or to make a mere assumption that their output will be the same as ours). Here, I assume we’re trying to solve this problem by looking for sub-components (or properties) of some world model which meet some criterion; namely, forming a simulation of us, according to some formal definition. We’ve already decided on a program P which represents us (possibly more than one, but we take interest in one at a time), and we are comparing P to (our models of) the environment in some sense. We expect to find one copy of P — ourselves — and we are looking for others.

As a second example, suppose that we have in hand a mathematical description of an AI which we are considering building. Suppose that the AI has various safety and transparency properties, but all these properties are stated in terms of the agent’s beliefs and actions, where beliefs are stored in some specified format and actions are output via a specified channel. We are interested in knowing whether building the agent will lead to any “emergent subagents”, and if so, whether those agents will be aligned with the original agent (and with us), and what safety and transparency properties those agents might have. In pursuit of this goal, we are attempting to prove theorems about whether certain programs P representing subagents can be simulated on certain sub-components (or properties) of our AI.

A Strawman

Again, for the sake of concreteness, I’ll put forward a straw definition of simulation. Suppose the program P we’re searching for simulations of is represented as a Turing machine, and we enrich it with an intuitive set of counterfactuals (moving the head, changing the state, changing symbols anywhere on the tape; all of this at any time step). We thus get a causal graph (the program graph), in which we think of some vertices as input and some as output. I assume our model of the world (in the first example) or the AI (in the second) is similarly made up of components which have states and admit fairly straightforward local counterfactuals, such as removing certain groups of atoms or changing the contents of certain registers. We can thus think of the environment or the AI as a causal graph (the target graph). We claim some part of the target graph simulates the program graph if there is a mapping between the two which successfully represents all the counterfactuals. Since the target graph may be highly redundant, we allow single vertices in the program graph to correspond to collections of vertices in the target graph, and we similarly allow single states in the program graph to correspond to multiple possible states in the target graph. (Imagine a collection of voltages which all map to “0”.)
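
To make the strawman slightly more concrete, here is a minimal Python sketch of what checking such a definition might look like. Everything about it is my own illustrative choice rather than part of any actual proposal: the step/init interface for causal graphs, the dictionary format for interventions, and the way redundancy is handled by picking a representative target state.

```python
def counterfactual_run(step, init, interventions, horizon):
    """Run a causal graph forward, forcing intervened vertices into fixed states.
    interventions maps (time, vertex) -> forced state."""
    state = dict(init)
    for (time, vertex), forced in interventions.items():
        if time == 0:
            state[vertex] = forced
    history = [dict(state)]
    for t in range(1, horizon + 1):
        state = step(state)
        for (time, vertex), forced in interventions.items():
            if time == t:
                state[vertex] = forced
        history.append(dict(state))
    return history


def represents(vertex_map, state_map, prog_history, targ_history):
    """Check that each step of the target run encodes the corresponding
    program step under the given maps."""
    for prog_state, targ_state in zip(prog_history, targ_history):
        for v, sv in prog_state.items():
            allowed = state_map[(v, sv)]      # acceptable target states for sv
            for w in vertex_map[v]:           # possibly many redundant vertices
                if targ_state[w] not in allowed:
                    return False
    return True


def simulates(prog, targ, vertex_map, state_map, counterfactuals, horizon):
    """The straw criterion: every counterfactual on the program graph,
    translated through the maps, must be reproduced by the target graph."""
    for cf in counterfactuals:                # cf maps (time, vertex) -> state
        prog_hist = counterfactual_run(prog["step"], prog["init"], cf, horizon)
        targ_cf = {}
        for (t, v), sv in cf.items():
            for w in vertex_map[v]:
                # pick one representative target state for the forced value
                targ_cf[(t, w)] = next(iter(state_map[(v, sv)]))
        targ_hist = counterfactual_run(targ["step"], targ["init"], targ_cf, horizon)
        if not represents(vertex_map, state_map, prog_hist, targ_hist):
            return False
    return True
```

In this sketch the ordinary, non-counterfacted run is covered by including the empty intervention in the list of counterfactuals.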

This definition likely has many problems. For example, in a Turing machine virtually all causation has to move through the read/write head, and we have counterfactuals which can move the head. We probably don’t care if the environment implements the computation in a way which runs all causation through one point like this. This definition of simulation may also be trivial (in the Functionalist Triviality sense) when the target graph is large enough and rich enough; a target graph with enough counterfactuals will contain some substructure of the right shape to match the program graph. For the sake of this post, just imagine someone has taken this definition as a starting point and attempted to remove some obvious problems. (The arguments I’m presenting were originally replies to a couple of different proposals of this general type.)

Monk Treehouse

An order of monks has dedicated itself to calculating the program P. The monks live in a gigantic treehouse, and have constructed a large, abacus-like computer on which to run P. Large ceremonial stone weights are moved from one place to another in order to execute the algorithm. The weights are so heavy that moving any one of them could cause the treehouse to become unbalanced and fall down. In order to do their computation, the monks have come up with a balancing system. As a first example, I’ll suppose that these monks came up with a second program P’ when they first built the treehouse, and P’ is executed on a separate set of stone weights on the other side of the treehouse. The program P’ is different from P, but the monks were able to prove a theorem guaranteeing that the distributions of weights would balance.

Intuitively, the monks on one side of the treehouse are simulating the program P — it’s actually what they set out to do. But none of the counterfactuals which we want to take on the program graph map easily to the treehouse weights. If we move a weight on one side to correspond to editing a symbol on the Turing machine’s tape, then the treehouse falls over. If we move some weight on the other side to keep things balanced, the counterfactual may temporarily succeed, but the program P’ will stop obeying the monks’ foreordained guarantee; essentially P’ may get thrown off course and fail to counterbalance the next move correctly, destroying the treehouse. If we just add hidden weights to counterbalance our change, the hidden weights could become wrong later.
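
Here is a toy illustration of that entanglement, with every detail invented for the sake of the example: two columns of weights, one per side of the treehouse, which the monks’ theorem guarantees stay equal at each step. A “straightforward local counterfactual” that edits one column breaks the balance immediately; compensating on the other side would mean overriding P’ as well, which is exactly the entanglement being pointed at.

```python
# Toy model of the balancing constraint (all details invented for illustration).
# P and P' each move weights on their side; the monks' theorem guarantees the
# two columns stay equal at every step.

def run(p_steps, p_prime_steps, counterfactual=None):
    """counterfactual = (time, forced_left_total): a local edit on P's side."""
    left = right = 0
    for t, (dl, dr) in enumerate(zip(p_steps, p_prime_steps)):
        left += dl
        right += dr
        if counterfactual is not None and counterfactual[0] == t:
            left = counterfactual[1]          # edit a "tape symbol" on P's side
        if left != right:
            return "treehouse falls at step %d" % t
    return "balanced run complete"

p_steps       = [3, 1, 4, 1, 5]               # weight P moves at each step
p_prime_steps = [3, 1, 4, 1, 5]               # P' happens to balance it exactly

print(run(p_steps, p_prime_steps))                         # balanced run complete
print(run(p_steps, p_prime_steps, counterfactual=(2, 0)))  # treehouse falls at step 2
```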

When such a program P’ exists, the treehouse argument is basically pointing out that the “straightforward local counterfactuals” we put in the target graph were not enough; there can be “logical entanglements” which keep us from successfully pointing at the embedding of P.

Of course, we could build into our mapping between program graph and target graph that, corresponding to any counterfactual part of the program graph, we create a physical structure within the target graph which props up the tree. This is beginning to become suspicious, since the states used for counterfactuals look so different from the states corresponding to a non-counterfacted run of the program. One thing we want to avoid is having a notion of simulation which could consider an empty patch of air to be computing P, simply by mapping counterfactuals to a case where a computer suddenly appears. (Such a problem would be even worse than the usual functionalist triviality problem.) When we say that a state of some part of the Turing machine corresponds to some state in the target graph, we should mean the same thing whether we’re in a counterfactual part of the graphs or not.

The “computer from thin air” problem actually seems very similar to the Monk Treehouse problem. One way of trying to intuitively argue that the monks are executing the program P would be to suppose that all the monks switch to using lightweight tokens to do their computation instead of the heavy stones. This seems to be almost as thorough a change as making a counterfactual computer in empty air. Yet it seems to me that the monk treehouse is really computing P, whereas the air is not.

Suppose we have a definition of simulation which can handle this; that is, somehow the treehouse doesn’t fall over under counterfactuals, but the notion of simulation doesn’t build computers out of thin air. It seems to me there are still further complications along the same lines. These monks are good mathematicians, and have devoted their lives to the faithful execution of the program P. One can only imagine that they will be keeping careful records and double-checking that every stone is moved correctly. They will also look for any mathematical properties of the process which can be proven and used to double-check. When we attempt to map counterfactual Turing machine states to counterfactual stone positions, we will be pinned down by a complicated verification process. If we just move the stone, the monks will notice it and move it back; if we try to change the monks’ beliefs too, then they will notice contradictions in their mathematics, lose faith, and abandon the calculations.

What might work would be to find every single case where a monk calculates the stone position, both before and after the actual stone movement, and map counterfactuals on the program graph to a counterfactual where every single one of these calculations is changed. This likely has problems of its own, but it’s interesting as a suggestion that in order to find one simulation of the program P, you need to track down basically all simulations of P (or at least all the ones that might interact).
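
As a toy version of that suggestion (again with every detail invented), a counterfactual stone position only “sticks” if every record of the corresponding calculation is changed along with the stone itself; otherwise the monks’ double-checking undoes it.

```python
# Toy version of "change every calculation" (all structure invented for
# illustration).  At each step, a monk records what the stone position should
# be and moves the stone back if it disagrees with the record.

def run_with_records(program, counterfactual=None, patch_records=False):
    """program: intended stone position at each step.
    counterfactual = (time, forced_position)."""
    stone = None
    for t, intended in enumerate(program):
        stone = intended
        record = intended                     # the monk's own calculation
        if counterfactual is not None and counterfactual[0] == t:
            stone = counterfactual[1]
            if patch_records:
                record = counterfactual[1]    # counterfact on the calculation too
        if record != stone:
            stone = record                    # the monks notice and move it back
    return stone

program = [2, 7, 1, 8]
print(run_with_records(program))                                    # 8
print(run_with_records(program, counterfactual=(3, 5)))             # 8 (edit undone)
print(run_with_records(program, counterfactual=(3, 5), patch_records=True))  # 5 (edit sticks)
```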

To make my analogy clear: both emergent subagents within an AI and instances of one’s decision algorithm in the environment could be embedded in their surroundings in ways not easily teased out by “straightforward local counterfactuals”. Nonetheless, the notion of simulation here seems intuitive, and I can’t help but imagine there’s an appropriate definition somewhere just beyond the ones I’ve seen or tried. I’d be happy to look at any suggestions.