Monk Treehouse: some problems defining simulation

When does one program simulate another? When does one program approximately simulate another?

In some contexts, these questions have concrete answers. In other contexts they feel about the same as the problem of functionalist triviality. When simulation comes up in contexts relevant to agent foundations, it tends to feel like the latter. Here, I outline some arguments which I’ve used in the past to show that a given concept of simulation doesn’t quite accomplish what it sets out to do.

For the sake of concreteness, I will give two problems which we can imagine solving with an appropriate definition of simulation.

First, consider the problem of taking counterfactuals in order to evaluate consequences of our actions. We may wish to seek copies of our own algorithm in the environment in order to counterfact on them (or to make a mere assumption that their output will be the same as ours). Here, I assume we’re trying to solve this problem by looking for sub-components (or properties) of some world model which meet some criterion; namely, forming a simulation of us, according to some formal definition. We’ve already decided on a program P which represents us (possibly more than one, but we take interest in one at a time), and we are comparing P to (our models of) the environment in some sense. We expect to find one copy of P, namely ourselves, and we are looking for others.

As a second example, suppose that we have in hand a mathematical description of an AI which we are considering building. Suppose that the AI has various safety and transparency properties, but all these properties are stated in terms of the agent’s beliefs and actions, where beliefs are stored in some specified format and actions are output via a specified channel. We are interested in knowing whether building the agent will lead to any “emergent subagents”, and if so, whether those agents will be aligned with the original agent (and with us), and what safety and transparency properties those agents might have. In pursuit of this goal, we are attempting to prove theorems about whether certain programs P representing subagents can be simulated on certain sub-components (or properties) of our AI.

A Strawman

Again for the sake of concreteness, I’ll put forward a straw definition of simulation. Suppose the program P we’re searching for simulations of is represented as a Turing machine, and we enrich it with an intuitive set of counterfactuals (moving the head, changing the state, changing symbols anywhere on the tape; all of this at any time step). We thus get a causal graph (the program graph), in which we think of some vertices as input and some as output. I assume our model of the world (in the first example) or the AI (in the second) is similarly made up of components which have states and admit fairly straightforward local counterfactuals, such as removing certain groups of atoms or changing the contents of certain registers. We can thus think of the environment or the AI as a causal graph (the target graph). We claim some part of the target graph simulates the program graph if there is a mapping between the two which successfully represents all the counterfactuals. Since the target graph may be highly redundant, we allow single vertices in the program graph to correspond to collections of vertices in the target graph, and we similarly allow single states in the program graph to correspond to multiple possible states in the target graph. (Imagine a collection of voltages which all map to “0”.)
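To make this strawman concrete, here is a minimal sketch in Python of what checking such a mapping could look like. The discrete-time framing, the run/decode/simulates functions, and the encoder/decoder dictionaries are all simplifying assumptions of mine rather than part of the definition above; the toy example at the end just simulates a single flipping bit with a pair of redundant voltages, echoing the parenthetical about many voltages mapping to “0”.

```python
def run(step, init, intervention=None, n_steps=5):
    """Run a causal graph forward. `step` maps a state dict to the next state
    dict; `intervention` is (time, {vertex: value}) or None. Returns the full
    trajectory of states."""
    state, trajectory = dict(init), [dict(init)]
    for t in range(n_steps):
        state = step(state)
        if intervention is not None and intervention[0] == t:
            state.update(intervention[1])
        trajectory.append(dict(state))
    return trajectory

def decode(target_state, vertex_map, decoders):
    """Read off a program-graph state from a target-graph state, one program
    vertex at a time (many target states may decode to one program state)."""
    return {v: decoders[v]({u: target_state[u] for u in vertex_map[v]})
            for v in vertex_map}

def simulates(program, target, vertex_map, decoders, encoders,
              counterfactuals, n_steps=5):
    """Check that every listed program-graph counterfactual (time, vertex,
    value) is mirrored, after decoding, by the corresponding target-graph
    counterfactual."""
    (p_step, p_init), (t_step, t_init) = program, target
    for (time, v, value) in counterfactuals:
        wanted = run(p_step, p_init, (time, {v: value}), n_steps)
        got = run(t_step, t_init, (time, encoders[v](value)), n_steps)
        if [decode(s, vertex_map, decoders) for s in got] != wanted:
            return False
    return True

# Toy example: the program graph is a single bit that flips every step; the
# target graph stores that bit redundantly as two voltages (0.0 or 5.0).
program = (lambda s: {"bit": 1 - s["bit"]}, {"bit": 0})
target = (lambda s: {"v1": 5.0 - s["v1"], "v2": 5.0 - s["v2"]},
          {"v1": 0.0, "v2": 0.0})
vertex_map = {"bit": {"v1", "v2"}}
decoders = {"bit": lambda vs: int(vs["v1"] > 2.5)}            # voltages -> bit
encoders = {"bit": lambda b: {"v1": 5.0 * b, "v2": 5.0 * b}}  # bit -> voltages
counterfactuals = [(t, "bit", b) for t in range(3) for b in (0, 1)]
print(simulates(program, target, vertex_map, decoders, encoders,
                counterfactuals))  # True
```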

This definition likely has many problems. For example, in a Turing machine virtually all causation has to move through the read/write head, and we have counterfactuals which can move the head. We probably don’t care if the environment implements the computation in a way which runs all causation through one point like this. This definition of simulation may also be trivial (in the Functionalist Triviality sense) when the target graph is large enough and rich enough; a target graph with enough counterfactuals will contain some structure of the right shape to match the program graph. For the sake of this post, just imagine someone has taken this definition as a starting point and attempted to remove some obvious problems. (The arguments I’m presenting were originally replies to a couple of different proposals of this general type.)

Monk Treehouse

An order of monks has dedicated themselves to calculating the program P. The monks live in a gigantic treehouse, and have constructed a large, abacus-like computer on which to run P. Large ceremonial stone weights are moved from one place to another in order to execute the algorithm. The weights are so heavy that moving any one of them could cause the treehouse to become unbalanced and fall down. In order to do their computation, the monks have come up with a balancing system. As a first example, I’ll suppose that these monks came up with a second program P’ when they first built the treehouse, and P’ is executed on a separate set of stone weights on the other side of the treehouse. The program P’ is different from P, but the monks were able to prove a theorem guaranteeing that the distributions of weights would balance.

Intuitively, the monks on one side of the treehouse are simulating the program P; it’s actually what they set out to do. But none of the counterfactuals which we want to take on the program graph map easily to the treehouse weights. If we move a weight on one side to correspond to editing a symbol on the Turing machine’s tape, then the treehouse falls over. If we move some weight on the other side to keep things balanced, the counterfactual may temporarily succeed, but the program P’ will stop obeying the monks’ foreordained guarantee; essentially, P’ may get thrown off course and fail to counterbalance the next move correctly, destroying the treehouse. If we just add hidden weights to counterbalance our change, the hidden weights could become wrong later.

When such a program P’ exists, the treehouse argument is basically pointing out that the “straightforward local counterfactuals” we put in the target graph were not enough; there can be “logical entanglements” which keep us from successfully pointing at the embedding of P.

Of course, we could build into our mapping between program graph and target graph that, corresponding to any counterfactual part of the program graph, we create a physical structure which props up the tree within the target graph. This is beginning to become suspicious, since the states used for counterfactuals look so different from the states corresponding to a non-counterfacted run of the program. One thing we want to avoid is having a notion of simulation which could consider an empty patch of air to be computing P, simply by mapping counterfactuals to a case where a computer suddenly appears. (Such a problem would be even worse than the usual functionalist triviality problem.) When we say that a state of some part of the Turing machine corresponds to some state in the target graph, we should mean the same thing whether we’re in a counterfactual part of the graphs or not.

The “computer from thin air” problem actually seems very similar to the Monk Treehouse problem. One way of trying to intuitively argue that the monks are executing the program P would be to suppose that all the monks switch to using lightweight tokens to do their computation instead of the heavy stones. This seems to be almost as thorough a change as making a counterfactual computer in empty air. Yet it seems to me that the monk treehouse is really computing P, whereas the air is not.

Suppose we have a definition of simulation which can handle this; that is, somehow the treehouse doesn’t fall over under counterfactuals, but the notion of simulation doesn’t build computers out of thin air. It seems to me there are still further complications along the same lines. These monks are good mathematicians, and have devoted their lives to the faithful execution of the program P. One can only imagine that they will be keeping careful records and double-checking that every stone is moved correctly. They will also look for any mathematical properties of the process which can be proven and used to double-check. When we attempt to map counterfactual Turing machine states to counterfactual stone positions, we will be pinned down by this complicated verification process: if we just move the stone, the monks will notice it and move it back; if we try to change the monks’ beliefs too, then they will notice contradictions in their mathematics, lose faith, and abandon the calculations.

What might work would be to find every single case where a monk calculates the stone position, both before and after the actual stone movement, and map counterfactuals on the program graph to a counterfactual where every single one of these calculations is changed. This likely has problems of its own, but it’s interesting as a suggestion that in order to find one simulation of the program P, you need to track down basically all simulations of P (or at least all the ones that might interact).
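As a toy illustration of that suggestion (my own construction, not part of the argument above), consider a value stored redundantly as a stone position and as the monks’ records, where the monks revert any stone that disagrees with their calculations. A counterfactual that edits only the stone is undone, while one that edits every place the value is computed sticks:

```python
class TreehouseCell:
    def __init__(self, value):
        self.stone = value    # physical position of the ceremonial stone
        self.record = value   # the monks' independent calculation of that position

    def monks_double_check(self):
        # The monks compare the stone against their records and move it back
        # if it disagrees.
        if self.stone != self.record:
            self.stone = self.record

def naive_counterfactual(cell, new_value):
    cell.stone = new_value      # intervene only on the stone...
    cell.monks_double_check()   # ...and the monks undo the change
    return cell.stone

def thorough_counterfactual(cell, new_value):
    cell.stone = new_value      # intervene on every calculation of the position,
    cell.record = new_value     # before and after the actual stone movement
    cell.monks_double_check()
    return cell.stone

assert naive_counterfactual(TreehouseCell(0), 1) == 0     # the lone edit is reverted
assert thorough_counterfactual(TreehouseCell(0), 1) == 1  # the thorough edit sticks
```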

To make my analogy clear: both emergent subagents within an AI and instances of one’s decision algorithm in the environment could be embedded in their surroundings in ways not easily teased out by “straightforward local counterfactuals”. Nonetheless, the notion of simulation here seems intuitive, and I can’t help but imagine there’s an appropriate definition somewhere just beyond the ones I’ve seen or tried. I’d be happy to look at any suggestions.