Probabilistic Tiling (Preliminary Attempt)

We know a decent amount about how to get tiling in proof-based environments where the objective is for the AI to achieve a goal, or to write an AI to achieve a goal, or to write an AI that writes an AI that achieves a goal, and so on, as long as the chain of deferral is finite. The probabilistic setting is far less explored. This post will outline some necessary conditions to achieve tiling of an expected utility maximizer in fully general environments. Some of these conditions are fulfilled by a logical inductor in the limit. It's entirely possible that there are serious problems with this proof, so I'm hoping that it gets a good kicking.

DEFINITIONS:

A policy is an inputless computation that outputs a bitstring, which will be interpreted as a 2-tuple of an action and another computation, i.e. an element of the product of the action space and the space of policies.

The n'th computation in the infinite sequence of computations defined by starting at a given policy and repeatedly taking the computation that each computation outputs is denoted by a superscript n. When n=1, the superscript will be omitted.

The action of a policy is the action component of the pair that it outputs.

At each stage there is some probability distribution, or something that can be used to evaluate complicated computations, like the n'th stage of a logical inductor, although we aren't necessarily assuming that it has to be a logical inductor.

Underlining something refers to replacing the symbol with the thing being described. If a variable is used as a way of abstractly representing some action, then the underlined variable refers to the actual action chosen by the policy.

The utility function is some large computation that takes an infinite sequence of actions and outputs a utility. The utility computation the agent reasons about is the output of the utility function when it is fed the past actions generated by starting at the original policy, the current action, and the infinite sequence of future actions produced by the self-modification chain starting with the successor policy. Note that this only explicitly writes the starting code, and the code that might be modified to, not the past or future action sequence! This is important for the agent to be able to reason about this computation, despite it taking an infinite input.
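
To have something concrete to point at in the sketches below, fix some notation; the particular symbols ($\Pi$, $A$, $\pi^n$, $a_\pi$, $\mathbb{P}_n$, $P_n$, $U_n$) are an arbitrary choice made here rather than anything the definitions above pin down:

$$\pi^n \text{ outputs } \left(a_{\pi^n},\ \pi^{n+1}\right) \in A \times \Pi, \qquad \pi^1 = \pi$$
$$\mathbb{P}_n : \text{the stage-}n\text{ distribution}, \qquad P_n \subseteq \Pi : \text{the stage-}n\text{ policy set}$$
$$U : A^{\mathbb{N}} \to [0,1], \qquad U_n(\pi,\ a,\ \pi') := U\!\left(a_{\pi^1}, \ldots, a_{\pi^{n-1}},\ a,\ a_{(\pi')^1}, a_{(\pi')^2}, \ldots\right)$$

Here $U_n(\pi, a, \pi')$ is the utility computation whose past actions come from the chain starting at $\pi$, whose current action is $a$, and whose future actions come from the chain starting at $\pi'$; the range $[0,1]$ is also just an assumption for concreteness.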

Given some sequence of probability distributions, and some sequence of sets of policies, define the expected utility maximizer at stage n to be the computation described below.

This is basically maximizing expected utility, conditional on the agent itself taking some action, with the past sequence of actions given by an expected utility maximizer on all previous steps.
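
In the notation assumed above, a rough sketch of this computation (glossing over exactly how the conditioning event is phrased, which is a detail being assumed here) is:

$$\pi^E_n \text{ outputs } \operatorname*{argmax}_{(a,\ \pi') \in A \times P_n} \ \mathbb{E}_{\mathbb{P}_n}\!\left[\, U_n\!\left(\pi^E_1,\ a,\ \pi'\right) \;\middle|\; \pi^E_n \text{ outputs } (a, \pi') \,\right]$$

where $\pi^E_1$ is the expected utility maximizer on the first step, so the past actions fed to $U$ are the ones produced by expected utility maximizers on all previous steps.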

Actually, in the proof, we will be assuming a 2-step process where some essential things are proved, and then argmax is done. This is done so that the future probability distribution respects some important information. The three important bits of computation are: the action of the previous computation must be computed/observed; all the policies in the current policy set must be emulated to know what policy and action they produce, with all of the produced policies added to the set; and the proof in the next section must be done to ensure that the conditional probabilities work out appropriately.
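
A very rough sketch of this two-phase step, to make the bookkeeping concrete; every helper name here (observe_previous_action, emulate, conditional_expected_utility) is a hypothetical stand-in rather than machinery defined in this post:

```python
def agent_step(P_n, policy_set, observe_previous_action, emulate,
               conditional_expected_utility):
    """One turn of the self-modifying expected utility maximizer (sketch).

    P_n         -- the stage-n distribution (e.g. one stage of a logical inductor)
    policy_set  -- the stage-n set of candidate policies
    """
    # Phase 1a: compute/observe the action of the previous computation.
    prev_action = observe_previous_action()

    # Phase 1b: emulate every candidate policy to see which (action, successor)
    # pair it produces, and add the produced successor policies to the set.
    options = []
    for pi in list(policy_set):
        action, successor = emulate(pi)
        options.append((action, successor))
        policy_set.add(successor)

    # Phase 1c (elided): establish the facts about conditional probabilities
    # that the proof in the next section relies on.

    # Phase 2: argmax over the emulated (action, successor) pairs, scored by
    # expected utility under P_n conditional on that pair being the one picked.
    return max(
        options,
        key=lambda pair: conditional_expected_utility(P_n, prev_action, pair),
    )
```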

Now, time for the assumptions.

The first assumption is that abstract expected utility equals concrete expected utility.
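
One plausible way to write this in the notation assumed above, using underlining as defined earlier (the original statement may well differ in details):

$$\mathbb{E}_{\mathbb{P}_n}\!\left[\, U_n\!\left(\pi^E_1,\ a_{\pi^E_n},\ (\pi^E_n)^{2}\right) \,\right] \;=\; \mathbb{E}_{\mathbb{P}_n}\!\left[\, U_n\!\left(\pi^E_1,\ \underline{a_{\pi^E_n}},\ \underline{(\pi^E_n)^{2}}\right) \,\right]$$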

This essentially says that if you're going to actually take action 4, then the expected utility of “I take an action” had better be the same as the expected utility of action 4. This is probably fulfilled by a logical inductor, as it would be possible to buy a conditional share/regular share in the lower-priced utility computation, and sell a conditional share/regular share in the higher-priced utility computation. When the action is actually taken, the provable equivalence between the action actually taken and the action in the abstract means that the prices converge to each other, and the theorems in the logical induction paper probably enable profiting off the predictable future equality.

One important note is that CDT spectacularly fails this property. Namely, consider a game of matching pennies against a powerful predictor. Since the environment takes actions as input, it's possible to recompute what would have happened if a different action is plugged in. The CDT agent that keeps losing is going to learn to randomize between actions, since it keeps seeing that the action it didn't take would have done better. So it eventually gets to a state where it predicts the reward from “pick heads” and “pick tails” is 0.5 (because there's a 50% chance it doesn't pick heads/tails), but it predicts the reward from “I take an action” is 0, violating this assumption. EDT, however, will predict that picking heads, picking tails, and picking some action all have utility 0, thus fulfilling this condition.
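
Concretely, writing $do(\cdot)$ for CDT-style evaluation of an intervention (notation introduced just for this illustration), the situation described above is:

$$\text{CDT:}\quad \mathbb{E}[U \mid do(\text{heads})] = \mathbb{E}[U \mid do(\text{tails})] = 0.5, \qquad \mathbb{E}[U \mid \text{I take an action}] = 0$$
$$\text{EDT:}\quad \mathbb{E}[U \mid \text{heads}] = \mathbb{E}[U \mid \text{tails}] = \mathbb{E}[U \mid \text{I take an action}] = 0$$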

The second assumption is that the environment is fair.
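
One plausible rendering in the assumed notation (the original statement may differ): for any other policy $\pi''$ that produces the same action and successor policy,

$$\mathbb{E}_{\mathbb{P}_n}\!\left[\, U_n\!\left(\pi^E_1,\ a,\ \pi'\right) \;\middle|\; \pi^E_n \text{ outputs } (a, \pi') \,\right] \;=\; \mathbb{E}_{\mathbb{P}_n}\!\left[\, U_n\!\left(\pi^E_1,\ a,\ \pi'\right) \;\middle|\; \pi'' \text{ outputs } (a, \pi') \,\right]$$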

This says something like “after finding out that you take a particular action and policy, that should have the same expected utility as some other policy picking the same action and successor policy”, or, more snappily, “finding out that you take a particular action shouldn't be bad news for the expected utility of a particular sequence of actions”. The matching pennies predictor that is just penalizing the sequence of actions selected by an argmaxer is unfair in this sense for the argmaxer. This condition (or something like it) seems necessary, in the sense that you can come up with examples where a failure of this condition incentivizes modifying yourself to a different way of selecting actions.

The third assumption (which has already been stated) is that strategy stealing works. More specifically, if a policy is in the policy set, then the successor policy it outputs is also available to the argmax, and its action is computed at time n. This is another necessary condition. Consider a fair environment that rewards you when you do the hamster dance. You are pretty sure you remember the steps, but aren't completely sure. There's another policy that will make you correctly perform the motions. The expected utility of “move your arm left” is different from the expected utility of “hamster dance policy does its move”. However, if you were able to compute what the hamster dance policy would output as its next move, you could just copy the move, without giving up your argmaxiness property. Therefore, if you don't know what move a policy outputs, there may be an incentive to modify to it, even if the environment is fair.
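
In the assumed notation, the strategy-stealing condition is roughly (with the caveat that whether the produced successor lands in the current or the next policy set is a detail being glossed over):

$$\pi \in P_n \;\Longrightarrow\; \pi^{2} \in P_n \ \text{ and } \ a_\pi \text{ is computed by time } n$$

so that the pair $(a_\pi, \pi^2)$ is among the options the argmax at step n can copy.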

The fourth assumption is a form of conditional future-trust.
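
A plausible form of this in the assumed notation, writing $\phi$ for the event being conditioned on (such as “the agent outputs this particular action/policy pair”) and $U$ for the relevant utility computation:

$$\mathbb{E}_{\mathbb{P}_{n-1}}\!\left[\, U \;\middle|\; \phi \,\right] \;=\; \mathbb{E}_{\mathbb{P}_{n-1}}\!\left[\, \mathbb{E}_{\mathbb{P}_n}\!\left[\, U \mid \phi \,\right] \;\middle|\; \phi \,\right]$$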

This is emphatically false in general, but there's a special condition that makes it viable, namely that the distribution at time n is guaranteed to assign probability 1 to the event being conditioned on if and only if that event actually occurs. My epistemic state about this is “this seems extremely plausible, but I don't know for sure if logical inductors attain this property in the limit”.
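
In symbols (again with $\phi$ as the conditioned-on event), the special condition is roughly:

$$\mathbb{P}_n(\phi) = 1 \;\iff\; \phi \text{ actually holds}$$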

PROOF:

We will reason from the perspective of the expected utility maximizer in the proof phase shortly before turn n-1, and show that the future distribution that it trusts endorses expected utility maximization over all other policies, and use conditional self-trust to move this conclusion back to the current maximizer to inform its choice of which policy to pick.

To begin (we aren't reasoning from the perspective of the agent yet), assume that there is some action/policy pair that achieves a strictly higher expected utility (according to the current distribution) than all other pairs whose second element is the next-step expected utility maximizer.

In particular, there is some pair that achieves a higher expected utility than the pair consisting of the action actually taken and the next-step expected utility maximizer.

Now, reason from the perspective of the agent, and take as a starting assumption that the action actually taken is some particular action x.

To begin with, because the competing pair is among the available options (by the strategy-stealing assumption), whatever action and policy the argmax policy picks must have a conditional expected utility at least as high as that of the competing pair.

This holds because of the definition of argmax. (Very important note: this inequality isn't necessarily true if we have exploration steps! Exploration steps prevent this tiling argument from going through. However, we require well-defined conditionals for the rest of the reasoning, which is attained by exploration steps, so there is a weird tension here.)
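
In the assumed notation, with $(a^*, \pi^*)$ the competing pair from the starting assumption and $(x, \pi')$ whatever pair the argmax actually picks, the inequality being invoked is roughly:

$$\mathbb{E}_{\mathbb{P}_{n-1}}\!\left[\, U_{n-1}\!\left(\pi^E_1,\ x,\ \pi'\right) \;\middle|\; \pi^E_{n-1} \text{ outputs } (x, \pi') \,\right] \;\ge\; \mathbb{E}_{\mathbb{P}_{n-1}}\!\left[\, U_{n-1}\!\left(\pi^E_1,\ a^*,\ \pi^*\right) \;\middle|\; \pi^E_{n-1} \text{ outputs } (a^*, \pi^*) \,\right]$$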

Now, by the argument for the equality of abstract and concrete utility, if there's some policy in particular that is getting a high conditional expected utility, the conditional expected utility of “I self-modify to whatever I'm going to self-modify to” should be the same as the expected utility of “I self-modify to this particular policy that looks really good”.

Now, by assumption 1, about the equality of abstract and concrete utility,

And because the action taken is x by assumption, and the future probability distribution will have seen that and assigned it probability 1 (so conditioning on it doesn't change anything),

Now, we'll begin working with the other side of the inequality. By assumption 2, that the environment is fair,

Now, since the agent at time n has computed and proved that

And, by the previous argument about how the action taken is x and the future-agent knows it,

Now, because the agent has proved that

if the agent responds to proofs of this sort by having the expected value of the LUV conditional on the event in question be the proved value in the argmax estimating phase,

And then, by linearity of expectation,

and then, by assumption 4 about conditional self-trust, we get that

And this contradicts the starting assumption.

Therefore, if all the assumptions hold for all timesteps (as well as stuff like linearity of expectation under conditionals), and the argmax policy is first in the tie-breaking ordering, then if there's an expected-utility maximizer on the first step, it will keep tiling to expected utility maximizers forever afterwards. Note that this also makes the implicit assumption in the argmax that the past action sequence was produced by expected utility maximizers true.

ANALYSIS:

There are a few interesting features of this proof, even though it's ultimately unsatisfactory. One interesting feature is that the argument fails when the agent has exploration steps. This feels tied to the issue where AIXI is optimal for the starting policy distribution, while AIXI+exploration is optimal for the starting policy distribution. The proof is dissatisfactory because it heavily relies on exact equalities, which don't hold for logical inductors (rather, the relevant properties hold in the limit).

Another reason for dissatisfaction is that this is an EDT agent, and it doesn't take into account most of the more exotic decision-theory stuff. In particular, it's quite unclear what an agent of this type would do on Parfit's Hitchhiker, and I would at least like to generalize it to handle cases where the future action affects the probability of some observation in the past, even if it can't quite attain the full force of UDT. There's also the issue of the conditional probabilities becoming increasingly nonsensical (and self-fulfilling prophecies showing up, such as the cosmic ray problem) if we throw out exploration steps. After all, the nice properties in the logical inductor paper only apply to sequences of actions where the probability approaches 0 slowly enough that the sum of probabilities across turns limits to infinity (such as the harmonic sequence). Finally, I'm unsure what this does on classic problems such as the procrastination paradox, where you have to avoid pressing a button for as many turns as you can, but still press it eventually, and I'd be interested in analyzing that.

There are still some interesting lessons that can be taken from this, however. Namely, that the ability to steal the action of another policy is essential to show that you won't tile to it (because you can just copy the action while keeping your decision-making architecture intact, while otherwise, fair environments can have incentives to self-modify). Also, the expected utility of “I take an action” being equal to the expected utility of “I take this specific action that looks like the best so far” seems to be unusually important, because in conjunction with future-trust, it allows working around the core problem of current you not knowing what future-you will do.

Looking at the problems from Logical Inductor Tiling and Why It's Hard, we can make the following recap:

The first problem of CDT sucking can be overcome by conditioning on “I take this action”.

The second problem of being bad at considering sequences of actions can be overcome by just looking one step ahead. If you want to get certain sequences of actions that look good, you just need to strategy-steal at each step from a policy that enacts that sequence of actions, and this also has the nice property of preserving the freedom of what your action will be in the future, if future-you thinks that some other action sequence is actually better.

The third problem is solved by the solution presented in the linked post.

The fourth problem of forgetting information isn't really solved head-on, because the future-trust property probably fails if the future-agent throws away information. However, it's highly suggestive that this tiling result just involves an abstract sequence of past actions, instead of unpacking them into an explicit sequence of past actions, and also this tiling result just looks one step ahead instead of considering a far-future version of itself.

The fifth problem of having strategies that react to the environment isn't really solved, but it does manage to get around the problem of not knowing what some other algorithm will eventually do. All that is really needed is that the future version of the agent knows what the other algorithm does, so it can strategy-steal. This is still not a fully satisfactory solution, since there's one policy that's clearly more powerful than the rest of them, but it still seems like substantial progress has been made.

The sixth problem isn't really a fixable problem; it's just a situation where the agent has an incentive to modify its policy to something else because the fairness condition is violated.

The seventh problem, about being unable to experiment with the consequences of a code rewrite, is partially solved by assuming that the environment only cares about actions, and partially solved by conditioning on actions instead of “I pick this policy” (this conditional has higher probability).

Problem 8 is dodged, and is pretty much the same as the fifth problem.

Problem 9 is not addressed, because this tiling proof assumes exact equalities, and doesn't deal with the noise from logical inductors not being perfect impairing sufficiently long chains of trust. However, the fact that the proof only looks one step ahead feels promising for not having compounding noise over time breaking things.