# Probabilistic Tiling (Preliminary Attempt)


We know a decent amount about how to get tiling in proof-based environments where the objective is for the AI to achieve a goal, or to write an AI to achieve a goal, or to write an AI that writes an AI that achieves a goal, and so on, as long as the chain of deferral is finite. The probabilistic setting is far less explored. This post will outline some necessary conditions to achieve tiling of an expected utility maximizer in fully general environments. Some of these conditions are fulfilled by a logical inductor in the limit. It's entirely possible that there are serious problems with this proof, so I'm hoping that it gets a good kicking.

## DEFINITIONS:

is an inputless computation that outputs a bitstring, which will be interpreted as a 2-tuple of an action and another computation, an element of .

is the n'th computation in the infinite sequence of computations defined by starting at and taking the computation that each computation outputs. When n=1, the superscript will be omitted.

is the action selected by .

is some probability distribution, or something else that can be used to evaluate complicated computations, like the n'th stage of a logical inductor, although we aren't necessarily assuming that it has to be a logical inductor.

Underlining something refers to replacing the symbol with the thing being described. If the variable is used as a way of abstractly representing some action, then the underlined version refers to the actual action chosen by the policy.

is the utility function, which is some large computation of type . So this would be the utility function output when it is fed the past actions generated by starting at , the current action, and the infinite sequence of future actions given by whatever sequence of actions is produced by the self-modification chain starting with . Note that this only explicitly writes the starting code and the code that might be modified to, not the past or future action sequence! This is important for the agent to be able to reason about this computation, despite it taking an infinite input.

Given some sequence of probability distributions, and some sequence of sets of policies, let be the computation

This is basically maximizing expected utility, conditional on the agent itself taking some action, with the past sequence of actions given by an expected utility maximizer on all previous steps.

Actually, in the proof, we will be assuming a 2-step process where some essential things are proved, and then argmax is done. This is done so that the future probability distribution respects some important information. The three important bits of computation are: the action of the previous computation must be computed/observed; all the policies in must be emulated to know what policy and action they produce, with all of the produced policies added to ; and the proof in the next section must be done, to ensure that the conditional probabilities work out appropriately.
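The two-phase step described here can be sketched in code. Everything below is my own scaffolding, not the post's formalism: `Dist.conditional_eu` stands in for evaluating the stage-n distribution, and policies are simplified to objects whose `run()` returns an (action, successor) pair.

```python
def argmax_step(dist, policy_set, prev_action):
    """Two-phase step from the text. Phase 1: observe the previous action
    and emulate every candidate policy, adding each produced successor
    policy back into the candidate set. Phase 2: argmax over conditional
    expected utility. The `dist` interface is invented for this sketch."""
    # Phase 1: emulate policies; close the candidate set under successors.
    pairs = []
    seen = set(policy_set)
    for pi in list(policy_set):
        action, successor = pi.run()
        pairs.append((action, successor))
        if successor not in seen:  # produced policies join the search space
            seen.add(successor)
    # Phase 2: pick the (action, successor) pair with the best conditional
    # EU, conditioning on "I take action a and self-modify to policy p".
    return max(pairs, key=lambda ap: dist.conditional_eu(prev_action, *ap))
```

The point of doing the emulation before the argmax is that the distribution used in phase 2 has already seen what every candidate policy outputs.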

Now, time for the assumptions.

The first assumption is that abstract expected utility equals concrete expected utility.
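The formula for this assumption did not survive the export; a plausible reconstruction, in notation I am introducing here (with $\mathbb{E}_n$ the expectation under the stage-$n$ distribution, $a_n$ the abstract "action I take at step $n$", and $\underline{a_n}$ the concrete action actually chosen), would be:

$$\mathbb{E}_n\!\left[U(a_n)\right] \;=\; \mathbb{E}_n\!\left[U(\underline{a_n})\right]$$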

This essentially says that if you're going to actually take action 4, then the expected utility of "I take an action" had better be the same as the expected utility of action 4. This is probably fulfilled by a logical inductor: it would be possible to buy a conditional share/regular share in the lower-priced utility computation and sell a conditional share/regular share in the higher-priced utility computation, and when the action is actually taken, the provable equivalence between the action actually taken and the action in the abstract means that the prices converge to each other, so the theorems in the logical induction paper probably enable profiting off the predictable future equality.

One important note is that CDT spectacularly fails this property. Namely, consider a game of matching pennies against a powerful predictor. Since the environment takes actions as input, it's possible to recompute what would have happened if a different action is plugged in. The CDT agent that keeps losing is going to learn to randomize between actions, since it keeps seeing that the action it didn't take would have done better. So it eventually gets to a state where it predicts the reward from "pick heads" and "pick tails" is 0.5 (because there's a 50% chance it doesn't pick heads/tails), but it predicts the reward from "I take an action" is 0, violating this assumption. EDT, however, will predict that picking heads, picking tails, and picking some action all have utility 0, thus fulfilling this condition.
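A tiny simulation makes the CDT failure concrete. The setup below is my own illustrative sketch of the scenario in the text: the agent wants to mismatch a perfect predictor, randomizes 50/50, and we compare the actual average reward ("I take an action") with the CDT-style counterfactual reward of each fixed action, holding the predictor's move fixed at what it actually played.

```python
import random

def cdt_matching_pennies(n_rounds=10000, seed=0):
    """Matching pennies against a perfect predictor, scored two ways:
    the actual reward stream, and the per-action counterfactual reward
    computed the CDT way (replay the round with the predictor's move
    held fixed). Returns (actual average, counterfactual averages)."""
    rng = random.Random(seed)
    actual = 0.0
    counterfactual = {"heads": 0.0, "tails": 0.0}
    for _ in range(n_rounds):
        action = rng.choice(["heads", "tails"])  # the losing agent randomizes
        predictor = action                       # a perfect predictor matches it
        actual += (action != predictor)          # always 0: never a mismatch
        for alt in ("heads", "tails"):           # CDT counterfactual replay
            counterfactual[alt] += (alt != predictor)
    return actual / n_rounds, {a: r / n_rounds for a, r in counterfactual.items()}
```

The actual reward comes out to exactly 0, while each fixed action's counterfactual reward hovers around 0.5, which is precisely the gap that violates the first assumption.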

The second assumption is that the environment is fair. Namely:
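The formula here was also lost in the export; a hedged reconstruction, in the same invented notation as above, is that for any two policies $\pi, \pi'$ whose outputs are the same action–successor pair $(a, p)$:

$$\mathbb{E}_n\!\left[U \,\middle|\, \underline{\pi} = (a, p)\right] \;=\; \mathbb{E}_n\!\left[U \,\middle|\, \underline{\pi'} = (a, p)\right]$$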

This says something like "after finding out that you take a particular action and policy, that should have the same expected utility as some other policy picking the same action and successor policy", or, more snappily, "finding out that you take a particular action shouldn't be bad news for the expected utility of a particular sequence of actions". The matching-pennies predictor that is just penalizing the sequence of actions selected by an argmaxer is unfair in this sense for the argmaxer. This condition (or something like it) seems necessary, in the sense that you can come up with examples where a failure of this condition incentivizes modifying yourself to a different way of selecting actions.

The third assumption (which has already been stated) is that strategy stealing works. More specifically, if , then , and is computed at time n. This is another necessary condition. Consider a fair environment that rewards you when you do the hamster dance. You are pretty sure you remember the steps, but aren't completely sure. There's another policy that will make you correctly perform the motions. The expected utility of "move your arm left" is different from the expected utility of "hamster dance policy does its move". However, if you were able to compute what the hamster dance policy would output as its next move, you could just copy the move, without giving up your argmaxiness property. Therefore, if you don't know what move a policy outputs, there may be an incentive to modify to it, even if the environment is fair.
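The hamster-dance argument can be sketched as code. All names and the `expected_utility` interface below are illustrative assumptions, not part of the post's formalism:

```python
def argmax_with_stealing(own_actions, other_policies, expected_utility):
    """Strategy stealing: if you can emulate another policy's next move,
    add that move to your own candidate set and argmax as usual, instead
    of self-modifying into that policy. The agent's choice is then always
    at least as good as deferring to any policy it can emulate."""
    candidates = set(own_actions)
    for policy in other_policies:
        candidates.add(policy())  # copy the move, keep the argmax architecture
    return max(candidates, key=expected_utility)
```

If the hamster-dance policy's next step can be computed, the argmax agent simply performs that step itself, so there is no incentive to self-modify; the incentive only appears when the emulation is unavailable.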

The fourth assumption is conditional future-trust of the following form:
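(The formula itself was lost in the export; a plausible reconstruction, in the same invented notation as above, with $\phi$ the sentence being conditioned on, is:)

$$\mathbb{E}_n\!\left[U \,\middle|\, \phi\right] \;=\; \mathbb{E}_n\!\left[\mathbb{E}_{n+1}\!\left[U\right] \,\middle|\, \phi\right]$$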

This is emphatically false in general, but there's a special condition that makes it viable, namely that the distribution at time n is guaranteed to assign probability 1 to iff . My epistemic state about this is "this seems extremely plausible, but I don't know for sure whether logical inductors attain this property in the limit".

## PROOF:

We will reason from the perspective of the expected utility maximizer in the proof phase shortly before turn n-1, show that the future distribution that it trusts endorses expected utility maximization over all other policies, and then use conditional self-trust to move this conclusion back to the current maximizer to inform its choice of which policy to pick.

To begin (we aren't reasoning from the perspective of the agent yet), assume that there is some pair that achieves a strictly higher expected utility (according to ) than all other pairs whose second element is .

In particular, there is some pair that achieves a higher expected utility than .

Now, reason from the perspective of the agent, and take as a starting assumption that .

To begin with, by the definition of , and because is in by the strategy-stealing assumption, for whatever action and policy the argmax policy picks,

holds, because of the definition of argmax. (Very important note: this inequality isn't necessarily true if we have exploration steps! Exploration steps prevent this tiling argument from going through. However, we require well-defined conditionals for the rest of the reasoning, which is attained by exploration steps, so there is a weird tension here.)

Now, by the argument for the equality of abstract and concrete utility, if there's some policy in particular that is getting a high conditional expected utility, the conditional expected utility of "I self-modify to whatever I'm going to self-modify to" should be the same as the expected utility of "I self-modify to this particular policy that looks really good".

Now, by assumption 1, about the equality of abstract and concrete utility,

And because by assumption, and the future probability distribution will have seen that and assigned it probability 1 (so conditioning on it doesn't change anything),

Now, we'll begin working with the other side of the inequality. By assumption 2, that the environment is fair,

Now, since the agent at time n has computed and proved that

And, by the previous argument about how the action taken is x and the future agent knows it,

Now, because the agent has proved that

if the agent responds to proofs of by having the expected value of the LUV conditional on being , then in the argmax estimating phase,

And then, by linearity of expectation,

and then, by assumption 4, about conditional self-trust, we get that

And this contradicts the starting assumption.

Therefore, if all the assumptions hold for all timesteps (as well as stuff like linearity of expectation under conditionals), and the argmax policy is first in the tie-breaking ordering, then if there's an expected-utility maximizer on the first step, it will keep tiling to expected utility maximizers forever afterwards. Note that this also makes true the implicit assumption in the argmax that the past action sequence was produced by expected utility maximizers.

## ANALYSIS:

There are a few interesting features of this proof, even though it's ultimately unsatisfactory. One interesting feature is that the argument fails when the agent has exploration steps. This feels tied to the issue where AIXI is optimal for the starting policy distribution, while AIXI+exploration is optimal for the starting policy distribution. The proof is dissatisfactory because it heavily relies on exact equalities, which don't hold for logical inductors (rather, the relevant properties hold in the limit).

Another reason for dissatisfaction is that this is an EDT agent, and it doesn't take into account most of the more exotic decision-theory stuff. In particular, it's quite unclear what an agent of this type would do on Parfit's Hitchhiker, and I would at least like to generalize it to handle cases where the future action affects the probability of some observation in the past, even if it can't quite attain the full force of UDT. There's also the issue of the conditional probabilities becoming increasingly nonsensical (and self-fulfilling prophecies showing up, such as the cosmic ray problem) if we throw out exploration steps. After all, the nice properties in the logical inductor paper only apply to sequences of actions where the probability approaches 0 slowly enough that the sum of probabilities across turns limits to infinity (such as the harmonic sequence). Finally, I'm unsure what this does on classic problems such as the procrastination paradox, where you have to avoid pressing a button for as many turns as you can, but still press it eventually, and I'd be interested in analyzing that.

There are still some interesting lessons that can be taken from this, however. Namely, that the ability to steal the action of another policy is essential to show that you won't tile to it (because you can just copy the action while keeping your decision-making architecture intact, while otherwise, fair environments can have incentives to self-modify). Also, the expected utility of "I take an action" being equal to the expected utility of "I take this specific action that looks like the best so far" seems to be unusually important, because in conjunction with future-trust, it allows working around the core problem of current-you not knowing what future-you will do.

Looking at the problems from Logical Inductor Tiling and Why It's Hard, we can make the following recap:

The first problem, of CDT sucking, can be overcome by conditioning on "I take this action".

The second problem, of being bad at considering sequences of actions, can be overcome by just looking one step ahead. If you want to get certain sequences of actions that look good, you just need to strategy-steal at each step from a policy that enacts that sequence of actions; this also has the nice property of preserving your freedom of action in the future, if future-you thinks that some other action sequence is actually better.

The third problem is solved by the solution presented in the linked post.

The fourth problem, of forgetting information, isn't really solved head-on, because the future-trust property probably fails if the future agent throws away information. However, it's highly suggestive that this tiling result involves just an abstract sequence of past actions, instead of unpacking them into an explicit sequence of past actions, and also that it looks just one step ahead instead of considering a far-future version of itself.

The fifth problem, of having strategies that react to the environment, isn't really solved, but it does manage to get around the problem of not knowing what some other algorithm will eventually do. All that is really needed is that the future version of the agent knows what the other algorithm does, so it can strategy-steal. This is still not a fully satisfactory solution, since there's one policy that's clearly more powerful than the rest of them, but it still seems like substantial progress has been made.

The sixth problem isn't really a fixable problem; it's just a situation where the agent has an incentive to modify its policy to something else because the environment violates the fairness condition.

The seventh problem, about being unable to experiment with the consequences of a code rewrite, is partially solved by assuming that the environment only cares about actions, and partially solved by conditioning on actions instead of "I pick this policy" (this conditional has higher probability).

Problem 8 is dodged, and is pretty much the same as the fifth problem.

Problem 9 is not addressed, because this tiling proof assumes exact equalities, and doesn't deal with the noise from logical inductors not being perfect impairing sufficiently long chains of trust. However, the fact that the proof only looks one step ahead feels promising for not having compounding noise break things over time.

• I suggest stating the result you're proving before giving the proof.

You have some unusual notation that I think makes some of this unnecessarily confusing. Instead of this underlined vs non-underlined thing, you should have different functions and , where the first maps action sequences to utilities, and the second maps a pair consisting of an action and a future policy to the utility of the action sequence beginning with , followed by , followed by the action sequence generated by . Your first assumption would then be stated . Your second assumption (fairness of the environment) is implicit in the type signature of the utility function . If your utility depends on something other than the action sequence, then it doesn't make sense to write it as a function of the action sequence. It's good to point out assumptions that are implicit in the formalism you're using, but by the time you identify utility as a function of action sequences, you don't need to assume fairness of the environment as an additional axiom.

I do not understand what your third assumption is.

> This is emphatically false in general, but there's a special condition that makes it viable, namely that the distribution at time n is guaranteed to assign probability 1 to iff . My epistemic state about this is "this seems extremely plausible, but I don't know for sure if logical inductors attain this property in the limit"

They don't. For instance, let be any true undecidable sentence. The logical inductor does not assign probability 1 to even in the limit. Your fourth assumption does not seem reasonable. Does not give you what you want?

> Note that this only explicitly writes the starting code, and the code that might be modified to, not the past or future action sequence! This is important for the agent to be able to reason about this computation, despite it taking an infinite input.

I think this is exactly backwards. The property that makes spaces easy to search through and reason about is compactness, not finiteness. If is finite, then is compact, and thus easy to search through and reason about, provided the relevant functions on it are continuous. But the space of computer programs is an infinite discrete space, hence non-compact, and hard to search through and reason about, except by remembering that the purpose of selecting a program is so that it will generate an element of the nice, easily-searchable compact space of action sequences.

• Note: looks like you were trying to use markdown. To use markdown in our editor you need to press cmd-4. (Originally the "$" notation worked, but people who weren't familiar with LaTeX were consistently confused about how to actually type a dollar sign.)

• First: That notation seems helpful. Fairness of the environment isn't present by default; it still needs to be assumed even if the environment is purely action-determined, since you can consider an agent in the environment that is using a hardwired predictor of what the argmax agent would do. It is just a piece of the environment, and feeding a different sequence of actions into the environment as input gets a different score, so the environment is purely action-determined, but it's still unfair in the sense that the expected utility of feeding action into the function drops sharply if you condition on the argmax agent selecting action . The third condition was necessary to carry out this step. The intuitive interpretation of the third condition is that, if you know that policy B selects action 4, then you can step from "action 4 is taken" to "policy B takes the actions it takes"; and if you have a policy where you don't know what action it takes (third condition violated), then "policy B does its thing" may have a higher expected utility than any particular action being taken, even in a fair environment that only cares about action sequences, as the hamster dance example shows.

Second: I think you misunderstood what I was claiming. I wasn't claiming that logical inductors attain the conditional future-trust property, even in the limit, for all sentences or all true sentences. What I was claiming was: the fact that is provable or disprovable in the future (in this case, is ) makes the conditional future-trust property hold (I'm fairly sure), and for statements where there isn't guaranteed feedback, the conditional future-trust property may fail. The double-expectation property that you state does not work to carry the proof through, because the proof (from the perspective of the first agent) takes as an assumption, so the "conditional on " part has to be outside of the future expectation when you go back to what the first agent believes.

Third: the sense I meant for "agent is able to reason about this computation" is that the computation is not extremely long, so logical inductor traders can bet on it.

• Ok, understood on the second assumption. is not a function to , but a function to the set of -valued random variables, and your assumption is that this random variable is uncorrelated with certain claims about the outputs of certain policies. The intuitive explanation of the third condition made sense; my complaint was that even with the intended interpretation at hand, the formal statement made no sense to me.

I'm pretty sure you're assuming that is resolved on day , not that it is resolved eventually.

Searching over the set of all Turing machines won't halt in a reasonably short amount of time, and in fact won't halt ever, since the set of all Turing machines is non-compact. So I don't see what you mean when you say that the computation is not extremely long.

• Ah, the formal statement was something like "if the policy A isn't the argmax policy, the successor policy B must be in the policy space of the future argmax, and the action selected by policy A is computed, so the relevant equality holds".

Yeah, I am assuming fast feedback: that is resolved on day .

What I meant was that the computation isn't extremely long in the sense of description length, not in the sense of computation time. Also, we aren't doing policy search over the set of all Turing machines; we're doing policy search over some smaller set of policies that can be guaranteed to halt in a reasonable time (and more can be added as time goes on).

Also, I'm less confident in conditional future-trust for all conditionals than I used to be; I'll try to crystallize where I think it goes wrong.

• > What I meant was that the computation isn't extremely long in the sense of description length, not in the sense of computation time. Also, we aren't doing policy search over the set of all Turing machines; we're doing policy search over some smaller set of policies that can be guaranteed to halt in a reasonable time (and more can be added as time goes on)

Wouldn't the set of all action sequences have a lower description length than some large finite set of policies? There's also the potential problem that all of the policies in the large finite set you're searching over could be quite far from optimal.

• I second the suggestion to state what is being proved before proving it.

> One important note is that CDT spectacularly fails this property. Namely, consider a game of matching pennies against a powerful predictor. Since the environment takes actions as input, it's possible to recompute what would have happened if a different action is plugged in. The CDT agent that keeps losing is going to learn to randomize between actions since it keeps seeing that the action it didn't take would have done better. So it eventually gets to a state where it predicts the reward from "pick heads" and "pick tails" is 0.5 (because there's a 50% chance it doesn't pick heads/tails), but it predicts the reward from "I take an action" is 0, violating this assumption.

Note, however, that ordinary Bayes-optimal RL works perfectly (assuming there are no traps in the prior, or paranoia is otherwise avoided), since it would believe that taking a certain action causes the predictor to make the optimal response. This is similar to RL one-boxing in the repeated Newcomb's problem.