# Decision Theory

A longer text-based ver­sion of this post is also available on MIRI's blog here, and the bibliog­ra­phy for the whole se­quence can be found here.

Cross-post­ing some com­ments from the MIRI Blog:

Kon­stantin Surkov:

Re: 5-10 problem
I don’t get it. A human is obviously (in that regard) an agent reasoning about his actions. A human will also choose 10 without any difficulty. What in the human decision-making process is not formalizable here? Assuming we agree that 10 is the rational choice.

Abram Dem­ski:

Suppose you know that you take the $10. How do you reason about what would happen if you took the $5 instead? It seems easy if you know how to separate yourself from the world, so that you only think of external consequences (getting $5). If you think about yourself as well, then you run into contradictions when you try to imagine the world where you take the $5, because you know it is not the sort of thing you would do. Maybe you have some absurd predictions about what the world would be like if you took the $5; for example, you imagine that you would have to be blind. That’s alright, though, because in the end you are taking the $10, so you’re doing fine.

Part of the point is that an agent can be in a similar position, except it is taking the $5, knows it is taking the $5, and is unable to figure out that it should be taking the $10 instead due to the absurd predictions it makes about what happens when it takes the $10. It seems kind of hard for a human to end up in that situation, but it doesn’t seem so hard to get this sort of thing when we write down formal reasoners, particularly when we let them reason about themselves fully (as natural parts of the world) rather than only reasoning about the external world or having pre-programmed divisions (so they reason about themselves in a different way from how they reason about the world).
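The “absurd predictions” described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: it replaces proof search with a check over an explicit list of possible worlds, and the names `ALL_WORLDS`, `implies_everywhere`, and `believes_taking_10_yields` are hypothetical. It shows that once the agent’s knowledge rules out every world in which it takes the $10, any conditional about taking the $10 becomes vacuously true.

```python
from itertools import product

ACTIONS = (5, 10)


def payoff(action):
    # Toy environment (an assumption): you receive exactly what you take.
    return action


# A "world" is an (action, utility) pair that obeys the payoff rule.
ALL_WORLDS = [(a, u) for a, u in product(ACTIONS, (0, 5, 10)) if u == payoff(a)]


def implies_everywhere(worlds, antecedent, consequent):
    """Material implication, checked over every world still considered possible."""
    return all(consequent(w) for w in worlds if antecedent(w))


def believes_taking_10_yields(worlds, utility):
    """Does 'I take 10 -> I get `utility`' hold in all remaining worlds?"""
    return implies_everywhere(
        worlds, lambda w: w[0] == 10, lambda w: w[1] == utility
    )


# Before the agent knows its action, only the sensible conditional holds.
open_minded = ALL_WORLDS

# Once the agent "knows" it takes 5, no world with action 10 remains,
# so every conditional about taking 10 holds vacuously.
knows_it_takes_5 = [w for w in ALL_WORLDS if w[0] == 5]

print(believes_taking_10_yields(open_minded, 10))      # sensible conditional
print(believes_taking_10_yields(open_minded, 0))       # absurd conditional
print(believes_taking_10_yields(knows_it_takes_5, 0))  # absurd, yet vacuously true
```

In this toy, the self-knowing agent accepts both “taking 10 yields 10” and “taking 10 yields 0”, so nothing in its own reasoning tells it which conditional to act on.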
• Sure, one can imagine hypothetically taking $5, even if in reality they would take $10. That’s a spurious output from a different algorithm altogether. It assumes the world where you are not the same person who takes $10. So, it would make sense to examine which of the two you are, if you don’t yet know that you will take $10, but not if you already know it. Which of the two is it?

• I’m not con­vinced that an in­con­se­quen­tial grain of un­cer­tainty couldn’t han­dle this 5-10 prob­lem. Con­sider an agent whose ac­tions are prob­a­bil­ity dis­tri­bu­tions on {5,10} that are nowhere 0. We can call these points in the open af­fine space spanned by the points 5 and 10. U is then a lin­ear func­tion from this af­fine space to util­ities. The agent would search for proofs that U is some par­tic­u­lar such lin­ear func­tion. Once it finds one, it uses that lin­ear func­tion to com­pute the op­ti­mal ac­tion. To en­sure that there is an op­ti­mum, we can ad­join in­finites­i­mal val­ues to the pos­si­ble prob­a­bil­ities and util­ities.

If the agent were to find a proof that the lin­ear func­tion is the one in­duced by map­ping 5 to 5 and 10 to 0, it would re­turn (1-ε)⋅5+ε⋅10 and get util­ity 5+5ε in­stead of the ex­pected 5-5ε, so Löb’s the­o­rem wouldn’t make this self-fulfilling.
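The arithmetic in this proposal can be checked with a small sketch. It is not the commenter’s formalism: the infinitesimal ε is replaced by a small rational stand-in (an assumption made only so the numbers can be computed), and the function names are hypothetical.

```python
from fractions import Fraction


def believed_utility(p_ten):
    # The spurious linear function from the comment: action 5 -> 5, action 10 -> 0.
    return (1 - p_ten) * 5 + p_ten * 0


def actual_utility(p_ten):
    # The real payoffs: action 5 -> 5, action 10 -> 10.
    return (1 - p_ten) * 5 + p_ten * 10


# Stand-in for the infinitesimal epsilon (assumption: any small positive rational).
eps = Fraction(1, 1000)

# The optimum of the spurious function over nowhere-zero mixtures is
# (1 - eps) * "5" + eps * "10", i.e. take 10 with probability eps.
print(believed_utility(eps))  # 5 - 5*eps
print(actual_utility(eps))    # 5 + 5*eps
```

The observed utility differs from the believed one by 10ε, which is the mismatch the comment relies on to keep the spurious proof from being self-fulfilling.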

• So, your suggestion is not just an inconsequential grain of uncertainty, it is a grain of exploration. The agent actually does take 10 with some small probability. If you try to do this with just uncertainty, things would be worse, since that uncertainty would not be justified.

One prob­lem is that you ac­tu­ally do ex­plore a bunch, and since you don’t get a re­set but­ton, you will some­times ex­plore into ir­re­versible ac­tions, like shut­ting your­self off. How­ever, if the agent has a source of ran­dom­ness, and also the abil­ity to simu­late wor­lds in which that ran­dom­ness went an­other way, you can have an agent that with prob­a­bil­ity does not ex­plore ever, and learns from the other wor­lds in which it does ex­plore. So, you can ei­ther ex­plore for­ever, and shut your­self off, or you can ex­plore very very rarely and learn from other pos­si­ble wor­lds.

The prob­lem with learn­ing from other pos­si­ble wor­lds is to get good re­sults out of it, you have to as­sume that the en­vi­ron­ment does not also learn from other pos­si­ble wor­lds, which is not very em­bed­ded.

But you are sug­gest­ing ac­tu­ally ex­plor­ing a bunch, and there is a prob­lem other than just shut­ting your­self off. You are get­ting past this prob­lem in this case by only al­low­ing lin­ear func­tions, but that is not an ac­cu­rate as­sump­tion. Let’s say you are play­ing match­ing pen­nies with Omega, who has the abil­ity to pre­dict what prob­a­bil­ity you will pick but not what ac­tion you will pick.

(In match­ing pen­nies, you each choose H or T, you win if they match, they win if they don’t.)

Omega will pick H if your probability of H is less than 1/2 and T otherwise. Your utility as a function of probability is piecewise linear with two parts. Trying to assume that it will be linear will make things messy.
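A short sketch makes the piecewise-linear shape concrete. It assumes utility is simply the probability of winning, and the function names are hypothetical.

```python
from fractions import Fraction

HALF = Fraction(1, 2)


def omega_choice(p_heads):
    # Omega sees only your probability of H, not the coin you end up playing.
    return "H" if p_heads < HALF else "T"


def win_probability(p_heads):
    # You win when the two choices match.
    if omega_choice(p_heads) == "H":
        return p_heads       # you match by playing H
    return 1 - p_heads       # you match by playing T


low, high = Fraction(1, 4), Fraction(3, 4)

# A linear function would take the average of its endpoint values at the midpoint.
linear_guess_at_half = (win_probability(low) + win_probability(high)) / 2

print(linear_guess_at_half)   # what linearity in p would predict at p = 1/2
print(win_probability(HALF))  # what actually happens at p = 1/2
```

The two printed values differ, so no single linear function of the chosen probability fits this game.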

There is this prob­lem where some­times the out­come of ex­plor­ing into tak­ing 10, and the out­come of ac­tu­ally tak­ing 10 be­cause it is good are differ­ent. More on this here.

• I am talk­ing about the sur­real num­ber ε, which is smaller than any pos­i­tive real. Events of like­li­hood ε do not ac­tu­ally hap­pen, we just keep them around so the coun­ter­fac­tual rea­son­ing does not di­vide by 0.

Within the simu­la­tion, the AI might be able to con­clude that it just made an ε-like­li­hood de­ci­sion and must there­fore be in a coun­ter­fac­tual simu­la­tion. It should of course carry on as it were, in or­der to help the simu­lat­ing ver­sion of it­self.

Why shouldn’t the en­vi­ron­ment be learn­ing?

To the Omega sce­nario I would say that since we have an Omega-proof ran­dom num­ber gen­er­a­tor, we get new strate­gic op­tions that should be in­cluded in the available ac­tions. The lin­ear func­tion then goes from the ε-ad­joined open af­fine space gen­er­ated by {Pick H with prob­a­bil­ity p | p real, non-nega­tive and at most 1} to the ε-ad­joined util­ities, and we cor­rectly solve Omega’s prob­lem by us­ing p=1/​2.

• Yeah, so it’s like you have this private data, which is an infinite sequence of bits, and if you see all 0′s you take an exploration action. I think that by giving the agent these private bits and promising that the bits do not change the rest of the world, you are essentially giving the agent access to a causal counterfactual that you constructed. You don’t even have to mix with what the agent actually does, you can explore with every action and ask if it is better to explore and take 5 or explore and take 10. By doing this, you are essentially giving the agent access to a causal counterfactual, because conditioning on these infinitesimals is basically like coming in and changing what the agent does. I think giving the agent a true source of randomness actually does let you implement CDT.

If the environment learns from the other possible worlds, it might punish or reward you in one world for stuff that you do in the other world, so you can’t just ask which world is best to figure out what to do.

I agree that that is how you want to think about the match­ing pen­nies prob­lem. How­ever the point is that your pro­posed solu­tion as­sumed lin­ear­ity. It didn’t em­piri­cally ob­serve lin­ear­ity. You have to be able to tell the differ­ence be­tween the situ­a­tions in or­der to know not to as­sume lin­ear­ity in the match­ing pen­nies prob­lem. The method for tel­ling the differ­ence is how you de­ter­mine whether or not and in what ways you have log­i­cal con­trol over Omega’s pre­dic­tion of you.

• I posit that lin­ear­ity always holds. In a de­ter­minis­tic uni­verse, the lin­ear func­tion is be­tween the ε-ad­joined open af­fine space gen­er­ated by our prim­i­tive set of ac­tions and the ε-ad­joined util­ities. (Like in my first com­ment.)

In a prob­a­bil­is­tic uni­verse, the lin­ear func­tion is be­tween the ε-ad­joined open af­fine space gen­er­ated by (the set of points in) the closed af­fine space gen­er­ated by our prim­i­tive set of ac­tions and the ε-ad­joined util­ities. (Like in my sec­ond com­ment.)

I got from one of your com­ments that as­sum­ing lin­ear­ity wards off some prob­lem. Does it come back in the prob­a­bil­is­tic-uni­verse case?

• My point was that I don’t know where to assume the linearity is. Whenever I have private randomness, I have linearity over what I end up choosing with that randomness, but not linearity over what probability I choose. But I think this is not getting at the disagreement, so I pivot to:

In your model, what does it mean to prove that U is some linear affine function? If I prove that my probability p is 1/2 and that U=7.5, have I proven that U is the constant function 7.5? If there is only one value of p, it is not defined what the utility function is, unless I successfully carve the universe in such a way as to let me replace the action with various things and see what happens. (Or, assuming linearity, replace the probability with enough linearly independent things (in this case 2) to define the function.)

• In the match­ing pen­nies game, would be proven to be . A could max­i­mize this by re­turn­ing ε when isn’t , and (where ε is so small that this is still in­finites­i­mally close to 1) when is .

The lin­ear­ity is always in the func­tion be­tween ε-ad­joined open af­fine spaces. Whether the util­ities also end up lin­ear in the closed af­fine space (ie no­body cares about our rea­son­ing pro­cess) is for the ob­ject-level in­for­ma­tion gath­er­ing pro­cess to de­duce from the en­vi­ron­ment.

You never prove that you will with cer­tainty de­cide . You always leave a so-you’re-say­ing-there’s-a chance of ex­plo­ra­tion, which pro­duces a grain of un­cer­tainty. To ex­e­cute the ac­tion, you in­spect the cer­e­mo­nial Boltz­mann Bit (which is im­ple­mented by be­ing con­stantly set to “dis­card the ε”), but which you treat as hav­ing an ε chance of flip­ping.

The self-mod­ifi­ca­tion mod­ule could note that in­spect­ing that bit is a no-op, see that re­mov­ing it would make the coun­ter­fac­tual rea­son­ing mod­ule crash, and leave up the Ch­ester­ton fence.

• But how do you avoid prov­ing with cer­tainty that p=1/​2?

Since your pro­posal does not say what to do if you find in­con­sis­tent proofs that the lin­ear func­tion is two differ­ent things, I will as­sume that if it finds mul­ti­ple differ­ent proofs, it de­faults to 5 for the fol­low­ing.

Here is an­other ex­am­ple:

You are in a 5 and 10 problem. You have a twin that is also in a 5 and 10 problem. You have exactly the same source code. There is a consistency checker, and if you and your twin do different things, you both get 0 utility.

You can prove that you and your twin do the same thing. Thus you can prove that the function is 5+5p. You can also prove that your twin takes 5 by Löb’s theorem. (You can also prove that you take 5 by Löb’s theorem, but you ignore that proof, since “there is always a chance”.) Thus, you can prove that the function is 5-5p. Your system doesn’t know what to do with two functions, so it defaults to 5. (If it is provable that you both take 5, you both take 5, completing the proof by Löb.)

I am do­ing the same thing as be­fore, but be­cause I put it out­side of the agent, it does not get flagged with the “there is always a chance” mod­ule. This is try­ing to illus­trate that your pro­posal takes ad­van­tage of a sep­a­ra­tion be­tween the agent and the en­vi­ron­ment that was snuck in, and could be done in­cor­rectly.

Two pos­si­ble fixes:

1) You could say that the agent, instead of taking 5 when finding inconsistency, takes some action that exhibits the inconsistency (something on which the two functions give different values). This is very similar to the chicken rule, and if you add something like this, you don’t really need the rest of your system. Take an agent that, whenever it proves it does something, does something else. This agent will prove (given enough time) that if it takes 5 it gets 5, and if it takes 10 it gets 10.

2) I had one proof sys­tem, and just ig­nored the proofs that I found that I did a thing. I could in­stead give the agent a spe­cial proof sys­tem that is in­ca­pable of prov­ing what it does, but how do you do that? Chicken rule seems like the place to start.

One problem with the chicken rule is that it was developed in a system that was deductively closed, so you can’t prove something that passes through a proof of P without proving P. If you violate this, by having a random theorem prover, you might have a system that fails to prove “I take 5” but proves “I take 5 and 1+1=2” and uses this to complete the Löb loop.

• I can’t prove what I’m go­ing to do and I can’t prove that I and the twin are go­ing to do the same thing, be­cause of the Boltz­mann Bits in both of our de­ci­sion-mak­ers that might turn out differ­ent ways. But I can prove that we have a chance of do­ing the same thing, and my ex­pected util­ity is , round­ing to once it ac­tu­ally hap­pens.


• Content feedback: the inferential distance between Löb’s theorem and spurious counterfactuals seems larger than that of the other points. Maybe that’s because I haven’t internalised the theorem, not being a logician and all.

Un­nec­es­sary nit­pick: the gears in the robot’s brain would turn just fine as drawn: since the outer gears are both turn­ing an­ti­clock­wise, the in­ner gear would just turn clock­wise. (I think my in­ner en­g­ineer is show­ing)

• If you know your own ac­tions, why would you rea­son about tak­ing differ­ent ac­tions? Wouldn’t you rea­son about some­one who is al­most like you, but just differ­ent enough to make a differ­ent choice?

• Sure. How do you do that?

• Notice (well, you already know that) that accepting that identical agents make identical decisions (superrationality, as it were), and that to make different decisions in identical circumstances the agents must necessarily be different, gets you out of many pickles. For example, in the 5&10 game an agent would examine its own algorithm, see that it leads to taking $10 and stop there. There is no “what would happen if you took a different action”, because the agent taking a different action would not be you, not exactly. So, no Löbian obstacle. In return, you give up something a lot more emotionally valuable: the delusion of making conscious decisions. Pick your poison.

• For example, in the 5&10 game an agent would examine its own algorithm, see that it leads to taking $10 and stop there.

Why do even that much if this reasoning could not be used? The question is about the reasoning that could contribute to the decision, that could describe the algorithm, and so has the option to not “stop there”. What if you see that your algorithm leads to taking the $10 and instead of stopping there, you take the $5?

Noth­ing stops you. This is the “chicken rule” and it solves some is­sues, but more im­por­tantly illus­trates the pos­si­bil­ity in how a de­ci­sion al­gorithm can func­tion. The fact that this is a thing is ev­i­dence that there may be some­thing wrong with the “stop there” pro­posal. Speci­fi­cally, you usu­ally don’t know that your rea­son­ing is ac­tual, that it’s even log­i­cally pos­si­ble and not part of an im­pos­si­ble coun­ter­fac­tual, but this is not a hope­less hy­po­thet­i­cal where noth­ing mat­ters. Noth­ing com­pels you to af­firm what you know about your ac­tions or con­clu­sions, this is not a ne­ces­sity in a de­ci­sion mak­ing al­gorithm, but differ­ent things you do may have an im­pact on what hap­pens, be­cause the situ­a­tion may be ac­tual af­ter all, de­pend­ing on what hap­pens or what you de­cide, or it may be pre­dicted from within an ac­tual situ­a­tion and in­fluence what hap­pens there. This mo­ti­vates learn­ing to rea­son in and about pos­si­bly im­pos­si­ble situ­a­tions.

What if you examine your algorithm and find that it takes the $5 instead? It could be the same algorithm that takes the $10, but you don’t know that, instead you arrive at the $5 conclusion using reasoning that could be impossible, but that you don’t know to be impossible, that you haven’t decided yet to make impossible. One way to solve the issue is to render the situation where that holds impossible, by contradicting the conclusion with your action, or in some other way. To know when to do that, you should be able to reason about and within such situations that could be impossible, or could be made impossible, including by the decisions made in them. This makes the way you reason in them relevant, even when in the end these situations don’t occur, because you don’t a priori know that they don’t occur.

(The 5-and-10 problem is not specifically about this issue, and explicit reasoning about impossible situations may be avoided, perhaps should be avoided, but my guess is that the crux in this comment thread is about things like usefulness of reasoning from within possibly impossible situations, where even your own knowledge arrived at by pure computation isn’t necessarily correct.)

• Thank you for your explanation! Still trying to understand it. I understand that there is no point examining one’s algorithm if you already execute it and see what it does.

What if you see that your algorithm leads to taking the $10 and instead of stopping there, you take the $5?

I don’t understand that point. You say “nothing stops you”, but that is only possible if you could act contrary to your own algorithm, no? Which makes no sense to me, unless the same algorithm gives different outcomes for different inputs, e.g. “if I simply run the algorithm, I take $10, but if I examine the algorithm before running it and then run it, I take $5”. But it doesn’t seem like the thing you mean, so I am confused.

What if you examine your algorithm and find that it takes the $5 instead?

How can it be possible? If your examination of your algorithm is accurate, it gives the same outcome as mindlessly running it, which is taking $10, no?

It could be the same algorithm that takes the $10, but you don’t know that, instead you arrive at the $5 conclusion using reasoning that could be impossible, but that you don’t know to be impossible, that you haven’t decided yet to make impossible.

So your reasoning is inaccurate, in that you arrive at a wrong conclusion about the algorithm output, right? You just don’t know where the error lies, or even that there is an error to begin with. But in this case you would arrive at a wrong conclusion about the same algorithm run by a different agent, right? So there is nothing special about it being your own algorithm and not someone else’s. If so, the issue is reduced to finding an accurate algorithm analysis tool, for an algorithm that demonstrably halts in a very short time, producing one of the two possible outcomes. This seems to have little to do with decision theory issues, so I am lost as to how this is relevant to the situation.

I am clearly missing some of your logic here, but I still have no idea what the missing piece is, unless it’s the libertarian free will thing, where one can act contrary to one’s programming. Any further help would be greatly appreciated.

• I understand that there is no point examining one’s algorithm if you already execute it and see what it does.

Rather there is no point if you are not going to do anything with the results of the examination. It may be useful if you make the decision based on what you observe (about how you make the decision).

you say “nothing stops you”, but that is only possible if you could act contrary to your own algorithm, no?

You can, for a certain value of “can”. It won’t have happened, of course, but you may still decide to act contrary to how you act, two different outcomes of the same algorithm. The contradiction proves that you didn’t face the situation that triggers it in actuality, but the contradiction results precisely from deciding to act contrary to the observed way in which you act, in a situation that a priori could be actual, but is rendered counterlogical as a result of your decision. If instead you affirm the observed action, then there is no contradiction and so it’s possible that you have faced the situation in actuality. Thus the “chicken rule”, playing chicken with the universe, making the present situation impossible when you don’t like it.

So your reasoning is inaccurate

You don’t know that it’s inaccurate, you’ve just run the computation and it said $5. Maybe this didn’t actually happen, but you are considering this situation without knowing if it’s actual. If you ignore the computation, then why run it? If you run it, you need responses to all possible results, and all possible results except one are not actual, yet you should be ready to respond to them without knowing which is which. So I’m discussing what you might do for the result that says that you take the $5. And in the end, the use you make of the results is by choosing to take the $5 or the $10.

This map from predictions to decisions could be anything. It’s trivial to write an algorithm that includes such a map. Of course, if the map diagonalizes, then the predictor will fail (won’t give a prediction), but the map is your reasoning in these hypothetical situations, and the fact that the map may say anything corresponds to the fact that you may decide anything. The map doesn’t have to be identity, decision doesn’t have to reflect prediction, because you may write an algorithm where it’s not identity.
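The “map from predictions to decisions” can be written down directly. The following is a minimal sketch with hypothetical names, not anyone’s proposed agent design: it lists three such maps and checks which predictions each map would actually fulfil.

```python
ACTIONS = (5, 10)


def affirm(prediction):
    # Identity map: do whatever the prediction says you do.
    return prediction


def diagonalize(prediction):
    # "Chicken rule" style map: do the opposite of what was predicted.
    return 10 if prediction == 5 else 5


def always_ten(prediction):
    # Ignore the prediction and take the $10.
    return 10


def consistent_predictions(decision_map):
    """Predictions that the agent would actually make true."""
    return [p for p in ACTIONS if decision_map(p) == p]


for decision_map in (affirm, diagonalize, always_ten):
    print(decision_map.__name__, consistent_predictions(decision_map))
```

With the identity map, both predictions are self-fulfilling, so a prediction of $5 can lock in the $5. With the diagonalizing map, no prediction can be accurate, which matches the remark that the predictor fails. A map that ignores the prediction leaves exactly one consistent answer.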
• You can, for a certain value of “can”. It won’t have happened, of course, but you may still decide to act contrary to how you act, two different outcomes of the same algorithm.

This confuses me even more. You can imagine acting contrary to your own algorithm, but imagining different possible outcomes is a side effect of running the main algorithm that takes $10. It is never the outcome of it. Or an outcome. Since you know you will end up taking $10, I also don’t understand the idea of playing chicken with the universe. Are there any references for it?

You don’t know that it’s inaccurate, you’ve just run the computation and it said $5.

Wait, what? We started with the assumption that examining the algorithm, or running it, shows that you will take $10, no? I guess I still don’t understand how

What if you see that your algorithm leads to taking the $10 and instead of stopping there, you take the $5?

is even pos­si­ble, or worth con­sid­er­ing.

This map from pre­dic­tions to de­ci­sions could be any­thing.

Hmm, maybe this is where I miss some of the logic. If the pre­dic­tions are ac­cu­rate, the map is bi­jec­tive. If the pre­dic­tions are in­ac­cu­rate, you need a bet­ter al­gorithm anal­y­sis tool.

The map doesn’t have to be iden­tity, de­ci­sion doesn’t have to re­flect pre­dic­tion, be­cause you may write an al­gorithm where it’s not iden­tity.

To me this screams “get a bet­ter al­gorithm an­a­lyzer!” and has noth­ing to do with whether it’s your own al­gorithm, or some­one else’s. Can you maybe give an ex­am­ple where one ends up in a situ­a­tion where there is no ob­vi­ous al­gorithm an­a­lyzer one can ap­ply?

• It was not un­til read­ing this that I re­ally un­der­stood that I am in the habit of rea­son­ing about my­self as just a part of the en­vi­ron­ment.

• The kicker is that we don’t rea­son di­rectly about our­selves as such, we use a sim­plified model of our­selves. And we’re REALLY GOOD at us­ing that model for causal rea­son­ing, even when it is re­flec­tive, and in­volves mul­ti­ple lev­els of self-re­flec­tion and coun­ter­fac­tu­als—at least when we bother to try. (We try rarely be­cause ex­plicit mod­el­ling is cog­ni­tively de­mand­ing, and we usu­ally use de­faults /​ con­di­tioned rea­son­ing. Some­times that’s OK.)

Ex­am­ple: It is 10PM. A 5-page re­port is due in 12 hours, at 10AM.

De­fault: Go to sleep at 1AM, set alarm for 8AM. Re­sult: Don’t finish re­port tonight, have too lit­tle time to do so to­mor­row.

Conditioned reasoning: Stay up to finish the report first. 5 hours of work, and stay up until 3AM. Result: Write bad report, still feel exhausted the next day.

Coun­ter­fac­tual rea­son­ing: I should nap /​ get some amount of sleep so that I am bet­ter able to con­cen­trate, which will out­weigh the lost time. I could set my alarm for any amount of time; what amount does my model of my­self im­ply will lead to an op­ti­mal well-rested /​ suffi­cient time trade-off?

Self-re­flec­tion prob­lem, sec­ond use of mini-self model: I’m worse at rea­son­ing at 1AM than I am at 10PM. I should de­cide what to do now, in­stead of de­lay­ing un­til then. I think go­ing to sleep at 12AM and wak­ing at 3AM gives me enough rest and time to do a good job on the re­port.

Con­sider coun­ter­fac­tual and im­pact: How does this im­pact the rest of my week’s sched­ule? 3 hours is lo­cally op­ti­mal, but I will crash to­mor­row and I have a test to study for the next day. De­cide to work a bit, go to sleep at 12:30 and set alarm for 5:30AM. Finish the re­port, turn it in by 10AM, then nap an­other 2 hours be­fore study­ing.

We built this model based on not only small samples of our own history, but learning from others, incorporating data about seeing other people’s experiences. We don’t consider staying up all night and then driving to hand in the report, because we realize exhausted driving is dangerous, because we heard stories of people doing so, and know that we would be similarly unsteady. Is a person going to explore and try different strategies by staying up all night and driving? If you die, you can’t learn from the experience, so you have good ideas about what parts of the exploration space are safe to try. You might use Adderall because it’s been tried before and is relatively safe, but you don’t ingest arbitrary drugs to see if they help you think.

BUT an AI doesn’t (at first) have that sam­ple data to rea­son from, nor does a sin­gle­ton have ob­ser­va­tion of other near-struc­turally iden­ti­cal AI sys­tems and the im­pacts of their de­ci­sions, nor does it have a fun­da­men­tal un­der­stand­ing about what is safe to ex­plore.

• Con­tent feed­back:

The Pre­face to the Se­quence on Value Learn­ing con­tains the fol­low­ing ad­vice on re­search di­rec­tions for that se­quence:

If you try to dis­prove the ar­gu­ments in the posts, or to cre­ate for­mal­isms that sidestep the is­sues brought up, you may very well gen­er­ate a new in­ter­est­ing di­rec­tion of work that has not been con­sid­ered be­fore.

This pro­vides spe­cific di­rec­tion on what to look at and what work needs done. If such a state­ment for this se­quence is pos­si­ble, I think it would be valuable to in­clude.

• ## Thoughts on coun­ter­fac­tual reasoning

Th­ese ex­am­ples of coun­ter­fac­tu­als are pre­sented as equiv­a­lent, but they seem mean­ingfully dis­tinct:

What if the sun sud­denly went out?
What if 2+2=3?

Speci­fi­cally, they don’t seem equally difficult for me to eval­u­ate. I can eas­ily imag­ine the sun go­ing out, but I’m not even sure what it would mean if 2+2=3. It con­fuses me that these two differ­ent ex­am­ples are pre­sented as equiv­a­lent, be­cause they seem to be in­stances of mean­ingfully dis­tinct classes of some­thing. I spent some time try­ing to char­ac­ter­ize why the sun ex­am­ple is in­tu­itively easy for me and the math ex­am­ple is in­tu­itively difficult for me. I came up with some ideas, but I won’t go into de­tails yet be­cause they seem like the ob­vi­ous sorts of things that any­one who has read The Se­quences (a.k.a., Ra­tion­al­ity: A-Z) would have thought of. I strongly sus­pect there’s prior work. It is also pos­si­ble that I don’t fully un­der­stand the prob­lem yet.

The two coun­ter­fac­tual rea­son­ing ex­am­ples above (and oth­ers) are pre­sented as equiv­a­lent, but they seem like they are not.

1. Is this an in­ten­tional sim­plifi­ca­tion for the benefit of new read­ers?

2. If so, can some­one point me to the prior work ex­plor­ing the omit­ted nu­ances of coun­ter­fac­tu­als? I don’t want to re-in­vent the wheel.

3. If not, would ex­plo­ra­tion of the char­ac­ter­is­tics of differ­ent kinds of coun­ter­fac­tu­als be a fruit­ful area of re­search?

• I won­der how much an agent could achieve by think­ing along the fol­low­ing lines:

Big Brad is a hu­man-shaped robot who works as a lum­ber­jack. One day his em­ployer sends him into town on his mo­tor­bike car­ry­ing two chain­saws, to get them sharp­ened. Brad no­tices an un­usual num­ber of the hu­mans around him sud­denly cross­ing streets to keep their dis­tance from him.

Maybe they don’t like the smell of chain­saw oil? So he asks one rather slow pedes­trian “Why are peo­ple keep­ing their dis­tance?” to which the pedes­trian replies “Well, what if you at­tacked us?”

Now in the pedes­trian’s mind, that’s a rea­son­able re­sponse. If Big Brad did at­tack some­one walk­ing next to them, with­out no­tice, Brad would be able to cut them in half. To hu­mans who ex­pect large bike-rid­ing peo­ple car­ry­ing po­ten­tial weapons to be dis­pro­por­tionately likely to be vi­o­lent with­out no­tice, be­ing at­tacked by Brad seems a rea­son­able fear, wor­thy of ex­pend­ing a lit­tle effort to al­ter walk­ing routes to al­low run­ning away if Brad is vi­o­lent.

But Brad knows that Brad would never do such a thing. Ini­tially, it might seem like ask­ing Brad “What if 2 + 2 equalled 3?”

But if Brad can think about the prob­lem in terms of what in­for­ma­tion is available to the var­i­ous ac­tors in the sce­nario, he can re­frame the pedes­trian’s ques­tion as: “What if an agent that, given the in­for­ma­tion I have so far, is in­dis­t­in­guish­able from you, were to at­tack us?”

If Brad is aware that random pedestrians in the street don’t know Brad personally, to the level of being confident about Brad’s internal rules and values, and he can hypothesise the existence of an alternative being, Brad′, that a pedestrian might consider would plausibly exist and would have different internal rules and values to those of Brad yet otherwise appear identical, then Brad has a way forwards to think through the problem.

On the more general question of whether it would be useful for Brad to have the ability to ask himself: “What if the universe were other than I think it is? What if magic works and I just don’t know that yet? What if my self-knowledge isn’t 100% reliable, because there are embedded commands in my own code that I’m currently being kept from being aware of by those same commands? Perhaps I should allocate a minute probability to the scenario that somewhere there exists a lightswitch that’s magically connected to the sun and which, in defiance of known physics, can just turn it off and on?”, I think that careful allocation of probabilities might avoid divide-by-zero problems, but I don’t think it is a panacea; there are additional approaches to counterfactual thinking that may be more productive in some circumstances.

• Why does be­ing up­date­less re­quire think­ing through all pos­si­bil­ities in ad­vance? Can you not make a gen­eral com­mit­ment to fol­low UDT, but wait un­til you ac­tu­ally face the de­ci­sion prob­lem to figure out which spe­cific ac­tion UDT recom­mends tak­ing?

• Sure, but what com­pu­ta­tion do you then do, to figure out what UDT recom­mends? You have to have, writ­ten down, a spe­cific prior which you eval­u­ate ev­ery­thing with. That’s the prob­lem. As dis­cussed in Embed­ded World Models, a Bayesian prior is not a very good ob­ject for an em­bed­ded agent’s be­liefs, due to re­al­iz­abil­ity/​grain-of-truth con­cerns; that is, speci­fi­cally be­cause a Bayesian prior needs to list all pos­si­bil­ities ex­plic­itly (to a greater de­gree than, e.g., log­i­cal in­duc­tion).
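A minimal Python sketch of the "list all possibilities explicitly" point. The coin-bias hypotheses and all names are illustrative assumptions, not taken from the post. A Bayesian update can only move weight between hypotheses that were written down in advance, so anything missing from the list never receives any weight.

```python
from fractions import Fraction


def bayes_update(prior, likelihoods):
    """Return the posterior over exactly the hypotheses listed in `prior`.

    prior: dict mapping hypothesis name -> prior probability
    likelihoods: dict mapping hypothesis name -> P(observation | hypothesis)
    """
    unnormalized = {h: prior[h] * likelihoods[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: weight / total for h, weight in unnormalized.items()}


# Hypothetical prior: only two coin biases were listed up front.
prior = {"p_heads=1/2": Fraction(1, 2), "p_heads=3/4": Fraction(1, 2)}

# Observe a single "heads"; each listed hypothesis assigns it a likelihood.
posterior = bayes_update(
    prior,
    {"p_heads=1/2": Fraction(1, 2), "p_heads=3/4": Fraction(3, 4)},
)
```

Here the posterior shifts toward "p_heads=3/4", but a coin with some unlisted bias can never show up in it, however much evidence arrives.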

• I think I don’t un­der­stand the Löb’s the­o­rem ex­am­ple.

If is prov­able, then , so it is true (be­cause the state­ment about is vac­u­ously true). Hence by Löb’s the­o­rem, it’s prov­able, so we get .

If is prov­able, then it’s true, for the dual rea­son. So by Löb, it’s prov­able, so .

The broader point about be­ing un­able to rea­son your­self out of a bad de­ci­sion if your prior for your own de­ci­sions doesn’t con­tain a “grain of truth” makes sense, but it’s not clear we can show that the agent in this ex­am­ple will definitely get stuck on the bad de­ci­sion—if any­thing, the above ar­gu­ment seems to show that the sys­tem has to be in­con­sis­tent! If that’s true, I would guess that the source of this in­con­sis­tency is as­sum­ing the agent has suffi­cient re­flec­tive ca­pac­ity to prove “If I can prove , then . Which would sug­gest learn­ing the les­son that it’s hard for agents to rea­son about their own be­havi­our with log­i­cal con­sis­tency.

• The agent has been con­structed such that Prov­able(“5 is the best pos­si­ble ac­tion”) im­plies that 5 is the best (only!) pos­si­ble ac­tion. Then by Löb’s the­o­rem, 5 is the only pos­si­ble ac­tion. It can­not also be si­mul­ta­neously con­structed such that Prov­able(“10 is the best pos­si­ble ac­tion”) im­plies that 10 is the only pos­si­ble ac­tion, be­cause then it would also fol­low that 10 is the only pos­si­ble ac­tion. That’s not just our proof sys­tem be­ing in­con­sis­tent, that’s false!

• (There was a LaTeX er­ror in my com­ment, which made it to­tally illeg­ible. But I think you man­aged to re­solve my con­fu­sion any­way).

I see! It’s not prov­able that Prov­able() im­plies . It seems like it should be prov­able, but the ob­vi­ous ar­gu­ment re­lies on the as­sump­tion that, if * is prov­able, then it’s not also prov­able that - in other words, that the proof sys­tem is con­sis­tent! Which may be true, but is not prov­able.

The asym­me­try be­tween 5 and 10 is that, to choose 5, we only need a proof that 5 is op­ti­mal, but to choose 10, we need to not find a proof that 5 is op­ti­mal. Which seems eas­ier than find­ing a proof that 10 is op­ti­mal, but is not prov­ably eas­ier.
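The asymmetry can be sketched in Python, assuming the proof search is abstracted away as a caller-supplied function. `search_for_proof_5_is_optimal` is a hypothetical stand-in, not a real theorem prover, and this is only a reading of the agent as described in this thread.

```python
def five_and_ten_agent(search_for_proof_5_is_optimal):
    """Return 5 if the (hypothetical) proof search succeeds, otherwise 10.

    Choosing 5 needs a positive result: a proof that 5 is optimal.
    Choosing 10 needs only the absence of that proof; no proof about
    10 is ever requested.
    """
    if search_for_proof_5_is_optimal():
        return 5
    return 10


# Stub searches standing in for a bounded proof search.
choice_when_proof_found = five_and_ten_agent(lambda: True)    # 5
choice_when_no_proof = five_and_ten_agent(lambda: False)      # 10
```

Nothing in the sketch guarantees that the search fails when it "should"; that is exactly the part that is true only if the proof system is consistent.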

• This tends to as­sume that we can de­tan­gle things enough to see out­comes as a func­tion of our ac­tions.

No. The as­sump­tion is that an agent has *agency* over some de­grees of free­dom of the en­vi­ron­ment. It’s not even an as­sump­tion, re­ally; it’s part of the defi­ni­tion of an agent. What is an agent with no agency?

If the agent’s ac­tions have no in­fluence on the state of the en­vi­ron­ment, then it can’t drive the state of the en­vi­ron­ment to satisfy any ob­jec­tive. The whole point of build­ing an in­ter­nal model of the en­vi­ron­ment is to un­der­stand how the agent’s ac­tions in­fluence the en­vi­ron­ment. In other words: “de­tan­gling things enough to see out­comes as func­tions of [the agent’s] ac­tions” isn’t just an as­sump­tion, it’s es­sen­tial.

The only point I can see in writing the above sentence would be if you said that a function isn’t, generally, enough to describe the relationship between an agent’s actions and the outcome: that you generally need some higher-level construct like a Turing machine. That would be fair enough if it weren’t for the fact that the theory you’re comparing yours to is AIXI, which explicitly models the relationship between actions and outcomes via Turing machines.

AIXI rep­re­sents the agent and the en­vi­ron­ment as sep­a­rate units which in­ter­act over time through clearly defined I/​O chan­nels so that it can then choose ac­tions max­i­miz­ing re­ward.

Do you propose a model in which the relationship between the agent and the environment is undefined?

When the agent model is part of the en­vi­ron­ment model, it can be sig­nifi­cantly less clear how to con­sider tak­ing al­ter­na­tive ac­tions.

Really? It seems you’re applying magical thinking to the consequences of embedding one Turing machine within another. Why would its I/O or internal modeling change so drastically? If I use a virtual machine to run Windows within Linux, does that make the experience of using MS Paint fundamentally different than running Windows in a native boot?

...there can be other copies of the agent, or things very similar to the agent.
Depend­ing on how you draw the bound­ary around “your­self”, you might think you con­trol the ac­tion of both copies or only your own.

How is that unclear? If the agent doesn’t actually control the copies, then there’s no reason to imagine it does. If it’s trying to figure out how best to exercise its agency to satisfy its objective, then imagining it has any more agency than it actually does is silly. You don’t need to wander into the philosophical no-man’s-land of defining the “self”. It’s irrelevant. What are your degrees of freedom? How can you use them to satisfy your objective? At some point, the I/O channels *must be* well defined. It’s not like a processor has an ambiguous number of pins. It’s not like a human has an ambiguous number of motor neurons.

For all intents and purposes: the agent IS the degrees of freedom it controls. The agent can only change its state, which, being a subset of the environment’s state, changes the environment in some way. You can’t lift a box, you can only change the position of your arms. If that results in a box being lifted, good! Or maybe you can’t change the position of those arms, you can only change the electric potential on some motor neurons; if that results in arms moving, good! Play that game long enough and, at some point, the set of actions you can do is finite and clearly defined.

Your five-or-ten prob­lem is one of many that demon­strate the brit­tle­ness prob­lem of logic-based sys­tems op­er­at­ing in the real world. This is well known. Peo­ple have all but aban­doned logic-based sys­tems for stochas­tic sys­tems when deal­ing with real-world prob­lems speci­fi­cally be­cause it’s effec­tively im­pos­si­ble to make a ro­bust logic-based sys­tem.

This is the crux of a lot of your dis­cus­sion. When you talk about an agent “know­ing” its own ac­tions or the “cor­rect­ness” of coun­ter­fac­tu­als, you’re talk­ing about defini­tive re­sults which a real-world agent would never have ac­cess to.

It’s pos­si­ble (though un­likely) for a cos­mic ray to dam­age your cir­cuits, in which case you could go right—but you would then be in­sane.

If a rare, spontaneous occurrence causes you to go right, you must be insane? What? Is that really the only conclusion you could draw from that situation? If I take a photo and a cosmic ray causes one of the pixels to register white, do I need to throw my camera out because it might be “insane”?!

Maybe we can force ex­plo­ra­tion ac­tions so that we learn what hap­pens when we do things?

First of all, who is “we” in this case? Are we the agent or are we some out­side sys­tem “forc­ing” the agent to ex­plore?

Ideally, no­body would have to force the agent to ex­plore its world. It would want to ex­plore and ex­per­i­ment as an in­stru­men­tal goal to lower un­cer­tainty in its model of the world so that it can bet­ter pur­sue its ob­jec­tive.
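For concreteness, here is a sketch of what "forced" exploration usually looks like in practice. It uses epsilon-greedy action selection, which is my choice of illustration and not something the post specifies; the function name and the value estimates are hypothetical.

```python
import random


def epsilon_greedy_action(estimates, epsilon, rng):
    """Pick an action index from a list of estimated values.

    With probability `epsilon` the choice is uniformly random (the
    exploration is imposed from outside the agent's own estimates);
    otherwise the action with the highest current estimate is taken.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))
    return max(range(len(estimates)), key=lambda a: estimates[a])


# Hypothetical estimates: index 0 = "take $5", index 1 = "take $10",
# where the agent wrongly believes the second option is worth nothing.
bad_estimates = [5.0, 0.0]

rng = random.Random(0)
never_explores = {epsilon_greedy_action(bad_estimates, 0.0, rng) for _ in range(200)}
always_explores = {epsilon_greedy_action(bad_estimates, 1.0, rng) for _ in range(200)}
```

With `epsilon = 0.0` the agent only ever takes index 0 and so never corrects its estimate for index 1; with `epsilon = 1.0` both actions get tried regardless of what the agent believes.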

A bad prior can think that ex­plor­ing is dangerous

That’s not a bad prior. Ex­plor­ing *is* fun­da­men­tally dan­ger­ous. You’re en­coun­ter­ing the un­known. I’m not even sure if the risk/​re­ward ra­tio of ex­plor­ing is de­cid­able. It’s cer­tainly a hard prob­lem to de­ter­mine when it’s bet­ter to ex­plore, and when it’s too dan­ger­ous. Millions of the most so­phis­ti­cated biolog­i­cal neu­ral net­works the planet Earth has to offer have grap­pled with the ques­tion for hun­dreds of years with no clear an­swer.

Forc­ing it to take ex­plo­ra­tory ac­tions doesn’t teach it what the world would look like if it took those ac­tions de­liber­ately.

What? Again *who* is doing the “forcing” in this situation and how? Do you really want to tread into the other philosophical no-man’s-land of free will? Why would the question of whether the agent really wanted to take an action have any bearing whatsoever on the result of that action? I’m so confused about what this sentence even means.

EDIT: The point of the discussion on counterfactuals is also unclear to me. Counterfactuals are of dubious utility for short-term evaluation of outcomes. They become less useful the further you separate the action from the result in time. I could think, “damn! I should have taken an alternate route to work this morning!”, which is arguably useful and may actually be wrong, but if I think, “damn, if Eric the Red hadn’t sailed to the new world, Hitler would have never risen to power!”, that’s not only extremely questionable, but what use would that pondering be even if it were correct?

It seems like you’re say­ing an em­bed­ded agent can’t enu­mer­ate the pos­si­ble out­comes of its ac­tions be­fore tak­ing them, so it can only do so in ret­ro­spect. In which case, why can’t an em­bed­ded agent perform a pre-emp­tive tree search like any other agent? What’s the point of coun­ter­fac­tu­als?

• At some point, the I/​O chan­nels *must be* well defined.

This state­ment is pre­cisely what is be­ing challenged—and for good rea­son: it’s un­true. The rea­son it’s un­true is be­cause the con­cept of “I/​O chan­nels” does not ex­ist within physics as we know it; the true laws of physics make no refer­ence to in­puts, out­puts, or in­deed any kind of agents at all. In re­al­ity, that which is con­sid­ered a com­puter’s “I/​O chan­nels” are sim­ply ar­range­ments of mat­ter and en­ergy, the same as ev­ery­thing else in our uni­verse. There are no spe­cial XML tags at­tached to those con­figu­ra­tions of mat­ter and en­ergy, mark­ing them “in­put”, “out­put”, “pro­ces­sor”, etc. Such a no­tion is un­phys­i­cal.

Why might this dis­tinc­tion be im­por­tant? It’s im­por­tant be­cause an al­gorithm that is im­ple­mented on phys­i­cally ex­ist­ing hard­ware can be phys­i­cally dis­rupted. Any no­tion of agency which fails to ac­count for this pos­si­bil­ity—such as, for ex­am­ple, AIXI, which sup­poses that the only in­ter­ac­tion it has with the rest of the uni­verse is by ex­chang­ing bits of in­for­ma­tion via the in­put/​out­put chan­nels—will fail to con­sider the pos­si­bil­ity that its own op­er­a­tion may be dis­rupted. A phys­i­cal im­ple­men­ta­tion of AIXI would have no re­gard for the safety of its hard­ware, since it has no means of rep­re­sent­ing the fact that the de­struc­tion of its hard­ware equates to its own de­struc­tion.

AIXI also fails on var­i­ous de­ci­sion prob­lems that in­volve leak­ing in­for­ma­tion via a phys­i­cal side chan­nel that it doesn’t con­sider part of its out­put; for ex­am­ple, it has no re­gard for the ther­mal emis­sions it may pro­duce as a side effect of its com­pu­ta­tions. In the ex­treme case, AIXI is in­ca­pable of con­cep­tu­al­iz­ing the pos­si­bil­ity that an ad­ver­sar­ial agent may be able to in­spect its hard­ware, and hence “read its mind”. This re­flects a broader failure on AIXI’s part: it is in­ca­pable of rep­re­sent­ing an en­tire class of hy­pothe­ses—namely, hy­pothe­ses that in­volve AIXI it­self be­ing mod­eled by other agents in the en­vi­ron­ment. This is, again, be­cause AIXI is defined us­ing a frame­work that makes it un­phys­i­cal: the clas­si­cal defi­ni­tion of AIXI is un­com­putable, mak­ing it too “big” to be mod­eled by any (part) of the Tur­ing ma­chines in its hy­poth­e­sis space. This ap­plies even to com­putable for­mu­la­tions of AIXI, such as AIXI-tl: they have no way to rep­re­sent the pos­si­bil­ity of be­ing simu­lated by oth­ers, be­cause they as­sume they are too large to fit in the uni­verse.

I’m not sure what ex­actly is so hard to un­der­stand about this, con­sid­er­ing the origi­nal post con­veyed all of these ideas fairly well. It may be worth con­sid­er­ing the as­sump­tions you’re op­er­at­ing un­der—and in par­tic­u­lar, mak­ing sure that the post it­self does not vi­o­late those as­sump­tions—be­fore crit­i­ciz­ing said post based on those as­sump­tions.

• The rea­son it’s un­true is be­cause the con­cept of “I/​O chan­nels” does not ex­ist within physics as we know it.

Yes. They most cer­tainly do. The only truly con­sis­tent in­ter­pre­ta­tion I know of cur­rent physics is in­for­ma­tion the­o­retic any­way, but I’m not in­ter­ested in de­bat­ing any of that. The fact is I’m com­mu­ni­cat­ing to you with phys­i­cal I/​O chan­nels right now so I/​O chan­nels cer­tainly ex­ist in the real world.

the true laws of physics make no refer­ence to in­puts, out­puts, or in­deed any kind of agents at all.

Agents are emergent phenomena. They don’t exist on the level of particles and waves. The concept is an abstraction.

“I/​O chan­nels” are sim­ply ar­range­ments of mat­ter and en­ergy, the same as ev­ery­thing else in our uni­verse. There are no spe­cial XML tags at­tached to those con­figu­ra­tions of mat­ter and en­ergy, mark­ing them “in­put”, “out­put”, “pro­ces­sor”, etc. Such a no­tion is un­phys­i­cal.

An I/​O chan­nel doesn’t im­ply mod­ern com­puter tech­nol­ogy. It just means in­for­ma­tion is col­lected from or im­printed upon the en­vi­ron­ment. It could be ant pheromones, it could be smoke sig­nals, its phys­i­cal im­ple­men­ta­tion is sec­ondary to the ab­stract con­cept of send­ing and re­ceiv­ing in­for­ma­tion of some kind. You’re not see­ing the for­est through the trees. In­for­ma­tion most cer­tainly does ex­ist.

Why might this dis­tinc­tion be im­por­tant? It’s im­por­tant be­cause an al­gorithm that is im­ple­mented on phys­i­cally ex­ist­ing hard­ware can be phys­i­cally dis­rupted. Any no­tion of agency which fails to ac­count for this pos­si­bil­ity—such as, for ex­am­ple, AIXI, which sup­poses that the only in­ter­ac­tion it has with the rest of the uni­verse is by ex­chang­ing bits of in­for­ma­tion via the in­put/​out­put chan­nels—will fail to con­sider the pos­si­bil­ity that its own op­er­a­tion may be dis­rupted.

I’ve explained in previous posts that AIXI is a special case of AIXI-tl. AIXI-tl can be conceived of in an embedded context, in which case its model of the world would include a model of itself which is subject to any sort of environmental disturbance.

To some extent, an agent must trust its own operation to be correct, because you quickly run into infinite regression if the agent is modeling all the possible ways that it could be malfunctioning. What if the malfunction affects the way it models the possible ways it could malfunction? It should model all the ways a malfunction could disrupt how it models all the ways it could malfunction, right? It’s like saying “well the agent could malfunction, so it should be aware that it can malfunction so that it never malfunctions”. If the thing malfunctions, it malfunctions, it’s as simple as that.

Aside from that, AIXI is meant to be a purely math­e­mat­i­cal for­mal­iza­tion, not a phys­i­cal im­ple­men­ta­tion. It’s an ab­strac­tion by de­sign. It’s meant to be used as a math­e­mat­i­cal tool for un­der­stand­ing in­tel­li­gence.

AIXI also fails on var­i­ous de­ci­sion prob­lems that in­volve leak­ing in­for­ma­tion via a phys­i­cal side chan­nel that it doesn’t con­sider part of its out­put; for ex­am­ple, it has no re­gard for the ther­mal emis­sions it may pro­duce as a side effect of its com­pu­ta­tions.

Do you consider how the 30 Watts leaking out of your head might affect your plans every day? I mean, it might cause a typhoon in Timbuktu! If you don’t consider how the waste heat produced by your mental processes affects your environment while making long or short-term plans, you must not be a real intelligent agent...

In the ex­treme case, AIXI is in­ca­pable of con­cep­tu­al­iz­ing the pos­si­bil­ity that an ad­ver­sar­ial agent may be able to in­spect its hard­ware, and hence “read its mind”.

AIXI can’t play tic-tac-toe with it­self be­cause that would mean it would have to model it­self as part of the en­vi­ron­ment which it can’t do. Yes, I know there are fun­da­men­tal prob­lems with AIXI...

This is, again, be­cause AIXI is defined us­ing a frame­work that makes it unphysical

No. It’s fine to formalize something mathematically. People do it all the time. Math is a perfectly valid tool to investigate phenomena. The problem with AIXI proper is that it’s limited to a context in which the agent and environment are independent entities. There are actually problems where that is a decent approximation, but it would be better to have a more general formulation, like AIXI-tl, that can be applied to contexts in which an agent is embedded in its environment.

This ap­plies even to com­putable for­mu­la­tions of AIXI, such as AIXI-tl: they have no way to rep­re­sent the pos­si­bil­ity of be­ing simu­lated by oth­ers, be­cause they as­sume they are too large to fit in the uni­verse.

That’s sim­ply not true.

I’m not sure what ex­actly is so hard to un­der­stand about this, con­sid­er­ing the origi­nal post con­veyed all of these ideas fairly well. It may be worth con­sid­er­ing the as­sump­tions you’re op­er­at­ing un­der—and in par­tic­u­lar, mak­ing sure that the post it­self does not vi­o­late those as­sump­tions—be­fore crit­i­ciz­ing said post based on those as­sump­tions.

I didn’t make any as­sump­tions. I said what I be­lieve to be cor­rect.

I’d love to hear you or the author explain how an agent is supposed to make decisions about what to do in an environment if its agency is completely undefined.

I’d also love to hear your thoughts on the re­la­tion­ship be­tween math, sci­ence, and the real world if you think com­par­ing a phys­i­cal im­ple­men­ta­tion to a math­e­mat­i­cal for­mal­iza­tion is any more fruit­ful than com­par­ing ap­ples to or­anges.

Did you know that en­g­ineers use the “ideal gas law” ev­ery day to solve real-world prob­lems even though they know that no real-world gas ac­tu­ally fol­lows the “ideal gas law”?! You should go tell them that they’re do­ing it wrong!

• The con­cept is an ab­strac­tion.

*Yes, it is. The fact that it is an ab­strac­tion is pre­cisely why it breaks down un­der cer­tain cir­cum­stances.

An I/​O chan­nel doesn’t im­ply mod­ern com­puter tech­nol­ogy. It just means in­for­ma­tion is col­lected from or im­printed upon the en­vi­ron­ment. It could be ant pheromones, it could be smoke sig­nals, its phys­i­cal im­ple­men­ta­tion is sec­ondary to the ab­stract con­cept of send­ing and re­ceiv­ing in­for­ma­tion of some kind. You’re not see­ing the for­est through the trees. In­for­ma­tion most cer­tainly does ex­ist.

The claim is not that “in­for­ma­tion” does not ex­ist. The claim is that in­put/​out­put chan­nels are in fact an ab­strac­tion over more fun­da­men­tal phys­i­cal con­figu­ra­tions. Noth­ing you wrote con­tra­dicts this, so the fact that you seem to think what I wrote was some­how in­cor­rect is puz­zling.

I’ve explained in previous posts that AIXI is a special case of AIXI-tl. AIXI-tl can be conceived of in an embedded context,

Yes.

in which case its model of the world would include a model of itself which is subject to any sort of environmental disturbance

*No. AIXI-tl ex­plic­itly does not model it­self or seek to iden­tify it­self with any part of the Tur­ing ma­chines in its hy­poth­e­sis space. The very con­cept of self-mod­el­ing is en­tirely ab­sent from AIXI’s defi­ni­tion, and AIXI-tl, be­ing a var­i­ant of AIXI, does not in­clude said con­cept ei­ther.

To some extent, an agent must trust its own operation to be correct, because you quickly run into infinite regression if the agent is modeling all the possible ways that it could be malfunctioning. What if the malfunction affects the way it models the possible ways it could malfunction? It should model all the ways a malfunction could disrupt how it models all the ways it could malfunction, right? It’s like saying “well the agent could malfunction, so it should be aware that it can malfunction so that it never malfunctions”. If the thing malfunctions, it malfunctions, it’s as simple as that.

*This is correct, so far as it goes, but what you neglect to mention is that AIXI makes no attempt to preserve its own hardware. It’s not just a matter of “malfunctioning”; humans can “malfunction” as well. However, the difference between humans and AIXI is that we understand what it means to die, and go out of our way to make sure our bodies are not put in undue danger. Meanwhile, AIXI will happily allow its hardware to be destroyed in exchange for the tiniest increase in reward. I don’t think I’m being unfair when I suggest that this behavior is extremely unnatural, and is not the kind of thing most people intuitively have in mind when they talk about “intelligence”.

Aside from that, AIXI is meant to be a purely math­e­mat­i­cal for­mal­iza­tion, not a phys­i­cal im­ple­men­ta­tion. It’s an ab­strac­tion by de­sign. It’s meant to be used as a math­e­mat­i­cal tool for un­der­stand­ing in­tel­li­gence.

*Ab­strac­tions are use­ful for their in­tended pur­pose, noth­ing more. AIXI was for­mu­lated as an at­tempt to de­scribe an ex­tremely pow­er­ful agent, per­haps the most pow­er­ful agent pos­si­ble, and it serves that pur­pose ad­mirably so long as we re­strict anal­y­sis to prob­lems in which the agent and the en­vi­ron­ment can be cleanly sep­a­rated. As soon as that re­stric­tion is re­moved, how­ever, it’s ob­vi­ous that the AIXI for­mal­ism fails to cap­ture var­i­ous in­tu­itively de­sir­able be­hav­iors (e.g. self-preser­va­tion, as dis­cussed above). As a tool for rea­son­ing about agents in the real world, there­fore, AIXI is of limited use­ful­ness. I’m not sure why you find this idea ob­jec­tion­able; surely you un­der­stand that all ab­strac­tions have their limits?

Do you consider how the 30 Watts leaking out of your head might affect your plans every day? I mean, it might cause a typhoon in Timbuktu! If you don’t consider how the waste heat produced by your mental processes affects your environment while making long or short-term plans, you must not be a real intelligent agent...

In­deed, you are cor­rect that waste heat is not much of a fac­tor when it comes to hu­mans. How­ever, that does not mean that the same holds true for ad­vanced agents run­ning on pow­er­ful hard­ware, es­pe­cially if such agents are in­ter­act­ing with each other; who knows what can be de­duced from var­i­ous side out­puts, if a su­per­in­tel­li­gence is do­ing the de­duc­ing? Re­gard­less of the an­swer, how­ever, one thing is clear: AIXI does not care.

This seems to ad­dress the ma­jor­ity of your points, and the last few para­graphs of your com­ment seem mainly to be re­it­er­at­ing/​elab­o­rat­ing on those points. As such, I’ll re­frain from re­ply­ing in de­tail to ev­ery­thing else, in or­der not to make this com­ment longer than it already is. If you re­spond to me, you needn’t feel obli­gated to re­ply to ev­ery in­di­vi­d­ual point I made, ei­ther. I marked what I view as the most im­por­tant points of dis­agree­ment with an as­ter­isk*, so if you’re short on time, feel free to re­spond only to those.

• In the al­ter­na­tive al­gorithm for the five-and-ten prob­lem, why should we use the first proof that we find? How about this al­gorithm:

A2 :=
    Spend some time t searching for proofs of sentences of the form
        “A2() = a → U() = x”
    for a ∈ {5, 10}, x ∈ {0, 5, 10}.
    For each found proof and corresponding pair (a, x):
        if x > x*:
            a* := a
            x* := x
    Return a*


If this one searches long enough (de­pend­ing on how com­pli­cated U is), it will re­turn 10, even if the non-spu­ri­ous proofs are longer than the spu­ri­ous ones.
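The selection step of the algorithm above can be sketched in Python. The proof search itself is not modelled: `found_proofs` is a hypothetical list of the (a, x) pairs for which a proof of “A2() = a → U() = x” turned up within the time budget, and `default_action` is my assumption for what to return when nothing is found, since the pseudocode leaves that case open.

```python
def a2(found_proofs, default_action=5):
    """Return the action whose proved utility is highest.

    found_proofs: iterable of (a, x) pairs, one per proved sentence
        "A2() = a -> U() = x". Supplying these is the hard part and is
        assumed to happen elsewhere.
    default_action: assumed fallback when no proof was found.
    """
    best_action, best_utility = default_action, float("-inf")
    for action, utility in found_proofs:
        if utility > best_utility:
            best_action, best_utility = action, utility
    return best_action


# Short search: only the pair of proofs behind the spurious argument is found.
short_search = a2([(5, 5), (10, 0)])    # 5
# Longer search: the non-spurious proof for 10 is found as well.
long_search = a2([(5, 5), (10, 10)])    # 10
```

The sketch only shows that the choice depends on which proofs the search happens to surface within t; it says nothing about whether the longer, non-spurious proofs will actually be found.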

• I guess then it would have to prove that it will find a proof with x > 0 within t. This is difficult.