# Decision Theory

(A longer text-based version of this post is also available on MIRI’s blog here, and the bibliography for the whole sequence can be found here.)

The next post in this sequence, ‘Embedded Agency’, will come out on Friday, November 2nd.

Tomorrow’s AI Alignment Forum sequences post will be ‘What is Ambitious Value Learning?’ in the sequence ‘Value Learning’.

• Cross-posting some comments from the MIRI Blog:

Konstantin Surkov:

Re: 5-10 problem

I don’t get it. A human is obviously (in that regard) an agent reasoning about his actions. A human will also choose the $10 without any difficulty. What in the human decision-making process is not formalizable here? Assuming we agree that the $10 is the rational choice.

Abram Demski:

Suppose you know that you take the $10. How do you reason about what would happen if you took the $5 instead? It seems easy if you know how to separate yourself from the world, so that you only think of external consequences (getting $5). If you think about yourself as well, then you run into contradictions when you try to imagine the world where you take the $5, because you know it is not the sort of thing you would do. Maybe you have some absurd predictions about what the world would be like if you took the $5; for example, you imagine that you would have to be blind. That’s alright, though, because in the end you are taking the $10, so you’re doing fine.

Part of the point is that an agent can be in a similar position, except it is taking the $5, knows it is taking the $5, and is unable to figure out that it should be taking the $10 instead, due to the absurd predictions it makes about what happens when it takes the $10. It seems kind of hard for a human to end up in that situation, but it doesn’t seem so hard to get this sort of thing when we write down formal reasoners, particularly when we let them reason about themselves fully (as natural parts of the world) rather than only reasoning about the external world or having pre-programmed divisions (so they reason about themselves in a different way from how they reason about the world).
• Sure, one can imagine hypothetically taking the $5, even if in reality they would take the $10. That’s a spurious output from a different algorithm altogether: it assumes a world where you are not the same person who takes the $10. So, it would make sense to examine which of the two you are if you don’t yet know that you will take the $10, but not if you already know it. Which of the two is it?

• I’m not convinced that an inconsequential grain of uncertainty couldn’t handle this 5-10 problem. Consider an agent whose actions are probability distributions on {5,10} that are nowhere 0. We can call these points in the open affine space spanned by the points 5 and 10. U is then a linear function from this affine space to utilities. The agent would search for proofs that U is some particular such linear function. Once it finds one, it uses that linear function to compute the optimal action. To ensure that there is an optimum, we can adjoin infinitesimal values to the possible probabilities and utilities.

If the agent were to find a proof that the linear function is the one induced by mapping 5 to 5 and 10 to 0, it would return (1−ε)⋅5 + ε⋅10 and get utility 5+5ε instead of the expected 5−5ε, so Löb’s theorem wouldn’t make this self-fulfilling.
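The arithmetic in this comment can be checked mechanically. Below is a minimal sketch (my own construction, not the commenter’s) that treats ε as a formal infinitesimal, keeping only first-order terms, and verifies the 5+5ε versus 5−5ε claim:

```python
# Numbers of the form a + b*eps, where eps is an infinitesimal whose
# square we discard. Comparison is lexicographic: standard part first.
class Inf:
    def __init__(self, std, inf=0.0):
        self.std, self.inf = std, inf

    def __add__(self, other):
        return Inf(self.std + other.std, self.inf + other.inf)

    def __mul__(self, other):
        # (a + b*eps)(c + d*eps) = ac + (ad + bc)*eps + O(eps^2)
        return Inf(self.std * other.std,
                   self.std * other.inf + self.inf * other.std)

    def __gt__(self, other):
        return (self.std, self.inf) > (other.std, other.inf)

    def __repr__(self):
        return f"{self.std} + {self.inf}ε"

eps = Inf(0, 1)
p5, p10 = Inf(1, -1), eps            # the mixed action (1-ε)·5 + ε·10

# True utilities: taking 5 pays 5, taking 10 pays 10.
actual = p5 * Inf(5) + p10 * Inf(10)     # 5 + 5ε
# Utilities under the spurious proof mapping 10 to 0.
spurious = p5 * Inf(5) + p10 * Inf(0)    # 5 - 5ε

print(actual, spurious, actual > spurious)
```

So an agent acting on the spuriously proven function would still receive 5+5ε rather than the proven 5−5ε, which is the mismatch that blocks the proof from being self-fulfilling.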

• So, your suggestion is not just an inconsequential grain of uncertainty, it is a grain of exploration. The agent actually does take the 10 with some small probability. If you try to do this with just uncertainty, things would be worse, since that uncertainty would not be justified.

One problem is that you actually do explore a bunch, and since you don’t get a reset button, you will sometimes explore into irreversible actions, like shutting yourself off. However, if the agent has a source of randomness, and also the ability to simulate worlds in which that randomness went another way, you can have an agent that with probability 1 does not ever explore, and learns from the other worlds in which it does explore. So, you can either explore forever, and eventually shut yourself off, or you can explore very very rarely and learn from other possible worlds.

The problem with learning from other possible worlds is that to get good results out of it, you have to assume that the environment does not also learn from other possible worlds, which is not very embedded.

But you are suggesting actually exploring a bunch, and there is a problem other than just shutting yourself off. You are getting past this problem in this case by only allowing linear functions, but that is not an accurate assumption. Let’s say you are playing matching pennies with Omega, who has the ability to predict what probability you will pick but not what action you will pick.

(In matching pennies, you each choose H or T; you win if they match, they win if they don’t.)

Omega will pick H if your probability of H is less than 1/2, and T otherwise. Your utility as a function of your probability is piecewise linear with two parts. Trying to assume that it will be linear will make things messy.
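To see the non-linearity concretely, here is a sketch of that utility curve. The payoff numbers (10 for a win, 5 for a loss) are my own assumption, chosen to match the U = 7.5 figure that appears later in the thread:

```python
def matching_pennies_utility(p):
    """Utility of announcing probability p of H against this Omega.

    Omega plays H when p < 1/2 and T otherwise; you win on a match.
    Assumed payoffs: 10 for a win, 5 for a loss.
    """
    if p < 0.5:   # Omega plays H; you match with probability p
        return 10 * p + 5 * (1 - p)        # = 5 + 5p
    else:         # Omega plays T; you match with probability 1 - p
        return 10 * (1 - p) + 5 * p        # = 10 - 5p

# Piecewise linear, not linear: the chord midpoint undershoots the peak.
print(matching_pennies_utility(0.5))                 # 7.5, the maximum
print((matching_pennies_utility(0.25)
       + matching_pennies_utility(0.75)) / 2)        # 6.25
```

Any single linear hypothesis about U is falsified on one side of p = 1/2, which is the sense in which assuming linearity makes things messy here.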

There is this problem where sometimes the outcome of exploring into taking the 10, and the outcome of actually taking the 10 because it is good, are different. More on this here.

• I am talking about the surreal number ε, which is smaller than any positive real. Events of likelihood ε do not actually happen; we just keep them around so the counterfactual reasoning does not divide by 0.

Within the simulation, the AI might be able to conclude that it just made an ε-likelihood decision and must therefore be in a counterfactual simulation. It should of course carry on as it were, in order to help the simulating version of itself.

Why shouldn’t the environment be learning?

To the Omega scenario I would say that since we have an Omega-proof random number generator, we get new strategic options that should be included in the available actions. The linear function then goes from the ε-adjoined open affine space generated by {Pick H with probability p | p real, non-negative and at most 1} to the ε-adjoined utilities, and we correctly solve Omega’s problem by using p = 1/2.

• Yeah, so it’s like you have this private data, which is an infinite sequence of bits, and if you see all 0s you take an exploration action. I think that by giving the agent these private bits and promising that the bits do not change the rest of the world, you are essentially giving the agent access to a causal counterfactual that you constructed. You don’t even have to mix with what the agent actually does; you can explore with every action and ask whether it is better to explore and take 5 or explore and take 10. By doing this, you are essentially giving the agent access to a causal counterfactual, because conditioning on these infinitesimals is basically like coming in and changing what the agent does. I think giving the agent a true source of randomness actually does let you implement CDT.

If the environment learns from the other possible worlds, it might punish or reward you in one world for stuff that you do in the other world, so you can’t just ask which world is best to figure out what to do.

I agree that that is how you want to think about the matching pennies problem. However, the point is that your proposed solution assumed linearity. It didn’t empirically observe linearity. You have to be able to tell the difference between the situations in order to know not to assume linearity in the matching pennies problem. The method for telling the difference is how you determine whether or not, and in what ways, you have logical control over Omega’s prediction of you.

• I posit that linearity always holds. In a deterministic universe, the linear function is between the ε-adjoined open affine space generated by our primitive set of actions and the ε-adjoined utilities. (Like in my first comment.)

In a probabilistic universe, the linear function is between the ε-adjoined open affine space generated by (the set of points in) the closed affine space generated by our primitive set of actions and the ε-adjoined utilities. (Like in my second comment.)

I got from one of your comments that assuming linearity wards off some problem. Does it come back in the probabilistic-universe case?

• My point was that I don’t know where to assume the linearity is. Whenever I have private randomness, I have linearity over what I end up choosing with that randomness, but not linearity over what probability I choose. But I think this is not getting at the disagreement, so I pivot to:

In your model, what does it mean to prove that U is some linear affine function? If I prove that my probability p is 1/2 and that U = 7.5, have I proven that U is the constant function 7.5? If there is only one value of p, it is not defined what the utility function is, unless I successfully carve the universe in such a way as to let me replace the action with various things and see what happens (or, assuming linearity, replace the probability with enough linearly independent things (in this case 2) to define the function).

• In the matching pennies game, U would be proven to be the piecewise function p ↦ 5 + 5p for p < 1/2 and 10 − 5p otherwise (with a win worth 10 and a loss worth 5, matching the 7.5 above), which is maximized at p = 1/2. When the proven function is linear, A could maximize it by returning ε when an action isn’t favored, and 1 − ε (where ε is so small that this is still infinitesimally close to 1) when it is.

The linearity is always in the function between ε-adjoined open affine spaces. Whether the utilities also end up linear in the closed affine space (i.e., nobody cares about our reasoning process) is for the object-level information gathering process to deduce from the environment.

You never prove that you will with certainty decide p = 1/2. You always leave a so-you’re-saying-there’s-a-chance of exploration, which produces a grain of uncertainty. To execute the action, you inspect the ceremonial Boltzmann Bit (which is implemented by being constantly set to “discard the ε”), but which you treat as having an ε chance of flipping.

The self-modification module could note that inspecting that bit is a no-op, see that removing it would make the counterfactual reasoning module crash, and leave up the Chesterton fence.

• But how do you avoid proving with certainty that p = 1/2?

Since your proposal does not say what to do if you find inconsistent proofs that the linear function is two different things, I will assume for the following that if it finds multiple different proofs, it defaults to 5.

Here is another example:

You are in a 5-and-10 problem. You have a twin that is also in a 5-and-10 problem. You have exactly the same source code. There is a consistency checker, and if you and your twin do different things, you both get 0 utility.

You can prove that you and your twin do the same thing. Thus you can prove that the function is 5 + 5p. You can also prove that your twin takes 5 by Löb’s theorem. (You can also prove that you take 5 by Löb’s theorem, but you ignore that proof, since “there is always a chance.”) Thus, you can prove that the function is 5 − 5p. Your system doesn’t know what to do with two functions, so it defaults to 5. (If it is provable that you both take 5, you both take 5, completing the proof by Löb.)
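The two incompatible “proven” functions can be spelled out numerically. A small sketch (my own, with p the probability you take 10, and a mismatch paying 0):

```python
def u_if_twin_copies_me(p):
    # If the twin provably does what I do, mismatches never happen:
    # take 10 with probability p (worth 10), take 5 otherwise (worth 5).
    return 10 * p + 5 * (1 - p)    # = 5 + 5p

def u_if_twin_takes_5(p):
    # If the twin provably takes 5, taking 10 mismatches and pays 0:
    return 0 * p + 5 * (1 - p)     # = 5 - 5p

# The two "proofs" agree only at p = 0, i.e. only if you certainly take 5.
for p in (0.0, 0.5, 1.0):
    print(p, u_if_twin_copies_me(p), u_if_twin_takes_5(p))
```

Defaulting to 5 upon seeing both proofs makes p = 0 (up to the ε of exploration) the actual policy, which is exactly what lets Löb’s theorem close the loop.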

I am doing the same thing as before, but because I put it outside of the agent, it does not get flagged by the “there is always a chance” module. This is trying to illustrate that your proposal takes advantage of a separation between the agent and the environment that was snuck in, and could be done incorrectly.

Two possible fixes:

1) You could say that the agent, instead of taking 5 when finding an inconsistency, takes some action that exhibits the inconsistency (something to which the two functions give different values). This is very similar to the chicken rule, and if you add something like this, you don’t really need the rest of your system. Take an agent that, whenever it proves it does something, does something else: this agent will prove (given enough time) that if it takes 5 it gets 5, and if it takes 10 it gets 10.

2) I had one proof system, and just ignored the proofs I found that I did a thing. I could instead give the agent a special proof system that is incapable of proving what it does, but how do you do that? The chicken rule seems like the place to start.

One problem with the chicken rule is that it was developed in a system that was deductively closed, so you can’t prove something that passes through a proof of P without proving P. If you violate this, by having a random theorem prover, you might have a system that fails to prove “I take 5” but proves “I take 5 and 1+1=2” and uses this to complete the Löb loop.
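The role of the chicken rule in fix 1) can be illustrated in a toy model (my own construction, far weaker than real provability logic): here “provable” just means “entailed by a finite axiom set over worlds (action, utility) ∈ {5,10} × {0,5,10}”. A spurious self-knowledge axiom makes “if I take 10, I get 0” vacuously provable; the chicken rule reacts by contradicting that axiom:

```python
from itertools import product

WORLDS = [(a, u) for a, u in product((5, 10), (0, 5, 10))]

def entails(axioms, sentence):
    """Sentence holds in every world satisfying all axioms."""
    return all(sentence(a, u) for a, u in WORLDS
               if all(ax(a, u) for ax in axioms))

# The actual environment: taking action a pays a.
world = [lambda a, u: u == a]
# A spurious piece of self-knowledge: "I take 5."
self_knowledge = [lambda a, u: a == 5]

def provable_payoffs(axioms, action):
    # All u for which "I take action -> utility is u" is provable.
    return {u for u in (0, 5, 10)
            if entails(axioms, lambda a, w: a != action or w == u)}

def choose(axioms, chicken=True):
    # Chicken rule: if the axioms prove my action, do something else.
    if chicken:
        for action, other in ((5, 10), (10, 5)):
            if entails(axioms, lambda a, u, action=action: a == action):
                return other
    # Otherwise act on the worst provable guarantee for each action.
    return max((5, 10), key=lambda act: min(provable_payoffs(axioms, act)))

print(choose(world))                                  # 10: sane counterfactuals
print(choose(world + self_knowledge, chicken=False))  # 5: spurious "10 -> 0" proof
print(choose(world + self_knowledge))                 # 10: chicken rule fires
```

This doesn’t capture the genuinely hard part (Löbian self-reference inside the proof system), but it shows the mechanism: with “I take 5” among the axioms, every statement about taking 10 is vacuously provable, and the chicken rule is what stops such self-knowledge from ever being consistently proven.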

• I can’t prove what I’m going to do, and I can’t prove that I and the twin are going to do the same thing, because of the Boltzmann Bits in both of our decision-makers that might turn out different ways. But I can prove that we have a (1−ε)² + ε² chance of doing the same thing, and my expected utility is 10(1−ε)² + 5ε², rounding to 10 once it actually happens.
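Under the reading that each twin independently explores with probability ε, this algebra can be checked with exact polynomial arithmetic in ε (the coefficient-list representation is my own sketch):

```python
# Polynomials in ε as coefficient lists: [c0, c1, c2] = c0 + c1·ε + c2·ε².
def pmul(f, g):
    out = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            out[i + j] += a * b
    return out

def padd(f, g):
    n = max(len(f), len(g))
    f, g = f + [0] * (n - len(f)), g + [0] * (n - len(g))
    return [a + b for a, b in zip(f, g)]

def pscale(c, f):
    return [c * a for a in f]

take10 = [1, -1]   # each agent takes 10 with probability 1 - ε
take5 = [0, 1]     # ... and explores into 5 with probability ε

# Probability both act alike: (1-ε)² + ε² = 1 - 2ε + 2ε²
same = padd(pmul(take10, take10), pmul(take5, take5))
# Expected utility: 10·(1-ε)² + 5·ε², since mismatches pay 0.
eu = padd(pscale(10, pmul(take10, take10)), pscale(5, pmul(take5, take5)))

print(same)  # coefficients of 1 - 2ε + 2ε²
print(eu)    # coefficients of 10 - 20ε + 15ε²
```

The standard part of the expected utility is 10, which is the “rounding to 10 once it actually happens”.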

• Content feedback:

The Preface to the Sequence on Value Learning contains the following advice on research directions for that sequence:

> If you try to disprove the arguments in the posts, or to create formalisms that sidestep the issues brought up, you may very well generate a new interesting direction of work that has not been considered before.

This provides specific direction on what to look at and what work needs to be done. If such a statement for this sequence is possible, I think it would be valuable to include.

• If you know your own actions, why would you reason about taking different actions? Wouldn’t you reason about someone who is almost like you, but just different enough to make a different choice?

• Sure. How do you do that?

• Notice (well, you already know that) that accepting that identical agents make identical decisions (superrationality, as it were), and that to make different decisions in identical circumstances the agents must necessarily be different, gets you out of many pickles. For example, in the 5&10 game an agent would examine its own algorithm, see that it leads to taking the $10, and stop there. There is no “what would happen if you took a different action”, because the agent taking a different action would not be you, not exactly. So, no Löbian obstacle. In return, you give up something a lot more emotionally valuable: the delusion of making conscious decisions. Pick your poison.

• > For example, in the 5&10 game an agent would examine its own algorithm, see that it leads to taking the $10 and stop there.

Why do even that much if this reasoning could not be used? The question is about the reasoning that could contribute to the decision, that could describe the algorithm, and so has the option to not “stop there”. What if you see that your algorithm leads to taking the $10 and, instead of stopping there, you take the $5?

Nothing stops you. This is the “chicken rule”, and it solves some issues, but more importantly it illustrates a possibility for how a decision algorithm can function. The fact that this is a thing is evidence that there may be something wrong with the “stop there” proposal. Specifically, you usually don’t know that your reasoning is actual, that it’s even logically possible and not part of an impossible counterfactual, but this is not a hopeless hypothetical where nothing matters. Nothing compels you to affirm what you know about your actions or conclusions; this is not a necessity in a decision-making algorithm. But different things you do may have an impact on what happens, because the situation may be actual after all, depending on what happens or what you decide, or it may be predicted from within an actual situation and influence what happens there. This motivates learning to reason in and about possibly impossible situations.

What if you examine your algorithm and find that it takes the $5 instead? It could be the same algorithm that takes the $10, but you don’t know that; instead you arrive at the $5 conclusion using reasoning that could be impossible, but that you don’t know to be impossible, that you haven’t decided yet to make impossible. One way to solve the issue is to render the situation where that holds impossible, by contradicting the conclusion with your action, or in some other way. To know when to do that, you should be able to reason about and within such situations that could be impossible, or could be made impossible, including by the decisions made in them. This makes the way you reason in them relevant, even when in the end these situations don’t occur, because you don’t a priori know that they don’t occur. (The 5-and-10 problem is not specifically about this issue, and explicit reasoning about impossible situations may be avoided, perhaps should be avoided, but my guess is that the crux in this comment thread is about things like usefulness of reasoning from within possibly impossible situations, where even your own knowledge arrived at by pure computation isn’t necessarily correct.)

• Thank you for your explanation! Still trying to understand it. I understand that there is no point examining one’s algorithm if you already execute it and see what it does.

> What if you see that your algorithm leads to taking the $10 and instead of stopping there, you take the $5?

I don’t understand that point. You say “nothing stops you”, but that is only possible if you could act contrary to your own algorithm, no? Which makes no sense to me, unless the same algorithm gives different outcomes for different inputs, e.g. “if I simply run the algorithm, I take the $10, but if I examine the algorithm before running it and then run it, I take the $5”. But it doesn’t seem like the thing you mean, so I am confused.

> What if you examine your algorithm and find that it takes the $5 instead?

How can that be possible? If your examination of your algorithm is accurate, it gives the same outcome as mindlessly running it, which is taking the $10, no?

> It could be the same algorithm that takes the $10, but you don’t know that, instead you arrive at the $5 conclusion using reasoning that could be impossible, but that you don’t know to be impossible, that you haven’t decided yet to make impossible.

So your reasoning is inaccurate, in that you arrive at a wrong conclusion about the algorithm’s output, right? You just don’t know where the error lies, or even that there is an error to begin with. But in this case you would arrive at a wrong conclusion about the same algorithm run by a different agent, right? So there is nothing special about it being your own algorithm and not someone else’s. If so, the issue is reduced to finding an accurate algorithm analysis tool, for an algorithm that demonstrably halts in a very short time, producing one of the two possible outcomes. This seems to have little to do with decision theory issues, so I am lost as to how this is relevant to the situation. I am clearly missing some of your logic here, but I still have no idea what the missing piece is, unless it’s the libertarian free will thing, where one can act contrary to one’s programming. Any further help would be greatly appreciated.

• > I understand that there is no point examining one’s algorithm if you already execute it and see what it does.

Rather, there is no point if you are not going to do anything with the results of the examination. It may be useful if you make the decision based on what you observe (about how you make the decision).

> you say “nothing stops you”, but that is only possible if you could act contrary to your own algorithm, no?

You can, for a certain value of “can”. It won’t have happened, of course, but you may still decide to act contrary to how you act, two different outcomes of the same algorithm. The contradiction proves that you didn’t face the situation that triggers it in actuality, but the contradiction results precisely from deciding to act contrary to the observed way in which you act, in a situation that a priori could be actual, but is rendered counterlogical as a result of your decision. If instead you affirm the observed action, then there is no contradiction, and so it’s possible that you have faced the situation in actuality. Thus the “chicken rule”: playing chicken with the universe, making the present situation impossible when you don’t like it.

> So your reasoning is inaccurate

You don’t know that it’s inaccurate; you’ve just run the computation and it said $5. Maybe this didn’t actually happen, but you are considering this situation without knowing if it’s actual. If you ignore the computation, then why run it? If you run it, you need responses to all possible results, and all possible results except one are not actual, yet you should be ready to respond to them without knowing which is which. So I’m discussing what you might do for the result that says that you take the $5. And in the end, the use you make of the results is by choosing to take the $5 or the $10. This map from predictions to decisions could be anything. It’s trivial to write an algorithm that includes such a map. Of course, if the map diagonalizes, then the predictor will fail (won’t give a prediction), but the map is your reasoning in these hypothetical situations, and the fact that the map may say anything corresponds to the fact that you may decide anything. The map doesn’t have to be identity; decision doesn’t have to reflect prediction, because you may write an algorithm where it’s not identity.
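The “map from predictions to decisions” described here is indeed trivial to write down. A sketch (my own illustration): the predictor looks for a fixed point of the agent’s prediction-to-decision map, and fails exactly when the map diagonalizes:

```python
ACTIONS = (5, 10)

def predict(decision_map):
    """Return the prediction the agent would confirm, if any.

    A prediction is only coherent if announcing it yields it back,
    i.e. it is a fixed point of the map; otherwise the predictor
    gives no prediction.
    """
    for guess in ACTIONS:
        if decision_map(guess) == guess:
            return guess
    return None

affirm = lambda predicted: predicted                    # identity map
defy = lambda predicted: 5 if predicted == 10 else 10   # diagonalizing map

print(predict(affirm))   # 5: the first fixed point found
print(predict(defy))     # None: the predictor fails
```

This matches the comment: the map need not be the identity, and when it diagonalizes (the chicken rule’s move) there is no consistent prediction at all.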
• > You can, for a certain value of “can”. It won’t have happened, of course, but you may still decide to act contrary to how you act, two different outcomes of the same algorithm.

This confuses me even more. You can imagine acting contrary to your own algorithm, but imagining different possible outcomes is a side effect of running the main algorithm that takes the $10. It is never the outcome of it. Or an outcome. Since you know you will end up taking the $10, I also don’t understand the idea of playing chicken with the universe. Are there any references for it?

> You don’t know that it’s inaccurate, you’ve just run the computation and it said $5.

Wait, what? We started with the assumption that examining the algorithm, or running it, shows that you will take the $10, no? I guess I still don’t understand how

> What if you see that your algorithm leads to taking the $10 and instead of stopping there, you take the $5?

is even possible, or worth considering.

> This map from predictions to decisions could be anything.

Hmm, maybe this is where I miss some of the logic. If the predictions are accurate, the map is bijective. If the predictions are inaccurate, you need a better algorithm analysis tool.

> The map doesn’t have to be identity, decision doesn’t have to reflect prediction, because you may write an algorithm where it’s not identity.

To me this screams “get a better algorithm analyzer!” and has nothing to do with whether it’s your own algorithm or someone else’s. Can you maybe give an example where one ends up in a situation where there is no obvious algorithm analyzer one can apply?

• It was not until reading this that I really understood that I am in the habit of reasoning about myself as just a part of the environment.

• The kicker is that we don’t reason directly about ourselves as such; we use a simplified model of ourselves. And we’re REALLY GOOD at using that model for causal reasoning, even when it is reflective, and involves multiple levels of self-reflection and counterfactuals—at least when we bother to try. (We try rarely because explicit modelling is cognitively demanding, and we usually use defaults / conditioned reasoning. Sometimes that’s OK.)

Example: It is 10PM. A 5-page report is due in 12 hours, at 10AM.

Default: Go to sleep at 1AM, set alarm for 8AM. Result: Don’t finish the report tonight, have too little time to do so tomorrow.

Conditioned reasoning: Stay up to finish the report first. 5 hours of work, and stay up until 3AM. Result? Write a bad report, and still feel exhausted the next day.

Counterfactual reasoning: I should nap / get some amount of sleep so that I am better able to concentrate, which will outweigh the lost time. I could set my alarm for any amount of time; what amount does my model of myself imply will lead to an optimal well-rested / sufficient-time trade-off?

Self-reflection problem, second use of mini-self model: I’m worse at reasoning at 1AM than I am at 10PM. I should decide what to do now, instead of delaying until then. I think going to sleep at 12AM and waking at 3AM gives me enough rest and time to do a good job on the report.

Consider counterfactual and impact: How does this impact the rest of my week’s schedule? 3 hours is locally optimal, but I will crash tomorrow, and I have a test to study for the next day. Decide to work a bit, go to sleep at 12:30, and set alarm for 5:30AM. Finish the report, turn it in by 10AM, then nap another 2 hours before studying.

We built this model based not only on small samples of our own history, but on learning from others, incorporating data from seeing other people’s experiences. We don’t consider staying up all night and then driving to hand in the report, because we realize exhausted driving is dangerous—because we heard stories of people doing so, and know that we would be similarly unsteady. Is a person going to explore and try different strategies by staying up all night and driving? If you die, you can’t learn from the experience—so you have good ideas about what parts of the exploration space are safe to try. You might use Adderall because it’s been tried before and is relatively safe, but you don’t ingest arbitrary drugs to see if they help you think.

BUT an AI doesn’t (at first) have that sample data to reason from, nor does a singleton have observation of other near-structurally-identical AI systems and the impacts of their decisions, nor does it have a fundamental understanding of what is safe to explore.

• Content feedback: the inferential distance between Löb’s theorem and spurious counterfactuals seems larger than that of the other points. Maybe that’s because I haven’t internalised the theorem, not being a logician and all.

Unnecessary nitpick: the gears in the robot’s brain would turn just fine as drawn: since the outer gears are both turning anticlockwise, the inner gear would just turn clockwise. (I think my inner engineer is showing.)

• ## Thoughts on counterfactual reasoning

These examples of counterfactuals are presented as equivalent, but they seem meaningfully distinct:

> What if the sun suddenly went out?
> What if 2+2=3?

Specifically, they don’t seem equally difficult for me to evaluate. I can easily imagine the sun going out, but I’m not even sure what it would mean if 2+2=3. It confuses me that these two different examples are presented as equivalent, because they seem to be instances of meaningfully distinct classes of something.

I spent some time trying to characterize why the sun example is intuitively easy for me and the math example is intuitively difficult for me. I came up with some ideas, but I won’t go into details yet, because they seem like the obvious sorts of things that anyone who has read The Sequences (a.k.a., Rationality: A-Z) would have thought of. I strongly suspect there’s prior work. It is also possible that I don’t fully understand the problem yet.