No, I won’t go there, it feels like you’re trying to Pascal-mug me

The goal of this blog post is to explore the intuition behind the feeling that an agent who concludes that they should pay Pascal’s mugger is being unreasonable, and to ask whether this intuition can be tied to the analysis of a logical inductor as an algorithm that is not exploitable by any efficiently computable trader. Analysing this intuition may help us design the decision theory of a future Friendly AI so that it is not vulnerable to being Pascal-mugged, which is plausibly a desirable goal from the point of view of obtaining behaviour aligned with our values.

So, then, consider the following conversation that I recently had with another person at a MIRI workshop.

“Do you know Amanda Askell’s work? She takes Pascal’s wager seriously. If there really are outcomes in the outcome space with infinite utility and non-infinitesimal probability, then that deserves your attention.”

“Oh, but that’s very Goodhartable.”

A closely related thought is that this line of reasoning would open you up to exploitation by agents not aligned with your core values. This is plausibly the main driver behind the intuition that paying Pascal’s mugger is unreasonable: a decision theory like that makes you too exploitable.
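To make that exploitability concrete, here is a toy sketch of how a naive expected-utility maximiser gets mugged. All the numbers are illustrative assumptions, not claims about anyone's actual credences; the point is only that the mugger can always name a payoff large enough to swamp an arbitrarily small credence.

```python
# Toy model of Pascal's mugging for a naive expected-utility maximiser.
# All numbers here are illustrative assumptions.

def naive_eu_of_paying(claimed_utility, credence_in_claim, cost_of_paying=5.0):
    """Expected utility of paying up, taking the mugger's story at face value."""
    return credence_in_claim * claimed_utility - cost_of_paying

# The mugger simply names a payoff huge enough to dominate any tiny credence.
credence = 1e-20          # astronomically small probability of the claim
claimed_utility = 1e30    # the mugger promises 10^30 utilons

if naive_eu_of_paying(claimed_utility, credence) > 0:
    decision = "pay the mugger"
else:
    decision = "refuse"

print(decision)  # the naive calculation says to pay
```

Since 1e-20 × 1e30 vastly exceeds the cost of paying, the naive calculation always recommends paying, and an adversary who knows you reason this way can extract resources from you at will.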

The work done by MIRI on the analysis of the notion of logical induction identifies the key desirable property of a logical inductor algorithm as producing a pattern of belief states, evolving over time, that isn’t exploitable by any efficiently computable trader.

Traders studying the history of your belief states won’t be able to identify any pattern that they can exploit with a polynomial-time computable algorithm: when such patterns emerge in the sentences you have already proved, you make the necessary “market correction” in your own belief states so that those patterns don’t make you exploitable. Thus, being a good logical inductor is related to having patterns of behaviour that are not exploitable. We see here some hint of a connection with the notion that being resilient to Pascal’s mugging is related to not being exploitable.
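As a toy illustration of the exploitability criterion (my own simplification, far below the actual logical-induction formalism): if an inductor's price for some sentence predictably oscillates, a trivial polynomial-time trader can buy low and sell high against it, and its profit grows without bound over time. The non-exploitability property rules out exactly this kind of milkable pattern.

```python
# Toy illustration of exploiting a predictable belief pattern.
# A "market" whose price for a sentence predictably alternates between
# 0.3 and 0.7 can be milked by a trivial buy-low/sell-high trader.
# (This is a deliberate simplification of the logical-induction setup.)

def oscillating_price(day):
    """A belief state that predictably alternates: cheap on even days."""
    return 0.3 if day % 2 == 0 else 0.7

profit = 0.0
for day in range(0, 100, 2):
    buy_price = oscillating_price(day)       # buy a share at 0.3
    sell_price = oscillating_price(day + 1)  # sell it back at 0.7
    profit += sell_price - buy_price

print(profit)  # profit grows linearly with time: the pattern is exploitable
```

A logical inductor, by contrast, would notice this pattern in its own belief history and correct the prices before any efficiently computable trader could accumulate unbounded gains from it.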

Plausibly a useful research project would be to explore whether some similar formalisable notion in an AI’s decision theory could capture the essence of the thought behind “Oh, Pascal’s-mugging reasoning looks too suspicious”. Perhaps the desired line of reasoning might be: “A regular habit of being persuaded by Pascal’s-mugging-style reasoning would make me vulnerable to exploitation by agents not aligned with my core values, independently of whether such exploitation is in fact likely to occur given my beliefs about the universe. If I can see that an argument for a particular conclusion about which action to take manifests that kind of vulnerability, I ought to invest more computing time before actually taking the action.” It is not hard to imagine that natural selection might have been a driver for such a heuristic, which may explain why Pascal’s-mugging-style reasoning “feels suspicious” to humans even when they haven’t yet constructed the decision-theoretic argument that yields the conclusion not to pay the mugger. If we can build a similar heuristic into an AI’s decision theory, it may help to filter out “unintended interpretations” of what kind of behaviour would optimise what humans care about, such as large-scale wireheading, or investing all of one’s computing resources into figuring out a way to hack the laws of physics to produce infinite utility despite a very low probability of a positive pay-off.
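One crude way such a heuristic might look inside a decision procedure is sketched below. This is entirely my own illustration, not a proposal from the literature: the thresholds and the `deliberate_further` hook are hypothetical placeholders for whatever a real formalisation would supply.

```python
# Hypothetical "mugging-shaped argument" detector: if an argument's force
# comes from a tiny probability multiplied by an enormous payoff, don't act
# yet -- escalate to further deliberation instead. Thresholds are arbitrary
# assumptions chosen for illustration only.

PROB_THRESHOLD = 1e-9      # "suspiciously small" credence (assumed)
UTILITY_THRESHOLD = 1e12   # "suspiciously large" payoff (assumed)

def looks_like_a_mugging(probability, utility):
    """Flag arguments whose expected value rests on extreme tails."""
    return probability < PROB_THRESHOLD and abs(utility) > UTILITY_THRESHOLD

def decide(probability, utility, act, deliberate_further):
    """Act on an argument unless it is mugging-shaped, in which case
    spend more compute before committing to the action."""
    if looks_like_a_mugging(probability, utility):
        return deliberate_further()
    if probability * utility > 0:
        return act()
    return "do nothing"

result = decide(1e-20, 1e30,
                act=lambda: "pay",
                deliberate_further=lambda: "think more")
print(result)  # the mugging-shaped argument triggers further deliberation
```

The key design choice the sketch illustrates is that the flag does not change the agent's credences or utilities; it only gates how much computation is spent before the argument is allowed to drive an action.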

We also see here some hint of a connection between the insights that drove the “correct solution” to what a logical inductor should be and the “correct solution” to what the best decision theory should be. Exploring whether this connection generalises to other domains may also deserve further examination.