No, I won’t go there, it feels like you’re trying to Pascal-mug me

The goal of this blog post is to explore the intuition that an agent who concludes that they should pay Pascal’s mugger is being unreasonable, and to ask whether that intuition can be tied in with the analysis of a logical inductor as an algorithm not exploitable by any efficiently computable trader. An analysis of this intuition may help us to design the decision theory of a future Friendly AI so that it is not vulnerable to being Pascal-mugged, which is plausibly a desirable goal from the point of view of obtaining behaviour in alignment with our values.

So, then, consider the following conversation that I recently had with another person at a MIRI workshop.

“Do you know Amanda Askell’s work? She takes Pascal’s wager seriously. If there really are outcomes in the outcome space with infinite utility and non-infinitesimal probability, then that deserves your attention.”

“Oh, but that’s very Goodhartable.”

A closely related thought is that this line of reasoning would open you up to exploitation by agents not aligned with your core values. This is plausibly the main driver behind the intuition that paying Pascal’s mugger is unreasonable: a decision theory like that makes you too exploitable.
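To make the exploitability concrete, here is a minimal sketch, in Python with invented numbers, of how a naive expected-utility maximiser hands over its money to anyone who names a large enough payoff; the function names and figures are purely illustrative.

```python
# Toy illustration: a naive expected-utility maximiser versus Pascal's mugger.
# The utilities, probabilities and costs below are invented for illustration only.

def expected_utility(outcomes):
    """Sum of probability-weighted utilities over (probability, utility) pairs."""
    return sum(p * u for p, u in outcomes)

def naive_agent_pays(promised_utility, credence_mugger_honest, cost_of_paying):
    """Pay iff the expected utility of paying exceeds that of refusing."""
    ev_pay = expected_utility([
        (credence_mugger_honest, promised_utility - cost_of_paying),
        (1 - credence_mugger_honest, -cost_of_paying),
    ])
    ev_refuse = 0.0
    return ev_pay > ev_refuse

# The mugger simply names a payoff large enough to swamp any fixed non-zero
# credence, so the agent pays no matter how implausible the story is.
print(naive_agent_pays(promised_utility=1e30,
                       credence_mugger_honest=1e-20,
                       cost_of_paying=10))   # True -> the agent gets mugged
```

Since the counterparty controls the promised payoff, it can always inflate the number until it swamps any fixed non-zero credence, so an agent running this policy can be drained repeatedly.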

The work done by MIRI on the analysis of the notion of logical induction identifies the key desirable property of a logical inductor algorithm: it produces a pattern of belief states, evolving over time, that isn’t exploitable by any efficiently computable trader.

Traders studying the history of your belief states won’t be able to identify any pattern that they can exploit with a polynomial-time computable algorithm: whenever such a pattern begins to emerge in the sentences you have already proved, you make the necessary “market correction” in your own belief states, so that the pattern doesn’t leave you exploitable. Thus, being a good logical inductor is related to having patterns of behaviour that are not exploitable. We see here some hint of a connection with the notion that being resilient to Pascal’s mugging is related to not being exploitable.
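For readers who want the formal anchor, here is a rough LaTeX rendering of the exploitability notion from the logical induction paper; the notation is simplified and paraphrased from memory, so consult Garrabrant et al. (2016) for the exact statement.

```latex
% Rough paraphrase of the logical induction criterion; see Garrabrant et al.
% (2016) for the precise statement.  A market \overline{P} = (P_1, P_2, \ldots)
% assigns credences to logical sentences at each time step n.  A trader
% "exploits" the market if the value of its holdings over time is bounded
% below but unbounded above: it risks only a bounded amount while being able
% to gain arbitrarily much by betting against the market's prices.
\[
  \overline{P} \text{ is a logical inductor}
  \quad\Longleftrightarrow\quad
  \text{no polynomial-time computable trader exploits } \overline{P}.
\]
```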

Plausibly a useful research project would be to explore whether some similar formalisable notion in an AI’s decision theory could capture the essence of the thought behind “Oh, Pascal’s-mugging reasoning looks too suspicious”. Perhaps the desired line of reasoning might be: “A regular habit of being persuaded by Pascal-mugging-style reasoning would make me vulnerable to exploitation by agents not aligned with my core values, independently of whether such exploitation is in fact likely to occur given my beliefs about the universe. If I can see that an argument for a particular conclusion about which action to take manifests that kind of vulnerability, I ought to invest more computing time before actually taking the action.”

It is not hard to imagine that natural selection might have been a driver for such a heuristic, and this may explain why Pascal-mugging-style reasoning “feels suspicious” to humans even when they haven’t yet constructed the decision-theoretic argument that yields the conclusion not to pay the mugger. If we can build a similar heuristic into an AI’s decision theory, this may help to filter out “unintended interpretations” of what kind of behaviour would optimise what humans care about, such as large-scale wireheading, or investing all of its computing resources into figuring out some way to hack the laws of physics to produce infinite utility despite a very low probability of a positive pay-off.
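As a toy illustration of what such a heuristic might look like inside a decision procedure, here is a sketch in Python; the structure, thresholds and names are invented for the example and are not a proposal for the actual mechanism.

```python
# Toy sketch of a "this argument looks like a mugging" filter.
# Thresholds, field names and the escalation policy are invented for
# illustration; a real decision theory would need something more principled.

from dataclasses import dataclass

@dataclass
class Argument:
    action: str
    claimed_utility: float     # payoff the argument promises
    credence: float            # probability the agent assigns to the claim
    supplied_by_other_agent: bool

def looks_like_a_mugging(arg, utility_threshold=1e12, credence_threshold=1e-6):
    """Flag arguments whose force comes almost entirely from a huge payoff
    attached to a tiny credence, especially when another agent supplied them."""
    return (arg.claimed_utility >= utility_threshold
            and arg.credence <= credence_threshold
            and arg.supplied_by_other_agent)

def decide(arg):
    if looks_like_a_mugging(arg):
        # Don't act yet: a habit of accepting arguments of this shape would be
        # exploitable, independently of whether this particular counterparty
        # intends to exploit us, so escalate to further scrutiny instead.
        return "defer: invest more computing time scrutinising this argument"
    return "act: " + arg.action

print(decide(Argument(action="pay the mugger", claimed_utility=1e30,
                      credence=1e-20, supplied_by_other_agent=True)))
```

The point of the sketch is only that the filter keys on the shape of the argument rather than on whether the agent believes this particular mugger, which mirrors the exploitability framing above.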

We also see here some hint of a connection between the insights that drove the “correct solution” to what a logical inductor should be and the “correct solution” to what the best decision theory should be. Exploring whether this connection generalises in some way to other domains is another line of thought that deserves further examination.
