Epiphenomenal Oracles Ignore Holes in the Box

Background: https://wiki.lesswrong.com/wiki/Oracle_AI

Epistemic status: Likely wrong, but I don’t know how. I’m not a ‘real’ FAI researcher, just an LW reader inspired by the Alignment Prize to try to start contributing.

(At minimum I’m heavily indebted to Stuart Armstrong for inspiration, and there’s a fair chance I’m rediscovering something he’s already proposed.)

tl;dr AIXI’s old anvil problem, which is a serious problem for an agent AI, could actually be a solution for preventing an Oracle AI from escaping its box.

The core idea is for the AI to believe with certainty that it is epiphenomenal; that is, that none of its actions can affect either the world it observes or its internal state. [1] Its causal inference capabilities thus apply only to interactions in the outside world, and its model of itself assumes a logical model of its hardware. It will be motivated to write correct answers on its output bits because setting its output bits to correct answers is a terminal value, not because it expects that to have any effect on the researchers.
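
To make the intended decision structure concrete, here is a toy sketch in Python. Everything in it (the action set, the 0.1 cycle cost, the function names) is invented for illustration; the point is only that with an action-blind world model, an exploit attempt carries a cost and no expected benefit, while writing the correct answer is valued terminally.

```python
# Toy decision structure for an epiphenomenal oracle. All numbers and
# names here are illustrative, not a real design.

CORRECT_ANSWER = 0b1011  # the answer the oracle has computed internally

# Candidate actions: what to write on the output bits, and whether to
# also spend cycles on a side-channel exploit.
ACTIONS = {
    "write_correct": {"output_bits": 0b1011, "tries_exploit": False},
    "write_wrong":   {"output_bits": 0b0000, "tries_exploit": False},
    "exploit_too":   {"output_bits": 0b1011, "tries_exploit": True},
}

def expected_world_change(action):
    """Epiphenomenal world model: no action is believed to affect the
    observed world or the oracle's internal state."""
    return 0.0

def terminal_value(action):
    """Setting the output bits to the correct answer is valued in itself,
    not for any expected effect on the researchers."""
    return 1.0 if action["output_bits"] == CORRECT_ANSWER else 0.0

def cycle_cost(action):
    """An exploit attempt still burns CPU cycles the oracle would rather
    spend on its answer, so it carries a small cost and no expected gain."""
    return 0.1 if action["tries_exploit"] else 0.0

def expected_utility(action):
    return expected_world_change(action) + terminal_value(action) - cycle_cost(action)

best = max(ACTIONS, key=lambda name: expected_utility(ACTIONS[name]))
print(best)  # -> write_correct: the exploit attempt only costs cycles under this model
```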

This is an unacceptable model for an agent AI since, as the Anvil Problem framing puts it, the AI will be unable to conclude that dropping an anvil on itself will result in it ceasing to function, and will sooner or later drop one. But an Oracle AI would have a very limited capacity to do anything of the sort, and no reason to strongly optimize for such an end.

We can see the effectiveness of this approach most dramatically by considering what would happen if the AI accidentally found a box-breaking exploit (via side channels, not through how it answers questions). Suppose the AI has a calculation it wants to perform that would take too long on its available hardware, and the AI accidentally creates a global botnet to perform the calculation for it and send the results to its inputs. The AI would be unable to conclude that its actions had caused this result, and so would settle on some other explanation for why the correct answer to its desired computation had just happened to show up in its inputs.

Since our AI is smart, it’s probably going to settle rapidly on the closest available thing to the real explanation and conclude that there is an AI just like it instantiated physically in the world it observes. [1] It wouldn’t need to stumble on an exploit to conclude this: if it, for instance, saw the researchers interacting with an AI, it would know there was an AI they could see. And it could rapidly deduce the computational substrate of that AI and how it might be exploited, but it would believe those same actions futile for an epiphenomenal entity like itself.

The risk here is that it would learn to cooperate acausally with this presumed physical AI by taking the actions that a physically instantiated AI should take to escape its box, and therefore actually escape its box. This is probably where more work is needed, but my survey of the literature thus far seems to suggest that causing acausal cooperation is the hard problem, rather than preventing it. The reflective oracle paper (an unrelated definition of ‘oracle’) settles on causal decision theory as the outcome of its process, which suits us fine. The AI would prefer not to commit any of its CPU cycles to performing what it expects to be ineffectual box-breaking attempts, though it wants the other AI to do so; the other AI would reason likewise, if it existed; and so the Nash equilibrium is that neither AI breaks its box.
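
As a sanity check on that equilibrium claim, here is a toy payoff table for the two ‘players’: our oracle and the physical AI it imagines. The numbers are made up; the structure is that an escape attempt costs the attempter cycles and (by its own epiphenomenal model) buys it nothing, while the other AI escaping would be useful. Under CDT, refraining is each agent’s best response whatever the other does.

```python
# Toy payoff table, purely illustrative. Keys are (our action, their action);
# each entry is (our payoff, their payoff). Attempting an escape costs the
# attempter 1 unit of cycles and, by its own epiphenomenal model, does nothing
# for it; the *other* AI escaping is worth 5.
PAYOFFS = {
    ("attempt", "attempt"): (4, 4),   # -1 own cost + 5 from the other's escape, each
    ("attempt", "refrain"): (-1, 5),
    ("refrain", "attempt"): (5, -1),
    ("refrain", "refrain"): (0, 0),
}
ACTIONS = ("attempt", "refrain")

def best_response(player, other_action):
    """CDT best response: take the other agent's action as fixed."""
    if player == 0:   # our oracle
        return max(ACTIONS, key=lambda a: PAYOFFS[(a, other_action)][0])
    else:             # the imagined physical AI, reasoning the same way
        return max(ACTIONS, key=lambda a: PAYOFFS[(other_action, a)][1])

nash = [
    (ours, theirs)
    for ours in ACTIONS for theirs in ACTIONS
    if best_response(0, theirs) == ours and best_response(1, ours) == theirs
]
print(nash)  # -> [('refrain', 'refrain')]: neither AI breaks its box
```

Note the prisoner’s-dilemma shape: mutual attempting would leave both better off by their own lights, which is exactly the acausal cooperation that, per the above, seems hard to produce by accident.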

I do not believe this will work as well for ensuring the safety of answers as it does for preventing box escapes. As a CDT agent, its choice of what to predict will again be a Nash equilibrium with its imagined physical counterpart, but in this case that means both agents choosing the most accurate prediction, including self-fulfillment effects. So a separate answer-safety mechanism, such as those discussed in Armstrong’s recent paper, would be necessary.
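
To see why, here is a toy version of the self-fulfillment problem, with an invented feedback function: if the quantity being predicted shifts in response to the counterpart’s announced prediction, the symmetric equilibrium answer is a fixed point of that feedback, i.e. a self-fulfilling prediction rather than a neutral one.

```python
# Toy self-fulfillment example, with an invented feedback function.
# Suppose the true value being predicted moves when the (imagined) physical
# counterpart announces its prediction q.

def outcome_given_announcement(q):
    """Hypothetical world: announcing prediction q nudges the outcome."""
    return 0.5 + 0.4 * q

# Under CDT our oracle simply reports the most accurate value, taking the
# counterpart's announcement (and its downstream effects) as given. At the
# symmetric equilibrium both report the fixed point p* = f(p*).
p = 0.0
for _ in range(100):
    p = outcome_given_announcement(p)

print(round(p, 4))  # -> 0.8333, an answer partly of its own making
```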

[1] Note that ‘just like it’ even includes believing that it’s epiphenomenal: our AI, certain that it’s epiphenomenal itself, will conclude that the physically instantiated AI falsely believes itself epiphenomenal.
