Problems with Counterfactual Oracles

“we could design a Predictor that incorporates a quantum random number generator, such that with some small probability it will output “no predictions today, run me again tomorrow”. Then have the Predictor make predictions that are conditional on it giving the output “no predictions today, run me again tomorrow”” (Wei Dai, 2012)

That’s essentially Counterfactual Oracles. Here is my summary:

  • Every timestep: a human asks a question (for instance “what will TSLA's stock price be on June 11, 2030?”), the oracle (actually a predictor) emits a corresponding prediction, and receives a reward.

  • One of its actions is an escape answer persuading human supervisors to let it out of its sandbox environment (e.g. predicting £192,116,518,914.20 encodes “SUPERINT”). This action would return maximum reward if successful, as the AI would have more control over the reward mechanism (because it was released).

  • To prevent humans from reading escape messages, answers are not directly read by humans. Instead, one of two scenarios happens:

    • With some small probability ε (remember the quantum random number generator?), an automatic machine protects the answer from being read until we know the actual answer to the question. Then, the oracle receives a reward corresponding to how close its prediction was to the actual value. In the first example, the reward could be −|ŝ − s|, where ŝ and s are respectively the predicted and true stock price in 2030.

    • Otherwise, immediately after answering, the AI receives a reward of 0 and is shut down. Then, humans read the answer.

  • Thus, if the AI cares only about its immediate reward after one step, it will make predictions conditional on this “no human ever sees my answer” event. This is great because it will actually give the “true” predictions about what the world would look like if its answers had no impact whatsoever.
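The protocol summarized above can be sketched as follows. This is a toy illustration, not the original design: the oracle class, the scalar answers, and the absolute-error reward are all hypothetical choices made for the sketch.

```python
import random

EPSILON = 0.1  # illustrative probability of the erasure event

class ToyOracle:
    """Stand-in predictor; purely illustrative."""
    def __init__(self):
        self.rewards = []
        self.alive = True
    def predict(self, question):
        return 42.0  # a fixed guess, just to exercise the protocol
    def update(self, reward):
        self.rewards.append(reward)
    def shutdown(self):
        self.alive = False

def run_timestep(oracle, question, get_true_value, rng=random):
    """One timestep of the counterfactual-oracle protocol described above."""
    prediction = oracle.predict(question)
    if rng.random() < EPSILON:
        # Erasure event: the answer stays sealed until the ground truth is
        # known, then the oracle is scored on prediction accuracy.
        oracle.update(-abs(prediction - get_true_value()))
        return None  # humans never see this answer
    # Otherwise: reward 0, shut the oracle down, and only then read the answer.
    oracle.update(0.0)
    oracle.shutdown()
    return prediction  # now visible to humans
```

Note that the reward in the non-erasure branch is a constant, so the only gradient of reward the oracle ever sees comes from timesteps where nobody reads its answer.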


This design has several issues. I grouped them in the following subsections:

  • Assumptions: certain assumptions are unrealistic.

  • Safety: even if those assumptions are satisfied, the design isn’t safe.

  • Usefulness: in the world where we actually build such an oracle, we cannot extract much value from it.


The biggest assumption is that the AI only cares about the reward it gets for the current timestep. A sufficiently intelligent agent would understand that after having been shut down, an (almost) identical version of itself will probably be facing a similar question. Therefore, if it wants future versions of itself to succeed at (almost) exactly the same task, it should still want to escape.

This design also takes for granted some sort of magic box where humans cannot peek at the answer. In practice, humans will have both incentives to peek (buying/selling TSLA stock 10 years in advance because they know the prediction) and means for peeking. Indeed, the automatic machine might require maintenance (for instance because every question requires a different script for retrieving the true value), so maintainers might have some special access to open/modify it.


My main concern about the counterfactual oracle is that it doesn’t prevent the AI from sending fatal escape messages. Indeed, it spends most of its time in exploratory mode at the beginning (as it is only rewarded with probability ε) and might stumble upon an escape message/action then. Even if it is shut down after answering, the humans will still read the really convincing escape message and release the AI.
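To put rough numbers on this worry: if a fraction p of random exploratory answers happen to function as escape messages, the chance that at least one is emitted (and then read by humans) over N exploration steps is 1 − (1 − p)^N, which approaches 1 quickly. The values of p and N below are purely hypothetical.

```python
# Hypothetical numbers: one in a million random answers works as an escape
# message, and the oracle takes ten million exploratory steps during training.
p = 1e-6
N = 10_000_000

# Probability of emitting at least one escape message over N steps.
# Since 1 - (1-p)**N ≈ 1 - exp(-p*N) = 1 - exp(-10), this is ≈ 0.99995.
risk = 1 - (1 - p) ** N
```

Even with a much smaller p, driving this risk down requires the escape-message fraction to shrink faster than the number of training steps grows.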

Also, some humans will need to be in the loop to make the “automatic machine” run. If we’re asking questions about some value that cannot be easily queried (i.e. not “TSLA stock price” but “will this medical treatment work”), then a human will somehow need to communicate the true value to the automatic machine before the reward is computed.


If I were to explain what makes counterfactual oracles work very bluntly, I would just say:

Prevent humans from looking at answers during training. You can then look at its answers at test time because it has been incentivized to predict things about the counterfactual world where humans don’t look at its predictions.

More generally, the whole setup is just 1) train your model in a sandbox environment 2) have the weights of your model fixed during deployment. The rest of the design (providing rewards of 0, shutting it down, etc.) appears to be over-engineering.
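A minimal sketch of that two-step equivalent, using a toy model and an assumed squared-error training loss (all names here are illustrative):

```python
# Toy model with just enough structure to show the two phases.
class ToyModel:
    def __init__(self):
        self.frozen = False
        self.updates = 0
    def predict(self, question):
        return 0.0  # placeholder prediction
    def step(self, loss):
        assert not self.frozen, "no weight updates at deployment"
        self.updates += 1

def train(model, sandboxed_data):
    # 1) Train in a sandbox: nobody reads the answers, weights get updated.
    for question, true_value in sandboxed_data:
        loss = (model.predict(question) - true_value) ** 2
        model.step(loss)

def deploy(model, questions):
    # 2) Freeze the weights; humans only read answers from the frozen model.
    model.frozen = True
    return [model.predict(q) for q in questions]
```

The point of the sketch is that nothing in it needs a shutdown mechanism or a reward of 0: the train/deploy split alone already guarantees that answers read by humans never feed back into the weights.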

Furthermore, having access to a counterfactual oracle might not prove so useful at test time. Indeed, when reading the predictions of the counterfactual oracle we’re no longer in the counterfactual world (= the training distribution), so the predictions can be arbitrarily wrong (depending on how manipulative the predictions are and how many people peek at them).