Counterfactual Oracles = online supervised learning with random selection of training episodes

Most people here probably already understand this by now, so this is more to prevent new people from getting confused about the point of Counterfactual Oracles (in the ML setting), because there’s not a top-level post that explains it clearly at a conceptual level. Paul Christiano does have a blog post titled Counterfactual oversight vs. training data, which talks about the same thing as this post, except that he uses the term “counterfactual oversight”, which is just Counterfactual Oracles applied to human imitation (which he proposes to use to “oversee” some larger AI system). But the fact that he doesn’t mention “Counterfactual Oracle” makes it hard for people to find that post or to see the connection between it and Counterfactual Oracles. And as long as I’m writing a new top-level post, I might as well try to explain it in my own words. The second part of this post lists some remaining problems with oracles/predictors that are not solved by Counterfactual Oracles.

Without further ado, I think that when the Counterfactual Oracle is translated into the ML setting, it has three essential characteristics:

  1. Supervised training—This is safer than reinforcement learning because we don’t have to worry about reward hacking (i.e., reward gaming and reward tampering), and it eliminates the problem of self-confirming predictions (which can be seen as a form of reward hacking). In other words, if the only thing that ever sees the Oracle’s output during a training episode is an automated system that computes the Oracle’s reward/loss, and that system is secure because it’s just computing a simple distance metric (comparing the Oracle’s output to the training label), then reward hacking and self-confirming predictions can’t happen.

    • Independent labeling of data—This is usually taken for granted in supervised learning but perhaps should be explicitly emphasized here. To prevent self-confirming predictions, the labeling of data has to be done without causal influence from the Oracle. That is, the Oracle’s output should be isolated in some protected system until it’s time to compute the reward/loss, so that it can’t affect how the training label is generated. But note that it’s perfectly fine for humans or other systems to look at the question/input that is given to the Oracle in order to generate the training label.

  2. Online learning—The Oracle never stops learning, so it can eventually adjust to any distributional shift. (But note that an Oracle implemented using current ML techniques might perform very badly, possibly in a malign way, while it’s still adjusting.)

  3. Random selection of training episodes—I’ll just quote Paul here: “Counterfactual oversight consists of labelling a random subset of data and using it as online training data. The key difference is that any given data point may become a training data point, with the decision made after the learning system has made a decision about it. As long as the randomization is unpredictable to the learner, this gives us a formal guarantee that there can’t be any noticeable difference between the training and test data. And therefore if our learner behaves well on training data, it really must behave well on test data.” (Note that what Stuart Armstrong calls “erasure” just means that the current episode has been selected as a training episode.) A code sketch of how these three characteristics fit together follows below.
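To make the three characteristics concrete, here is a minimal Python sketch of a single episode, assuming a made-up interface: `oracle.predict` and `oracle.update` stand in for whatever model and online-update step is being used, `get_label` for the independent labeling process, `consumer` for whatever downstream system would act on a released answer, and `TRAINING_PROB` for the (arbitrary) randomization probability. None of these names come from any existing proposal or library; the sketch only shows where the random selection, the independent labeling, and the online update sit relative to each other.

```python
import random

TRAINING_PROB = 0.1  # chance that an episode becomes a training episode ("erasure"); placeholder value


def simple_distance(prediction, label):
    # Stand-in for whatever fixed, automated metric compares output to label.
    return 0.0 if prediction == label else 1.0


def run_episode(oracle, question, get_label, consumer):
    """One episode of online supervised learning with random selection of
    training episodes, in the style of a Counterfactual Oracle."""
    # The Oracle's output stays in a protected buffer; nothing downstream sees it yet.
    prediction = oracle.predict(question)

    # Random selection of training episodes, decided only AFTER the Oracle has
    # produced its answer, using randomness the learner can't predict.
    if random.random() < TRAINING_PROB:
        # Training episode ("erasure"): the prediction is never released.
        # The label is generated from the question alone, with no causal
        # influence from the prediction, so predictions can't be self-confirming.
        label = get_label(question)
        # The only thing that ever sees the output is this simple, automated
        # loss computation, so there is no reward channel to hack.
        loss = simple_distance(prediction, label)
        oracle.update(question, label, loss)  # online supervised-learning step
    else:
        # Non-training episode: the prediction is released and used, but no
        # training signal is ever derived from what happens afterwards.
        consumer(prediction)
```

The structurally important points are that the training/non-training branch is decided only after `oracle.predict` has run, and that in the training branch the prediction never leaves the protected buffer, so it can’t influence `get_label` or anything else in the world.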

Remaining Safety Problems

Counterfactual Oracles solve (or are a proposal to solve) some safety problems associated with predictors/oracles, but others remain. Besides the distributional shift problem mentioned above, here are a few more that come to mind. Note that these are not problems specific to Counterfactual Oracles, but Stuart Armstrong’s Safe Uses of AI Oracles seems wrong or misleading when it says “This paper presented two Oracle designs which are both safe and useful.”

  1. Benign Induction—This is a concern that the Oracle might come to believe (perhaps correctly in some sense) that we probably live in a simulation or Zoo Earth, and that at some specific point the alien simulation runners or zookeepers will intervene in the simulation or zoo to change the training signal for the Oracle, causing the Oracle (on both the fake and real Earths) to output something that, for example, causes our world to be taken over by the aliens.

  2. Reward Takeover on Real Earth—The Oracle may predict a future in which its training data / reward signal comes under the control of a hostile agent that will arise or already exists on the real Earth (e.g., an unaligned AI or an unfriendly government), causing it to output something that benefits that hostile agent at the expense of its current user.

  3. Safe Approximation—If the Oracle is not able to exactly predict something, could its attempt at approximation cause safety problems? For example, if we use it to predict human behavior, could the approximate prediction be unsafe because it ends up predicting a human-like agent with non-human values?

  4. Inner alignment—The ML training process may not produce a model that actually optimizes for what we intend for it to optimize for (namely, minimizing loss for just the current episode, conditional on the current episode being selected as a training episode).

(I just made up the names for #2 and #3, so please feel free to suggest improvements for them.)