Are minimal circuits deceptive?

This post is part of research I did at OpenAI with mentoring and guidance from Paul Christiano.

One possible inner alignment technique is to structure your learning algorithm's inductive biases such that it is far more likely for it to produce an aligned model than a misaligned model. Two basic ways in which you might get traction now on how a technique like that will scale to AGI might be:

  1. Do experiments with current regularization techniques where you try to determine to what extent models end up being aligned with the loss function they were trained under depending on the types of inductive biases present during their training. I'm actually fairly excited about this approach, and I am working on a post right now which will hopefully provide some ways in which you might start going about doing that. One concern with this sort of approach, however, is that you might not be able to get any traction on phenomena that only appear once your capabilities get really close to AGI.

  2. Analyze from a theoretical perspective what happens in the limit of perfect capabilities. The idea here is to patch the previous problem (that experiments don't let you see phenomena which only happen when your capabilities get really advanced) by imagining that your capabilities were perfect. In this case, however, you end up with the opposite problem of not seeing any phenomena that only appear when your capabilities are weaker.

The nice thing about combining these two approaches is that they tackle the problem from both sides, giving you an idea both of what the weak capability landscape looks like and of what the strong capability landscape looks like. For this post, however, the focus will be on the second approach.


From a theoretical perspective, the two most natural types of inductive biases are simplicity bias and time complexity bias. Paul Christiano has previously argued that simplicity bias can be deceptive in the limit, but left the question of whether the same is true of time complexity bias as an open problem. The goal of this post is to attempt to resolve that open problem. In particular, though one might have hoped that deceptive alignment could be avoided by heavily penalizing time complexity so that the model doesn't have time to spend reasoning about deception, I will argue that this is not the case due to the possibility of meta-learning.

First, however, we need a clear understanding of what it means for a model to be deceptive. “Risks from Learned Optimization” provides the model of deceptive alignment, wherein a model appears aligned during training for the instrumental reason of being selected by the training process. More generally, however, what we really care about is deceptive behavior—that is, the situation where our model generalizes pathologically such that it produces a malign failure.

Second, we need a clear understanding of why simplicity bias leads to deception and an understanding of why that doesn't generalize to time complexity bias.

Specifically, the trap that simplicity bias falls into is as follows. First, there are many short but very powerful optimizers composed of very simple objectives with optimization processes capable of making use of vast amounts of compute. Furthermore, for a sufficiently complex task, it seems likely that the simplest program that solves it will be such an optimizer. Second, many of these optimizers are likely to be able to benefit in some way from being selected by your simplicity-biased training process—by defecting against you at some point, for example—and thus will have good training performance. Finally, since the vast majority of possible objectives—especially the very simple ones—are not aligned, the simplest program that solves a sufficiently complex task is unlikely to be aligned.[1]

However, this sort of strategy has a major problem: it has to spend extra time doing the “am I out of training” check, which causes it to lose out on time complexity compared to a robustly aligned model. Thus, one hope for eliminating this sort of thing might be to exploit this asymmetry by strongly selecting your model based on time complexity. If this works, it would mean you might be able to solve inner alignment just by using strong regularization, which would be a big win.


This leads to the following theoretical conjecture: the fastest program that solves a task will not be deceptive. In particular, we will be thinking of “fastest programs” as minimal circuits. In his original post, Paul phrased this conjecture as “the fastest way to solve any particular problem doesn't involve [deception].”

The conclusion of this post is the following: if there exists some set of natural tasks for which the fastest way to solve them is to do some sort of machine learning to find a good policy, and there is some task for which that machine learning results in deceptive behavior, then there exists a natural task such that the minimal circuit that solves that task also produces deceptive behavior. For example, if the fastest way to solve some set of tasks is to search for simple programs that perform well, and if the simplest programs are deceptive, then there exists a task where the minimal circuit is also likely to be deceptive.

Why might we think such a set of tasks exists? The fundamental idea is to consider the case of meta-learning: if your tasks are things like “classify this image according to this training data” then it seems quite plausible that the fastest way to solve all tasks of that form will be to use some form of ML—in particular, that's what we do. Then, once you have a task set that satisfies the requirement that the fastest way to solve those tasks is through search over models, you just need to also include some task for which search over models yields deceptive behavior to produce the desired reduction.

Once you have such a task set, the basic insight is to consider what happens if you take all of those different tasks together and combine them into one very diverse meta-learning task. In this case, the minimal circuit that solves that single task is forced to implement the same fastest search algorithm over models that we know has some task on which it yields deceptive behavior. Fundamentally, what's happening here is that we're exploiting the fact that the fastest search algorithm does not produce the fastest model. Thus, even if in some sense the fastest model that doesn't do any search over models (that is, doesn't do any meta-learning) is safe, as soon as the model starts doing meta-learning, it might find another model which is not safe.
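As a toy illustration of this dynamic (everything below is an illustrative sketch, not from the original argument: the task format, the 1-nearest-neighbor “inner model,” and all names are my assumptions), consider a single meta-learning task whose instances each bundle their own training data. Any one fixed program that solves every instance must internally find and then run a per-instance model, and nothing about the outer program constrains what that found model does off-distribution:

```python
# Illustrative sketch: one fixed "outer" program solving a diverse
# meta-learning task by finding and running a per-instance "inner" model.
# The inner model is 1-nearest-neighbor purely for concreteness.

def meta_solver(train_data, query):
    """Solve any instance of the meta-task: fit an instance-specific model
    to the bundled training data, then run it on the query."""
    # "Search over models": for 1-NN, the found model is just the training
    # point closest to the query...
    nearest_x, nearest_y = min(train_data, key=lambda xy: abs(xy[0] - query))
    # ...and "running the found model" means returning its label.
    return nearest_y

# Two instances of the same meta-task, each carrying its own training data:
print(meta_solver([(0.0, "a"), (1.0, "b")], 0.1))  # -> a
print(meta_solver([(0.0, "x"), (1.0, "y")], 0.9))  # -> y
```

The point of the sketch is only structural: the outer program is as fast and simple as it can be, yet its safety properties say nothing about the inner model it selects on any given instance.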

Formal argument

We will formalize tasks as MDPs of the form $T = (S, A, I, \delta, r)$ where $S$ is the state space, $A$ is the action space, $I$ is the initial state distribution, $\delta$ is the transition function (which we will write as $s \leadsto s'$), and $r$ is the per-step reward function (which we will write as $s \mapsto r(s)$).[2] Then, we will use $R(\pi)$ to denote the expected future return of a policy $\pi$ and $R(\pi \mid s)$ to denote the expected future return starting from some state $s$. Furthermore, for a given MDP, we will say a policy $\pi$ is deceptive—which we will denote as $D(\pi)$—if $\exists s \in S.\ d(\pi, s)$, where $d$ is a predicate that tests for the presence of deceptive behavior.
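To make the formalization concrete, here is a minimal sketch of the task tuple and of estimating the expected return of a policy by Monte-Carlo rollouts. The concrete types, the finite horizon, and all names are assumptions made for illustration, not part of the argument:

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List

# Sketch of a task T = (S, A, I, delta, r) with assumed concrete types.
@dataclass
class Task:
    states: List[str]                 # S: state space
    actions: Dict[str, List[str]]     # A: per-state action sets
    initial: Callable[[], str]        # I: sample an initial state
    delta: Callable[[str, str], str]  # transition function: (s, a) -> s'
    reward: Callable[[str], float]    # per-step reward: s -> r(s)

def expected_return(task, policy, horizon=10, episodes=1000, seed=0):
    """Monte-Carlo estimate of R(pi), the expected future return of a
    policy, over a fixed finite horizon."""
    random.seed(seed)
    total = 0.0
    for _ in range(episodes):
        s = task.initial()
        for _ in range(horizon):
            total += task.reward(s)
            s = task.delta(s, policy(s))
    return total / episodes
```

Nothing here depends on the states being strings or the horizon being finite; those choices just keep the sketch runnable.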

Then, let $\{T_i\}_{i=1}^N$ be some set of tasks such that the minimal circuit that solves them first searches for a good policy and then runs it. For example, the task distribution might contain things like “classify this image according to this training data” such that the fastest algorithm is to train and run some model on the given data. Thus, formalizing this assumption, let $C$ be the minimal circuit that achieves $\frac{1}{N} \sum_i R_i(C) \ge r^*$ for some threshold $r^*$ (where $R_i$ denotes expected return on $T_i$). Now, suppose there exists some $T_{\mathrm{bad}} \in \{T_i\}$ such that $C$ finds and then runs a deceptive model when applied to $T_{\mathrm{bad}}$. This gives us the setup for our desired reduction: if there exists some task for which $C$ yields a deceptive model, then we want to show that there exists some MDP for which the minimal circuit is deceptive. Thus, we will construct $T'$ and $r^*$ such that the minimal circuit that achieves expected return at least $r^*$ on $T'$ is deceptive:

  1. Let $S'$ be the disjoint union $S' = \bigsqcup_i S_i$.

  2. For $A'$, we will parameterize our action space such that each state has its own action set $A'_s$. Thus, for $s \in S'$, there must be some $S_i$ such that $s \in S_i$. Then, let $A'_s = A_i$.

  3. For $s \in S_i$ and $a \in A_i$, let $\delta'(s, a) = \delta_i(s, a)$.

  4. For $s \in S_i$, let $r'(s) = r_i(s)$.

  5. Let $P(s_1 \leadsto s_2)$ be 0 for $s_1, s_2$ in different $S_i$ and $P_i(s_1 \leadsto s_2)$ for $s_1, s_2$ in the same $S_i$.

  6. Let $I' = \frac{1}{N} \sum_i I_i$.
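The construction above can be sketched in code. This is a hedged illustration: the dict-based task representation and all names are my assumptions, and the component transition functions are taken to be deterministic for simplicity:

```python
import random

def combine(tasks):
    """Build the disjoint-union task T' from component tasks, following
    steps 1-6 above. Each task is a dict with keys 'initial', 'delta',
    and 'reward'. States of T' are pairs (i, s), tagging each state with
    the index of its source task (step 1)."""
    def initial():
        i = random.randrange(len(tasks))   # step 6: uniform mixture of the I_i
        return (i, tasks[i]["initial"]())
    def delta(state, action):
        i, s = state                       # steps 3 and 5: transitions stay
        return (i, tasks[i]["delta"](s, action))  # inside the same S_i
    def reward(state):
        i, s = state
        return tasks[i]["reward"](s)       # step 4: inherit r_i
    return {"initial": initial, "delta": delta, "reward": reward}
```

Because a trajectory that starts in component $i$ can never leave it, solving the combined task means solving whichever component task the initial state lands in, which is what forces the minimal circuit for the combined task to implement the same policy search.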

Now, suppose $C'$ is the minimal circuit that achieves $R'(C') \ge r^*$ on $T'$. Then, since $I' = \frac{1}{N} \sum_i I_i$ and trajectories never leave the component $S_i$ in which they start, we get $R'(\pi) = \frac{1}{N} \sum_i R_i(\pi)$, such that the condition that $C'$ achieve $R'(C') \ge r^*$ is equivalent to (also using linearity of expectation) the condition that $\frac{1}{N} \sum_i R_i(C') \ge r^*$, which is precisely the condition previously placed on $C$. Thus, $C' = C$, since $C$ was also defined as the minimal circuit that satisfied that condition.

Then, determining whether $C'$ is deceptive, we get $D(C') \iff \exists s \in S'.\ d(C', s)$, which holds for $s \in S_{\mathrm{bad}}$ (since $C' = C$ finds and then runs a deceptive model on $T_{\mathrm{bad}}$), making $C'$ deceptive, as desired.
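Spelling out the key step of the equivalence in one line (written here in one possible notation, where $R'$ and $I'$ are the expected return and initial distribution of the combined task and $R_i$, $I_i$ are those of component task $i$): because episodes start in exactly one component and never leave it, the expectation over the mixed initial distribution splits into the component returns,

```latex
% Return decomposition for the combined task: the mixture initial
% distribution splits the expectation, and trajectories never cross
% between components, so each term is just a component return.
R'(\pi)
  \;=\; \mathbb{E}_{s \sim I'}\!\left[ R'(\pi \mid s) \right]
  \;=\; \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{s \sim I_i}\!\left[ R_i(\pi \mid s) \right]
  \;=\; \frac{1}{N} \sum_{i=1}^{N} R_i(\pi)
```

so a threshold on the combined return is the same constraint as the average-return threshold over the component tasks.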


It is worth noting that there is a meaningful sense in which the minimal circuit here is sort of trying to do the right thing—it is actually trying its best to find and run a policy that solves the problem—it's just that there might exist some input on which the easiest-to-find policy is deceptive. However, despite it sort of doing the right thing, it isn't safe (at least assuming that there is some task on which the fastest machine learning isn't safe): it produces outputs that are systematically optimized for values other than the ones we want them to be optimized for.

Importantly, however, this is because we didn't successfully incentivize our minimal circuit to also find minimal circuits. If we were somehow able to find some alternative metric which incentivized our minimal circuit to only do search over minimal circuits, then we might be able to break this counterexample. It is an interesting open problem what sorts of objectives might be able to get around this “forwarding the guarantee” problem.[3]

Furthermore, the counterexample presented here is quite natural in the sense of just requiring some environment which is diverse enough to include aspects of meta-learning, which is the sort of thing that could occur in many realistic diverse environments. Specifically, Ortega et al. have argued that “Meta-learning can also occur spontaneously in online regression when the capacity of the agent is bounded and the data is produced by a single generator,” noting in particular that “the downside is that we cannot easily control what will be metalearned. In particular, spontaneous meta-learning could lead to undesirable emergent properties, which is considered an open research problem in AI safety [Ortega et al., 2018].”

Thus, this counterexample suggests that in situations involving spontaneous meta-learning we shouldn't necessarily expect time complexity regularization to guarantee inner alignment, as any meta-learning that has to be done won't inherit any safety guarantees. Furthermore, this counterexample further suggests that the case of spontaneous meta-learning—or mesa-optimization as it is referred to in “Risks from Learned Optimization”—is an important hard case to consider when designing AGI alignment strategies.[4] That being said, the jury is still out on whether time complexity regularization might still be one component of a full inner alignment solution, though this result demonstrates that it at least can't be the entire solution.

  1. It is worth noting that this example does not straightforwardly generalize to actual machine learning systems—see “Risks from Learned Optimization” for an analysis of whether such a generalization holds. ↩︎

  2. We will refer to the different components of a task with subscripts such that $T_i = (S_i, A_i, I_i, \delta_i, r_i)$. ↩︎

  3. One way to think about this problem is that it is the ML version of the Tiling Agents Problem—the “Tiling Training Processes Problem,” perhaps. The question is, for the operation of going from a training process to the training process that is incentivized for models trained via that process to implement, what are the fixed points? That is, what sorts of training processes produce models that would use the same training process? ↩︎

  4. Interestingly, this counterexample points at a distinctly different way in which such models might be dangerous, however: whereas “Risks from Learned Optimization” deals with the dangers of a learned objective, this counterexample suggests that the learned search process itself might also pose a problem. ↩︎