Towards a mechanistic understanding of corrigibility


To be able to use some­thing like re­laxed ad­ver­sar­ial train­ing to ver­ify a model, a nec­es­sary con­di­tion is hav­ing a good no­tion of ac­cept­abil­ity. Paul Chris­ti­ano de­scribes the fol­low­ing two desider­ata for any no­tion of ac­cept­abil­ity:

  1. “As long as the model always be­haves ac­cept­ably, and achieves a high re­ward on av­er­age, we can be happy.”

  2. “Re­quiring a model to always be­have ac­cept­ably wouldn’t make a hard prob­lem too much harder.”

While these are good con­di­tions that any no­tion of ac­cept­abil­ity must satisfy, there may be many differ­ent pos­si­ble ac­cept­abil­ity pred­i­cates that meet both of these con­di­tions—how do we dis­t­in­guish be­tween them? Two ad­di­tional ma­jor con­di­tions that I use for eval­u­at­ing differ­ent ac­cept­abil­ity crite­ria are as fol­lows:

  1. It must not be that hard for an amplified overseer to verify that a model is acceptable.

  2. It must not be that hard to find such an acceptable model during training.

These conditions differ from Paul’s second condition in that they are statements about the ease of training an acceptable model rather than the ease of choosing an acceptable action. If you want to be able to do some form of informed oversight to produce an acceptable model, however, these are some of the most important conditions to pay attention to. Thus, I generally think about choosing an acceptability condition as trying to answer the question: what is the easiest-to-train-and-verify property such that all models that satisfy that property[1] (and achieve high average reward) are safe?

Act-Based Corrigibility

One pos­si­ble can­di­date prop­erty that Paul has pro­posed is act-based cor­rigi­bil­ity, wherein an agent re­spects our short-term prefer­ences, in­clud­ing those over how the agent it­self should be mod­ified. Not only is such an agent cor­rigible, Paul ar­gues, but it will also want to make it­self more cor­rigible, since hav­ing it be more cor­rigible is a com­po­nent of our short-term prefer­ences (Paul calls this the “broad basin” of cor­rigi­bil­ity). While such act-based cor­rigi­bil­ity would definitely be a nice prop­erty to have, it’s un­clear how ex­actly an am­plified over­seer could go about ver­ify­ing such a prop­erty. In par­tic­u­lar, if we want to ver­ify such a prop­erty, we need a mechanis­tic un­der­stand­ing of act-based cor­rigi­bil­ity rather than a be­hav­ioral one, since be­hav­ioral prop­er­ties can only be ver­ified by test­ing ev­ery in­put, whereas mechanis­tic prop­er­ties can be ver­ified just by in­spect­ing the model.
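As a toy illustration of the behavioral/mechanistic distinction, consider a one-parameter-pair linear model on the interval [0, 1] and the property "the output is never negative." The model, the property, and the function names below are all hypothetical choices made for this sketch; real models are of course far harder to inspect.

```python
def make_linear_model(weight, bias):
    """Return f(x) = weight * x + bias (a stand-in for a trained model)."""
    return lambda x: weight * x + bias


def behavioral_check(model, inputs):
    """Behavioral verification: only tells us about the inputs we tried."""
    return all(model(x) >= 0.0 for x in inputs)


def mechanistic_check(weight, bias):
    """Mechanistic verification: inspect the parameters directly.

    A linear function on [0, 1] attains its minimum at an endpoint,
    so checking f(0) = bias and f(1) = weight + bias covers every input.
    """
    return min(bias, weight + bias) >= 0.0


weight, bias = -2.0, 1.0
model = make_linear_model(weight, bias)

# The sampled inputs all look fine, yet f(1) = -1 violates the property.
passes_behavioral = behavioral_check(model, [0.0, 0.1, 0.2, 0.3])
passes_mechanistic = mechanistic_check(weight, bias)
```

Here the behavioral test passes on the sampled inputs while the mechanistic test fails, because inspecting the parameters covers inputs that were never sampled.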

One possible mechanistic understanding of corrigibility is corrigible alignment as described in “Risks from Learned Optimization,” which is defined as the situation in which “the base objective is incorporated into the mesa-optimizer’s epistemic model and [the mesa-optimizer’s] objective is modified to ‘point to’ that information.” While this gives us a starting point for understanding what a corrigible model might actually look like, there are still several missing pieces that have to be filled in. Furthermore, this notion of corrigibility looks more like instrumental corrigibility than act-based corrigibility, and instrumental corrigibility, as Paul notes, is significantly less likely to be robust. Mechanistically, we can think of this lack of robustness as coming from the fact that “pointing” to the base objective is a pretty unstable operation: if you point even a little bit incorrectly, you’ll end up with some sort of corrigible pseudo-alignment rather than corrigible robust alignment.

We can make this model more act-based, and at least somewhat mitigate this robustness problem, if we imagine pointing to only the human’s short-term preferences. The hope for this sort of setup is that, as long as the initial pointer is “good enough,” there will be pressure for the mesa-optimizer to improve its pointer in the way that its current understanding of short-term human preferences recommends, which is exactly Paul’s “broad basin” of corrigibility argument. For this to work, however, it must not be that hard to find a model that has a notion of the human’s short-term preferences (as opposed to their long-term preferences) and that is also willing to correct that notion based on feedback.

In particular, it needs to be the case that it is not that hard to find an agent which will correct mistakes in its own prior over what the human’s short-term preferences are. From a naive Bayesian perspective, this seems unlikely, as it seems strange for an agent to be incentivized to change its own prior. However, this is actually a very natural state for an agent to be in: if I trust your beliefs about X more than I trust my own, then I would endorse a modification of my prior to match yours. In the context of act-based corrigibility, we can think about this from a mechanistic perspective as having a pre-prior that encodes a belief that the human prior over human short-term preferences is to be preferred. Furthermore, pre-priors are generally epistemically valuable for agents to have, as a pre-prior can encourage an agent to correct its own cognitive biases. Agents with pre-priors should therefore be incentivized by most training processes, and so shouldn’t be too difficult to find.
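One very simple way to picture a pre-prior is as a trust weight that says how far the agent should move its own prior toward the human's. The sketch below assumes a discrete set of hypotheses about the human's short-term preferences and a linear mixture rule; the `trust` parameter, the hypothesis names, and the mixture rule itself are illustrative assumptions, not something the argument above commits to.

```python
def defer_to_human(agent_prior, human_prior, trust):
    """Revise the agent's prior toward the human's prior.

    `trust` plays the role of the pre-prior (hypothetical): 0.0 keeps the
    agent's own prior, 1.0 adopts the human's prior outright.
    """
    if not 0.0 <= trust <= 1.0:
        raise ValueError("trust must be in [0, 1]")
    hypotheses = set(agent_prior) | set(human_prior)
    mixed = {
        h: trust * human_prior.get(h, 0.0)
        + (1.0 - trust) * agent_prior.get(h, 0.0)
        for h in hypotheses
    }
    total = sum(mixed.values())
    # Renormalize in case the inputs were not exactly normalized.
    return {h: p / total for h, p in mixed.items()}


# Made-up priors over what the human wants right now.
agent_prior = {"wants_coffee": 0.9, "wants_tea": 0.1}
human_prior = {"wants_coffee": 0.2, "wants_tea": 0.8}

revised = defer_to_human(agent_prior, human_prior, trust=0.75)
```

With a high trust weight, the revised prior ends up much closer to the human's than to the agent's original one, which is the sense in which the agent endorses modifying its own prior.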

In­differ­ence Corrigibility

In­stru­men­tal and act-based cor­rigi­bil­ity are not the only forms of cor­rigi­bil­ity that have been dis­cussed in the liter­a­ture, how­ever: there’s also in­differ­ence cor­rigi­bil­ity, wherein the agent is in­differ­ent to mod­ifi­ca­tions the hu­man might make to it such as shut­ting it off. While this form of cor­rigi­bil­ity doesn’t in and of it­self guaran­tee ac­cept­abil­ity, if you had a way of ver­ify­ing in­differ­ence cor­rigi­bil­ity, there might be ways to turn that back into act-based cor­rigi­bil­ity.

First, how­ever, if we want to be able to ver­ify in­differ­ence cor­rigi­bil­ity, we are go­ing to need a mechanis­tic un­der­stand­ing of it. Let’s sup­pose we ac­cept the ar­gu­ment that a com­pet­i­tive model will likely be a mesa-op­ti­mizer such that it will be run­ning some sort of op­ti­miza­tion pro­cess cou­pled with some sort of mesa-ob­jec­tive. Fur­ther­more, let’s sup­pose that its op­ti­miza­tion pro­cess is com­puted over some sort of world model, which I think is a rea­son­able as­sump­tion for any com­pet­i­tive mesa-op­ti­mizer that wants to be able to take ac­tions in the world. In such a case, we can think of in­differ­ence cor­rigi­bil­ity as a form of stop gra­di­ent in the mesa-op­ti­mizer’s world model which its op­ti­miza­tion pro­cess doesn’t op­ti­mize through,[2] which is a highly mechanis­tic prop­erty that we might ex­pect an am­plified over­seer to be able to ver­ify.

Again, how­ever, we need agents with stop gra­di­ents to form a nat­u­ral enough class such that they are not that hard to find. While it might naively seem as if agents with stop gra­di­ents would be very rare, in fact the ex­act op­po­site might be the case. Con­sider an agent such as the mesa-op­ti­mizer de­scribed pre­vi­ously which is run­ning an op­ti­miza­tion pro­cess over some world model, but which has prefer­ences over the state of the world rather than the state of its world model. Such an agent will have an in­cen­tive to keep its world model al­igned with the world, but only in one di­rec­tion: it will want to mod­ify its world model to match the world, but not mod­ify the world to match its world model. This is a type of stop gra­di­ent, as it is a re­stric­tion on the agent to only min­i­mize the differ­ence be­tween its world model and the world through one path­way but not through an­other. This ex­am­ple seems to sug­gest that stop gra­di­ents aren’t that un­com­mon among agents such that it might be fairly achiev­able to find such an agent.
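The one-directional pressure described above can be sketched numerically. The example below assumes the world and the world model are each a single number and that the agent reduces their squared difference by gradient steps; the functions and constants are hypothetical. In an autodiff framework the same effect is usually obtained with a stop-gradient operator (for example `jax.lax.stop_gradient`), but the gradients are written out by hand here to keep the sketch dependency-free.

```python
def grad_wrt_model(model, world):
    """d/d(model) of (model - world) ** 2."""
    return 2.0 * (model - world)


def grad_wrt_world(model, world):
    """d/d(world) of (model - world) ** 2."""
    return -2.0 * (model - world)


def step(model, world, lr=0.1, stop_gradient_on_world=True):
    """One update that shrinks the model/world mismatch.

    With the stop gradient, the world is treated as a constant: the
    mismatch is only ever reduced by changing the model.
    """
    new_model = model - lr * grad_wrt_model(model, world)
    if stop_gradient_on_world:
        new_world = world
    else:
        new_world = world - lr * grad_wrt_world(model, world)
    return new_model, new_world


model, world = 0.0, 1.0
for _ in range(50):
    model, world = step(model, world)

# Same starting point, but optimizing through both pathways.
free_model, free_world = 0.0, 1.0
for _ in range(50):
    free_model, free_world = step(
        free_model, free_world, stop_gradient_on_world=False
    )
```

With the stop gradient in place, the model converges to the world while the world stays where it was. Without it, the optimizer also pushes the world toward the model, and in this symmetric toy case the two meet in the middle.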

Even if we can train and verify indifference corrigibility, however, one of the big remaining problems with such an approach is that an indifference corrigible mesa-optimizer might do search in such a way that it finds a policy which is not itself indifference corrigible. We can think about this problem as arising from the fact that indifference corrigibility doesn’t tile, though I more often think of it as the problem of “forwarding the guarantee,” wherein even if we know how to do safe search on the outside, we have to make sure that any search inside of our model is similarly safe as well. This is a particularly interesting type of inner alignment problem in that, rather than being about how to ensure that a mesa-optimizer’s objective is aligned, it is about how to ensure that a mesa-optimizer’s search is safe even given that its objective is aligned. However, it seems plausible that this sort of problem could be resolved by ensuring that the model has a meta-preference towards any policies it produces also respecting the same stop gradient. In particular, the overseer could verify that any search over policies done by the model enforces the constraint that every policy have such a stop gradient.

Even once we can ver­ify that our model is in­differ­ence cor­rigible and that it will for­ward that guaran­tee to any other search it might perform, how­ever, there is still the ques­tion of how we might be able to use such a mechanism to pro­duce a safe sys­tem. One way in which in­differ­ence cor­rigi­bil­ity could be used to pro­duce safety is to en­force that our model be­have my­opi­cally. We can think of a my­opic agent as one that only con­sid­ers how best to an­swer the sin­gle ques­tion that you give to it rather than con­sid­er­ing any sort of long-term con­se­quences, which can be im­ple­mented as a stop gra­di­ent pre­vent­ing any op­ti­miza­tion out­side of that do­main. While my­opia on its own is still in­suffi­cient to guaran­tee ac­cept­abil­ity, it seems like it would at least pre­vent de­cep­tive al­ign­ment, as one of the con­di­tions for de­cep­tive al­ign­ment is that the mesa-op­ti­mizer must have some­thing to gain from co­op­er­at­ing now and then defect­ing later, which is not true for a my­opic agent. Thus, if di­rected at a task which we are con­fi­dent is outer al­igned, such as pure su­per­vised am­plifi­ca­tion (train­ing a model to ap­prox­i­mate a hu­man con­sult­ing that model), and com­bined with a scheme for pre­vent­ing stan­dard pseudo-al­ign­ment (such as ad­ver­sar­ial train­ing), my­opia ver­ifi­ca­tion might be suffi­cient to re­solve the rest of the in­ner al­ign­ment prob­lem by pre­vent­ing de­cep­tive al­ign­ment.
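The claim that a myopic agent has nothing to gain from cooperating now and defecting later can be made concrete with a toy calculation. The sketch below assumes the agent scores a plan by summing its own per-step payoffs over a fixed horizon; the payoff numbers and plan names are made up for illustration.

```python
def plan_value(rewards, horizon):
    """Total payoff the agent counts: only the first `horizon` steps."""
    return sum(rewards[:horizon])


# Hypothetical per-step payoffs, as measured by the agent's own objective.
HONEST = [1.0, 1.0, 1.0]
COOPERATE_THEN_DEFECT = [1.0, 1.0, 10.0]


def gain_from_deception(horizon):
    """How much better the deceptive plan looks at a given horizon."""
    return plan_value(COOPERATE_THEN_DEFECT, horizon) - plan_value(
        HONEST, horizon
    )


myopic_gain = gain_from_deception(horizon=1)      # only the current step counts
farsighted_gain = gain_from_deception(horizon=3)  # later steps count too
```

In this toy setup the deceptive plan is strictly better for the agent that looks three steps ahead, while the myopic agent sees no difference between the two plans, so the later defection gives it no reason to behave differently now.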


If we want to be able to do relaxed adversarial training to produce safe AI systems, we are going to need a notion of acceptability which is not that hard to train and verify. Corrigibility seems to be one of the most promising candidates for such an acceptability condition, but for that to work we need a mechanistic understanding of exactly what sort of corrigibility we’re shooting for and how it will ensure safety. I think that both of the paths considered here, act-based corrigibility and indifference corrigibility, look like promising research directions for attacking this problem.

  1. Or at least all models that we can find that satisfy that property.

  2. Thanks to Scott Garrabrant for the stop gradient analogy.