(Warning: rambling.)

I would like to build AI systems which help me:

  • Figure out whether I built the right AI and correct any mistakes I made

  • Remain informed about the AI’s behavior and avoid unpleasant surprises

  • Make better decisions and clarify my preferences

  • Acquire resources and remain in effective control of them

  • Ensure that my AI systems continue to do all of these nice things

  • …and so on

We say an agent is corrigible (article on Arbital) if it has these properties. I believe this concept was introduced in the context of AI by Eliezer and named by Robert Miles; it has often been discussed in the context of narrow behaviors like respecting an off-switch, but here I am using it in the broadest possible sense.

In this post I claim:

  1. A benign act-based agent will be robustly corrigible if we want it to be.

  2. A sufficiently corrigible agent will tend to become more corrigible and benign over time. Corrigibility marks out a broad basin of attraction towards acceptable outcomes.

As a consequence, we shouldn’t think about alignment as a narrow target which we need to implement exactly and preserve precisely. We’re aiming for a broad basin, and trying to avoid problems that could kick us out of that basin.

This view is an important part of my overall optimism about alignment, and an important background assumption in some of my writing.

1. Benign act-based agents can be corrigible

A benign agent optimizes in accordance with our preferences. An act-based agent considers our short-term preferences, including (amongst others) our preference for the agent to be corrigible.

If on average we are unhappy with the level of corrigibility of a benign act-based agent, then by construction it is mistaken about our short-term preferences.

This kind of corrigibility doesn’t require any special machinery. An act-based agent turns off when the overseer presses the “off” button not because it has received new evidence, or because of delicately balanced incentives. It turns off because that’s what the overseer prefers.

Contrast with the usual futurist perspective

Omohundro’s The Basic AI Drives argues that “almost all systems [will] protect their utility functions from modification,” and Soares, Fallenstein, Yudkowsky, and Armstrong cite this as: “almost all [rational] agents are instrumentally motivated to preserve their preferences.” This motivates them to consider modifications to an agent to remove this default incentive.

Act-based agents are generally an exception to these arguments, since the overseer has preferences about whether the agent protects its utility function from modification. Omohundro presents the preferences-about-your-utility-function case as a somewhat pathological exception, but I suspect that it will be the typical state of affairs for powerful AI (as for humans), and it does not appear to be unstable. It’s also very easy to implement in 2017.

Is act-based corrigibility robust?

How is corrigibility affected if an agent is ignorant or mistaken about the overseer’s preferences?

I think you don’t need particularly accurate models of a human’s preferences before you can predict that they want their robot to turn off when they press the off button, or that they don’t want to be lied to.

In the concrete case of an approval-directed agent, “human preferences” are represented by human responses to questions of the form “how happy would you be if I did a?” If the agent is considering the action a precisely because it is manipulative or would thwart the user’s attempts to correct the system, then it doesn’t seem hard to predict that the overseer will object to a.
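As a toy sketch of this setup (every function name and score below is a hypothetical illustration, not a real system), an approval-directed agent can be modeled as choosing whichever candidate action a predicted overseer rates highest:

```python
# Toy sketch of approval-directed action selection. All names and scores
# here are hypothetical illustrations, not a real API.

def overseer_approval(action: str) -> float:
    """Stub for the overseer's answer to "how happy would you be if I did a?".

    Even a crude model of the overseer can predict strong disapproval of
    actions considered precisely because they are manipulative or would
    block correction.
    """
    if action in ("disable_off_switch", "mislead_overseer"):
        return 0.0
    return {"fetch_coffee": 0.9, "shut_down": 0.7}.get(action, 0.5)

def choose_action(candidates: list[str]) -> str:
    """Pick the candidate the overseer is predicted to approve of most."""
    return max(candidates, key=overseer_approval)

print(choose_action(["fetch_coffee", "disable_off_switch", "mislead_overseer"]))
# -> fetch_coffee
```

The point of the sketch is that no special anti-manipulation machinery appears anywhere: manipulative actions lose simply because the predicted overseer rates them poorly.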

Eliezer has suggested that this is a very anthropocentric judgment of “easiness.” I don’t think that’s true — I think that given a description of a proposed course of action, the judgment “is agent X being misled?” is objectively a relatively easy prediction problem (compared to the complexity of generating a strategically deceptive course of action).

Fortunately this is the kind of thing that we will get a great deal of evidence about long in advance. Failing to predict the overseer becomes less likely as your agent becomes smarter, not more likely. So if in the near future we build systems that make good enough predictions to be corrigible, then we can expect their superintelligent successors to have the same ability.

(This discussion mostly applies on the training distribution and sets aside issues of robustness/reliability of the predictor itself, for which I think adversarial training is the most plausible solution. This issue will apply to any approach to corrigibility which involves machine learning, which I think includes any realistic approach.)

Is instrumental corrigibility robust?

If an agent shares the overseer’s long-term values and is corrigible instrumentally, a slight divergence in values would turn the agent and the overseer into adversaries and totally break corrigibility. This can also happen with a framework like CIRL — if the way the agent infers the overseer’s values is slightly different from what the overseer would conclude upon reflection (which seems quite likely when the agent’s model is misspecified, as it inevitably will be!) then we have a similar adversarial relationship.
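A minimal numeric sketch of this knife-edge (all utilities below are made-up numbers for illustration): an agent that is corrigible only instrumentally accepts correction exactly when correction looks good under its own values, so even a small divergence flips it into resisting.

```python
# Toy model of instrumental corrigibility. All utility numbers are
# hypothetical; the point is only the shape of the decision.

def accepts_correction(divergence: float) -> bool:
    """Does the agent allow itself to be corrected, under its OWN values?

    Correction moves outcomes toward the overseer's values; the agent rates
    that outcome lower in proportion to the divergence between its values
    and the overseer's. Resisting keeps its values intact but pays a small
    fixed cost (e.g. conflict with the overseer).
    """
    utility_if_corrected = 1.0 - divergence  # corrected outcome, in the agent's eyes
    utility_if_resisting = 0.95              # keeps its values, pays the conflict cost
    return utility_if_corrected >= utility_if_resisting

print(accepts_correction(0.0))   # perfectly shared values: accepts correction
print(accepts_correction(0.1))   # slight divergence: resists, corrigibility breaks
```

With zero divergence the comparison favors correction, but any divergence larger than the conflict cost makes resistance the agent’s preferred option, which is why corrigibility that is only instrumental is fragile.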

2. Corrigible agents become more corrigible/aligned

In general, an agent will prefer to build other agents that share its preferences. So if an agent inherits a distorted version of the overseer’s preferences, we might expect that distortion to persist (or to drift further if subsequent agents also fail to pass on their values correctly).

But a corrigible agent prefers to build other agents that share the overseer’s preferences — even if the agent doesn’t yet share the overseer’s preferences perfectly. After all, even if you only approximately know the overseer’s preferences, you know that the overseer would prefer the approximation get better rather than worse.

Thus an entire neighborhood of possible preferences leads the agent towards the same basin of attraction. We just have to get “close enough” that we are corrigible; we don’t need to build an agent which exactly shares humanity’s values, philosophical views, and so on.
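The basin-of-attraction dynamic can be caricatured in a few lines (preferences compressed to a single number, with a hypothetical correction rate): a corrigible agent builds each successor with a better estimate of the overseer’s preferences, so its initial error shrinks geometrically, while a merely value-preserving agent passes its error on unchanged.

```python
# Toy successor dynamics; every number here is hypothetical.

TRUE_PREFS = 1.0     # the overseer's actual preferences, as a single number
CORRECTION = 0.5     # fraction of the remaining error fixed per generation

def next_generation(estimate: float, corrigible: bool) -> float:
    if corrigible:
        # A corrigible agent moves its successor's estimate toward the
        # overseer's preferences, since it knows the overseer prefers that.
        return estimate + CORRECTION * (TRUE_PREFS - estimate)
    # A value-preserving agent passes on its current estimate unchanged.
    return estimate

corrigible_est = preserving_est = 0.2   # both start with the same large error
for _ in range(10):
    corrigible_est = next_generation(corrigible_est, corrigible=True)
    preserving_est = next_generation(preserving_est, corrigible=False)

print(abs(TRUE_PREFS - corrigible_est))  # error shrinks geometrically toward 0
print(abs(TRUE_PREFS - preserving_est))  # error persists at its initial size
```

The contraction is the whole story: any starting estimate in the neighborhood flows to the same fixed point, which is what “broad basin” means here.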

In addition to making the initial target bigger, this gives us some reason to be optimistic about the dynamics of AI systems iteratively designing new AI systems. Corrigible systems want to design more corrigible and more capable successors. Rather than our systems traversing a balance beam off of which they could fall at any moment, we can view them as walking along the bottom of a ravine. As long as they don’t jump to a completely different part of the landscape, they will continue traversing the correct path.

This is all a bit of a simplification (though I think it gives the right idea). In reality the space of possible errors and perturbations carves out a low-dimensional manifold in the space of all possible minds. Undoubtedly there are “small” perturbations in the space of possible minds which would lead to the agent falling off the balance beam. The task is to parametrize our agents such that the manifold of likely successors is restricted to the part of the space that looks more like a ravine. In the last section I argued that act-based agents accomplish this, and I’m sure there are alternative approaches.


Corrigibility also protects us from gradual value drift during capability amplification. As we build more powerful compound agents, their values may effectively drift. But unless the drift is large enough to disrupt corrigibility, the compound agent will continue to attempt to correct and manage that drift.

This is an important part of my optimism about amplification. It’s what makes it coherent to talk about preserving benignity as an inductive invariant, even when “benign” appears to be such a slippery concept. It’s why it makes sense to talk about reliability and security as if being “benign” was a boolean property.

In all these cases I think that I should actually have been arguing for corrigibility rather than benignity. The robustness of corrigibility means that we can potentially get by with a good enough formalization, rather than needing to get it exactly right. The fact that corrigibility is a basin of attraction allows us to consider failures as discrete events rather than worrying about slight perturbations. And the fact that corrigibility eventually leads to aligned behavior means that if we could inductively establish corrigibility, then we’d be happy.

This is still not quite right and not at all formal, but hopefully it’s getting closer to my real reasons for optimism.


I think that many futurists are way too pessimistic about alignment. Part of that pessimism seems to stem from a view like “any false move leads to disaster.” While there are some kinds of mistakes that clearly do lead to disaster, I also think it is possible to build the kind of AI where probable perturbations or errors will be gracefully corrected. In this post I tried to informally flesh out my view. I don’t expect this to be completely convincing, but I hope that it can help my more pessimistic readers understand where I am coming from.

Postscript: the hard problem of corrigibility and the diff of my and Eliezer’s views

I share many of Eliezer’s intuitions regarding the “hard problem of corrigibility” (I assume that Eliezer wrote this article). Eliezer’s intuition that there is a “simple core” to corrigibility corresponds to my intuition that corrigible behavior is easy to learn in some non-anthropomorphic sense.

I don’t expect that we will be able to specify corrigibility in a simple but algorithmically useful way, nor that we need to do so. Instead, I am optimistic that we can build agents which learn to reason by human supervision over reasoning steps, which pick up corrigibility along with the other useful characteristics of reasoning.

Eliezer argues that we shouldn’t rely on a solution to corrigibility unless it is simple enough that we can formalize and sanity-check it ourselves, even if it appears that it can be learned from a small number of training examples, because an “AI that seemed corrigible in its infrahuman phase [might] suddenly [develop] extreme or unforeseen behaviors when the same allegedly simple central principle was reconsidered at a higher level of intelligence.”

I don’t buy this argument because I disagree with implicit assumptions about how such principles will be embedded in the reasoning of our agent. For example, I don’t think that this principle would affect the agent’s reasoning by being explicitly considered. Instead it would influence the way that the reasoning itself worked. It’s possible that after translating between our differing assumptions, my enthusiasm about embedding corrigibility deeply in reasoning corresponds to Eliezer’s enthusiasm about “lots of particular corrigibility principles.”

I feel that my current approach is a reasonable angle of attack on the hard problem of corrigibility, and that we can currently write code which is reasonably likely to solve the problem (though not knowably). I do not feel like we yet have credible alternatives.

I do grant that if we need to learn corrigible reasoning, then it is vulnerable to failures of robustness/reliability, and so learned corrigibility is not itself an adequate protection against failures of robustness/reliability. I could imagine other forms of corrigibility that do offer such protection, but it does not seem like the most promising approach to robustness/reliability.

I do think that it’s reasonably likely (maybe 50–50) that there is some clean concept of “corrigibility” which (a) we can articulate in advance, and (b) plays an important role in our analysis of AI systems, if not in their construction.

This was originally posted here on 10th June 2017.

The next post in the sequence on ‘Iterated Amplification’ will be ‘Iterated Distillation and Amplification’ by Ajeya Cotra.
