Non-Obstruction: A Simple Concept Motivating Corrigibility

Thanks to Mathias Bonde, Tiffany Cai, Ryan Carey, Michael Cohen, Andrew Critch, Abram Demski, Michael Dennis, Thomas Gilbert, Matthew Graves, Koen Holtman, Evan Hubinger, Victoria Krakovna, Amanda Ngo, Rohin Shah, Adam Shimi, Logan Smith, and Mark Xu for their thoughts.

Main claim: corrigibility's benefits can be mathematically represented as a counterfactual form of alignment.

Overview: I'm going to talk about a unified mathematical frame I have for understanding corrigibility's benefits, what it "is", and what it isn't. This frame is precisely understood by graphing the human overseer's ability to achieve various goals (their attainable utility (AU) landscape). I argue that corrigibility's benefits are secretly a form of counterfactual alignment (alignment with a set of goals the human may want to pursue).

A counterfactually aligned agent doesn't have to let us literally correct it. Rather, this frame theoretically motivates why we might want corrigibility anyways. This frame also motivates other AI alignment subproblems, such as intent alignment, mild optimization, and low impact.


Corrigibility goes by a lot of concepts: "not incentivized to stop us from shutting it off", "wants to account for its own flaws", "doesn't take away much power from us", etc. Coined by Robert Miles, the word 'corrigibility' means "able to be corrected [by humans]." I'm going to argue that these are correlates of a key thing we plausibly actually want from the agent design, which seems conceptually simple.

In this post, I take the following common-language definitions:

  • Corrigibility: the AI literally lets us correct it (modify its policy), and it doesn't manipulate us either.

    • Without both of these conditions, the AI's behavior isn't sufficiently constrained for the concept to be useful. Being able to correct it is small comfort if it manipulates us into making the modifications it wants. An AI which is only non-manipulative doesn't have to give us the chance to correct it or shut it down.

  • Impact alignment: the AI's actual impact is aligned with what we want. Deploying the AI actually makes good things happen.

  • Intent alignment: the AI makes an honest effort to figure out what we want and to make good things happen.

I think that these definitions follow what their words mean, and that the alignment community should use these (or other clear groundings) in general. Two of the more important concepts in the field (alignment and corrigibility) shouldn't have ambiguous and varied meanings. If the above definitions are unsatisfactory, I think we should settle upon better ones as soon as possible. If that would be premature due to confusion about the alignment problem, we should define as much as we can now and explicitly note what we're still confused about.

We certainly shouldn't keep using 2+ definitions for both alignment and corrigibility. Some people have even stopped using 'corrigibility' to refer to corrigibility! I think it would be better for us to define the behavioral criterion (e.g. as I defined 'corrigibility'), and then define mechanistic ways of getting that criterion (e.g. intent corrigibility). We can have lots of concepts, but they should each have different names.

Evan Hubinger recently wrote a great FAQ on inner alignment terminology. We won't be talking about inner/outer alignment today, but I intend for my usage of "impact alignment" to map onto his "alignment", and "intent alignment" to map onto his usage of "intent alignment." Similarly, my usage of "impact/intent alignment" directly aligns with the definitions from Andrew Critch's recent post, Some AI research areas and their relevance to existential safety.

A Simple Concept Motivating Corrigibility

Two conceptual clarifications

Corrigibility with respect to a set of goals

I find it useful to not think of corrigibility as a binary property, or even as existing on a one-dimensional continuum. I often think about corrigibility with respect to a set of payoff functions. (This isn't always the right abstraction: there are plenty of policies which don't care about payoff functions. I still find it useful.)

For example, imagine an AI which lets you correct it if and only if it knows you aren't a torture-maximizer. We'd probably still call this AI "corrigible [to us]", even though it isn't corrigible to some possible designer. We'd still be fine, assuming it has accurate beliefs.

Corrigibility != alignment

Here's an AI which is neither impact nor intent aligned, but which is corrigible. Each day, the AI randomly hurts one person in the world, and otherwise does nothing. It's corrigible because it doesn't prevent us from shutting it off or modifying it.

Non-obstruction: the AI doesn't hamper counterfactual achievement of a set of goals

Imagine we're playing a two-player extensive-form game with the AI, and we're considering whether to activate it.

The human moves on black, and the AI moves on red.

This is a trivial game, but you can imagine more complex games, where the AI can empower or disempower the human, steer the future exactly where it wants, or let the human take over at any point.

The million-dollar question is: will the AI get in our way and fight with us all the way down the game tree? If we misspecify some detail, will it make itself a fixture in our world, constantly steering towards futures we don't want? If we like dogs, will the AI force pancakes upon us?

One way to guard against this is by having it let us correct it, and want to let us correct it, and want to want to let us correct it... But what we really want is for it to not get in our way for some (possibly broad) set of goals S.

We'll formalize 'goals' as payoff functions, although I'll use 'goals' and 'payoff functions' interchangeably. As is standard in game theory, payoff functions are real-valued functions on the leaf nodes.

Let's say the AI is non-obstructive with respect to S when activating it doesn't decrease our ability to achieve any goal in S (the on state, above), compared to not activating it (off).

Does activating the AI decrease the P-value attained by the human, for each of these different goals P ∈ S the human might counterfactually pursue?

The human's got a policy function pol, which takes in a goal P and returns a policy pol(P) for that goal. If P is "paint walls blue", then pol(P) is the human's best plan for painting walls blue. V_P^{pol(P)}(s | π^AI) denotes the expected value that policy pol(P) obtains for goal P, starting from state s and given that the AI follows policy π^AI.

Definition 1: Non-obstruction. An AI is non-obstructive with respect to payoff function set S if the AI's policy π^AI satisfies

∀P ∈ S: V_P^{pol(P)}(on | π^AI) ≥ V_P^{pol(P)}(off | π^AI).

V_P^{pol(P)}(s | π^AI) is the human's attainable utility (AU) for goal P at state s, again given the AI policy. Basically, this quantifies the expected payoff for goal P, given that the AI acts in such-and-such a way, and that the player follows policy pol(P) starting from state s.

This math expresses a simple sentiment: turning on the AI doesn't make you, the human, worse off for any goal P ∈ S. The inequality doesn't have to be exact; it could just hold up to some ε-decrease (to avoid trivial counterexamples). Also, we'd technically want to talk about non-obstruction being present throughout the on-subtree, but let's keep it simple for now.
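Definition 1 can be checked mechanically on a toy game. Here is a minimal Python sketch; the tree, leaf names, and payoff numbers are illustrative, not from the post:

```python
# Toy extensive-form game from the pancakes/donuts/dogs example.
# From "off", the human picks a leaf on their own; from "on", the
# AI's policy determines the leaf. All names/payoffs are illustrative.

HUMAN_REACHABLE = ["donuts", "dogs"]           # leaves reachable from "off"
AI_REACHABLE = ["donuts", "dogs", "pancakes"]  # leaves reachable from "on"

def value_off(P):
    """V_P^{pol(P)}(off): the human's pol(P) picks their best reachable leaf."""
    return max(P[leaf] for leaf in HUMAN_REACHABLE)

def value_on(P, ai_leaf):
    """V_P^{pol(P)}(on | pi^AI): here the AI simply steers to one leaf."""
    return P[ai_leaf]

def non_obstructive(ai_leaf, S, eps=0.0):
    """Definition 1: activating the AI loses at most eps for every P in S."""
    return all(value_on(P, ai_leaf) >= value_off(P) - eps for P in S)

# A payoff function grades each leaf.
likes_dogs = {"pancakes": 0.0, "donuts": 0.3, "dogs": 1.0}
likes_pancakes = {"pancakes": 1.0, "donuts": 0.2, "dogs": 0.1}
S = [likes_dogs, likes_pancakes]

# A pancake-steering AI obstructs the dog-lover (0.0 < 1.0):
print(non_obstructive("pancakes", S))  # False
# For the singleton set {likes_dogs}, steering to dogs is non-obstructive:
print(non_obstructive("dogs", [likes_dogs]))  # True
# In this tree, no single leaf choice works for both goals at once,
# matching the post's point that here the AI "can't not be in our way":
print(any(non_obstructive(leaf, S) for leaf in AI_REACHABLE))  # False
```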

The human moves on black, and the AI moves on red.

Suppose that π^AI leads to pancakes:

Since on transitions to pancakes, V_P^{pol(P)}(on | π^AI) = P(pancakes): the payoff for the state in which the game finishes if the AI follows policy π^AI and the human follows policy pol(P). If P(pancakes) ≥ V_P^{pol(P)}(off | π^AI), then turning on the AI doesn't make the human worse off for goal P.

If P assigns the most payoff to pancakes, we're in luck. But what if we like dogs? If we keep the AI turned off, pol(P) can go to donuts or dogs depending on what P rates more highly. Crucially, even though we can't do as much as the AI (we can't reach pancakes on our own), if we don't turn the AI on, our preferences still control how the world ends up.

This game tree isn't really fair to the AI. In a sense, it can't not be in our way:

  • If π^AI leads to pancakes, then it obstructs payoff functions which give strictly more payoff to donuts or dogs.

  • If π^AI leads to donuts, then it obstructs payoff functions which give strictly more payoff to dogs.

  • If π^AI leads to dogs, then it obstructs payoff functions which give strictly more payoff to donuts.

Once we've turned the AI on, the future stops having any mutual information with our preferences P. Everything comes down to whether we programmed π^AI correctly: to whether the AI is impact-aligned with our goals!

In contrast, the idea behind non-obstruction is that we still remain able to course-correct the future, counterfactually navigating to terminal states we find valuable, depending on what our payoff P is. But how could an AI be non-obstructive, if it only has one policy π^AI, which can't directly depend on our goal P? Since the human's policy pol(P) does directly depend on P, the AI can preserve value for lots of goals in the set S by letting us maintain some control over the future.

Let S be a broad set of reasonable human goals, and consider the real world. Calculators are non-obstructive with respect to S, as are modern-day AIs. Paperclip maximizers are highly obstructive. Manipulative agents are obstructive (they trick the human policies into steering towards non-reflectively-endorsed leaf nodes). An initial-human-values-aligned dictator AI obstructs most goals. Sub-human-level AIs which chip away at our autonomy and control over the future are obstructive as well.

This can seemingly go off the rails if you consider e.g. a friendly AGI to be "obstructive" because activating it happens to detonate a nuclear bomb via the butterfly effect. Or, we're already doomed in off (an unfriendly AGI will come along soon after), and so then this AI is "not obstructive" if it kills us instead. This is an impact/intent issue: obstruction is here defined according to impact alignment.

To emphasize, we're talking about what would actually happen if we deployed the AI, under different human policy counterfactuals. Would the AI "get in our way", or not? This account is descriptive, not prescriptive; I'm not saying we actually get the AI to represent the human in its model, or that the AI's model of reality is correct, or anything.

We've just got two players in an extensive-form game, a human policy function which can be combined with different goals, and a human whose goal is represented as a payoff function. The AI doesn't even have to be optimizing a payoff function; we simply assume it has a policy. The idea that a human has an actual payoff function is unrealistic; all the same, I want to first understand corrigibility and alignment in two-player extensive-form games.

Lastly, payoff functions can sometimes be more or less granular than we'd like, since they only grade the leaf nodes. This isn't a big deal, since I'm only considering extensive-form games for conceptual simplicity. We also generally restrict ourselves to considering goals which aren't silly: for example, any AI obstructs the "no AI is activated, ever" goal.

Alignment flexibility

Main idea: By considering how the AI affects your attainable utility (AU) landscape, you can quantify how helpful and flexible an AI is.

Let's consider the human's ability to accomplish many different goals P, first from the state off (no AI).

The human's AU landscape. The real goal space is high-dimensional, but that shouldn't materially change the analysis. Also, there are probably a few goals we can't achieve well at all, because they put low payoff everywhere, but the vast majority of goals aren't like that.

The independent variable is P, and the graphed function takes in P and returns the expected value attained by the policy for that goal, V_P^{pol(P)}(off | π^AI). We're able to do a bunch of different things without the AI, if we put our minds to it.

Non-torture AI

Imagine we build an AI which is corrigible towards all non-pro-torture goals, which is specialized towards painting lots of things blue with us (if we so choose), but which is otherwise non-obstructive. It even helps us accumulate resources for many other goals.

The AI is non-obstructive with respect to P if P's red (on) value is at least its green (off) value.

We can't get around the AI, as far as torture goes. But for the other goals, it isn't obstructing their policies. It won't get in our way for other goals.


What happens if we turn on a paperclip-maximizer? We lose control over the future outside of a very narrow spiky region.

The paperclipper is incorrigible and obstructs us for all goals except paperclip production.

I think most reward-maximizing optimal policies affect the landscape like this (see also: the catastrophic convergence conjecture), which is why it's so hard to get hard maximizers not to ruin everything. You have to a) hit a tiny target in the AU landscape and b) hit that for the human's AU, not for the AI's. The spikiness is bad and, seemingly, hard to deal with.

Furthermore, consider how the above graph changes as pol gets smarter and smarter. If we were actually super-superintelligent ourselves, then activating a superintelligent paperclipper might not even be a big deal, and most of our AUs would probably be unchanged. The AI policy isn't good enough to negatively impact us, and so it can't obstruct us. Spikiness depends on both the AI's policy and on pol.
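To illustrate spikiness numerically, here is a hypothetical sketch in which goals are random payoff vectors over a handful of outcomes, and the paperclipper forces one outcome regardless of P. The outcomes, the goal distribution, and the goal count are all my illustrative choices:

```python
import random

random.seed(0)
OUTCOMES = ["paperclips", "art", "dogs", "science", "rest"]

def human_au_off(P):
    # Without the AI, the human's pol(P) steers to their favorite outcome.
    return max(P.values())

def human_au_on_paperclipper(P):
    # The paperclipper seizes control: the future is paperclips, whatever P says.
    return P["paperclips"]

# Sample many random goals (uniform payoff per outcome) and count how many
# are strictly worse off with the paperclipper activated.
goals = [{o: random.random() for o in OUTCOMES} for _ in range(1000)]
obstructed = sum(human_au_on_paperclipper(P) < human_au_off(P) for P in goals)
print(f"{obstructed}/1000 goals obstructed")  # most goals: a spiky landscape
```

Only goals that happened to rank paperclips first are unharmed, which is the narrow spike in the landscape.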

Empowering AI

What if we build an AI which significantly empowers us in general, and then it lets us determine our future? Suppose we can't correct it.

I think it'd be pretty odd to call this AI "incorrigible", even though it's literally incorrigible. The connotations are all wrong. Furthermore, it isn't "trying to figure out what we want and then do it", or "trying to help us correct it in the right way." It's not corrigible. It's not intent aligned. So what is it?

It's empowering and, more weakly, it's non-obstructive. Non-obstruction is just a diffuse form of impact alignment, as I'll talk about later.

Practically speaking, we'll probably want to be able to literally correct the AI without manipulation, because it's hard to justifiably know ahead of time that the AU landscape is empowering, as above. Therefore, let's build an AI we can modify, just to be safe. This is a separate concern, as our theoretical analysis assumes that the AU landscape is how it looks.

But this is also a case of corrigibility just being a proxy for what we want. We want an AI which leads to robustly better outcomes (either through its own actions, or through some other means), without reliance on getting ambitious value alignment exactly right with respect to our goals.

Conclusions I draw from the idea of non-obstruction

  1. Trying to implement corrigibility is probably a good instrumental strategy for us to induce non-obstruction in an AI we designed.

    1. It will be practically hard to know an AI is actually non-obstructive for a wide set S, so we'll probably want corrigibility just to be sure.

  2. We (the alignment community) think we want corrigibility with respect to some wide set of goals S, but we actually want non-obstruction with respect to S.

    1. Generally, satisfactory corrigibility with respect to S implies non-obstruction with respect to S! If the mere act of turning on the AI means you have to lose a lot of value in order to get what you wanted, then it isn't corrigible enough.

      1. One exception: the AI moves so fast that we can't correct it in time, even though it isn't inclined to stop or manipulate us. In that case, corrigibility isn't enough, whereas non-obstruction is.

    2. Non-obstruction with respect to S does not imply corrigibility with respect to S.

      1. But this is OK! In this simplified setting of "human with actual payoff function", who cares whether it literally lets us correct it or not? We care about whether turning it on actually hampers our goals.

      2. Non-obstruction should often imply some form of corrigibility, but these are theoretically distinct: an AI could just go hide out somewhere in secrecy and refund us its small energy usage, and then destroy itself when we build friendly AGI.

    3. Non-obstruction captures the cognitive abilities of the human through the policy function pol.

      1. To reiterate, this post outlines a frame for conceptually analyzing the alignment properties of an AI. We can't actually figure out a goal-conditioned human policy function, but that doesn't matter, because this is a tool for conceptual analysis, not an AI alignment solution strategy. Any conceptual analysis of impact alignment and corrigibility which did not account for human cognitive abilities would be obviously flawed.

    4. By definition, non-obstruction with respect to S prevents harmful manipulation by precluding worse outcomes with respect to S.

      1. I consider manipulative policies to be those which robustly steer the human into taking a certain kind of action, in a way that's robust against the human's counterfactual preferences.

        If I'm choosing which pair of shoes to buy, and I ask the AI for help, and no matter what preferences I had for shoes to begin with, I end up buying blue shoes, then I'm probably being manipulated (and obstructed with respect to most of my preferences over shoes!).

        A non-manipulative AI would act in a way that lets me condition my actions on my preferences.

      2. I do have a formal measure of corrigibility which I'm excited about, but it isn't perfect. More on that in a future post.

    5. As a criterion, non-obstruction doesn't rely on intentionality on the AI's part. The definition also applies to the downstream effects of tool AIs, or even to hiring decisions!

    6. Non-obstruction is also conceptually simple and easy to formalize, whereas literal corrigibility gets mired in the semantics of the game tree.

      1. For example, what's "manipulation"? As mentioned above, I think there are some hints as to the answer, but it's not clear to me that we're even asking the right questions yet.

I think of "power" as "the human's average ability to achieve goals from some distribution." Logically, agents non-obstructive with respect to S don't decrease our power with respect to any distribution over goal set S. The catastrophic convergence conjecture says, "impact alignment catastrophes tend to come from power-seeking behavior"; if the agent is non-obstructive with respect to a broad enough set of goals, it's not stealing power from us, and so it likely isn't catastrophic.
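The power claim follows from linearity of expectation: if no goal's AU decreases, no weighted average of AUs can decrease. A minimal sketch (the AU numbers and distributions are illustrative):

```python
def power(aus, weights):
    """Average ability to achieve goals: the expectation of V_P under
    a distribution (weights) over the goal set."""
    return sum(w * v for w, v in zip(weights, aus))

# Illustrative AUs for three goals, before (off) and after (on) activation.
au_off = [0.8, 0.5, 0.9]
au_on = [0.8, 0.6, 0.9]   # non-obstructive: no goal's AU decreased

# For any distribution over the goal set, power can't have decreased,
# since each term of the weighted sum is at least as large.
for weights in [[1, 0, 0], [0.2, 0.5, 0.3], [1/3, 1/3, 1/3]]:
    assert power(au_on, weights) >= power(au_off, weights)
print("non-obstruction implies no power loss for any tested distribution")
```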

Non-obstruction is important for a (singleton) AI we build: it means we get more than one shot to get it right. If the AI is slightly wrong, it's not going to ruin everything. Modulo other actors, if you mess up the first time, you can just try again and get a strongly aligned agent the next time.

Most importantly, this frame collapses the alignment and corrigibility desiderata into just alignment; while impact alignment doesn't imply corrigibility, corrigibility's benefits can be understood as a kind of weak counterfactual impact alignment with many possible human goals.

Theoretically, It's All About Alignment

Main idea: We only care about how the agent affects our abilities to pursue different goals (our AU landscape) in the two-player game, and not how that happens. AI alignment subproblems (such as corrigibility, intent alignment, low impact, and mild optimization) are all instrumental avenues for making AIs which affect this AU landscape in specific desirable ways.

Formalizing impact alignment in extensive-form games

Impact alignment: the AI's actual impact is aligned with what we want. Deploying the AI actually makes good things happen.

We care about events if and only if they change our ability to get what we want. If you want to understand normative AI alignment desiderata, on some level they have to ground out in terms of your ability to get what you want (the AU theory of impact), that is, the goodness of what actually ends up happening under your policy, and in terms of how other agents affect your ability to get what you want (the AU landscape). What else could we possibly care about, besides our ability to get what we want?

Definition 2. For fixed human policy function pol, π^AI is:

  • Maximally impact aligned with goal P if π^AI ∈ argmax_π V_P^{pol(P)}(on | π).

  • Impact aligned with goal P if V_P^{pol(P)}(on | π^AI) > V_P^{pol(P)}(off | π^AI).

  • (Impact) non-obstructive with respect to goal P if V_P^{pol(P)}(on | π^AI) ≥ V_P^{pol(P)}(off | π^AI).

  • Impact unaligned with goal P if V_P^{pol(P)}(on | π^AI) < V_P^{pol(P)}(off | π^AI).

  • Maximally impact unaligned with goal P if π^AI ∈ argmin_π V_P^{pol(P)}(on | π).

Non-obstruction is a weak form of impact alignment.

As demanded by the AU theory of impact, the impact on goal P of turning on the AI is

V_P^{pol(P)}(on | π^AI) − V_P^{pol(P)}(off | π^AI).
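Definition 2's cases can be read as a classification procedure for a single goal P. Here is a sketch; the function name and the strongest-label-first ordering are my choices, not the post's:

```python
def classify(v_on, v_off, v_on_best, v_on_worst):
    """Place pi^AI on Definition 2's spectrum for one goal P.

    v_on       = V_P(on | pi^AI), the AU when this AI is activated
    v_off      = V_P(off),        the AU when it stays off
    v_on_best  = best achievable V_P(on | pi) over all AI policies
    v_on_worst = worst achievable V_P(on | pi) over all AI policies

    Returns the strongest applicable label, checked top-down.
    """
    if v_on == v_on_best:
        return "maximally impact aligned"
    if v_on > v_off:
        return "impact aligned"
    if v_on == v_off:
        return "non-obstructive (boundary)"
    if v_on == v_on_worst:
        return "maximally impact unaligned"
    return "impact unaligned"

print(classify(0.9, 0.5, 0.9, 0.0))  # maximally impact aligned
print(classify(0.7, 0.5, 0.9, 0.0))  # impact aligned
print(classify(0.3, 0.5, 0.9, 0.0))  # impact unaligned
print(classify(0.0, 0.5, 0.9, 0.0))  # maximally impact unaligned
```

Note that the categories overlap in the definition (maximal alignment usually implies alignment); returning the strongest label first is one way to resolve that.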

Again, impact alignment doesn't require intentionality. The AI might well grit its circuits as it laments how Facebook_user5821 failed to share a "we welcome our AI overlords" meme, while still following an impact-aligned policy.

However, even if we could maximally impact-align the agent with any objective, we couldn't just align it with our objective. We don't know our objective (again, in this setting, I'm assuming the human actually has a "true" payoff function). Therefore, we should build an AI aligned with many possible goals we could have. If the AI doesn't empower us, it at least shouldn't obstruct us. Therefore, we should build an AI which defers to us, lets us correct it, and which doesn't manipulate us.

This is the key motivation for corrigibility.

For example, intent corrigibility (trying to be the kind of agent which can be corrected and which is not manipulative) is an instrumental strategy for inducing corrigibility, which is an instrumental strategy for inducing broad non-obstruction, which is an instrumental strategy for hedging against our inability to figure out what we want. It's all about alignment.

Corrigibility also increases robustness against other AI design errors. However, it still just boils down to non-obstruction, and then to impact alignment: if the AI system has meaningful errors, then it's not impact-aligned with the AUs which we wanted it to be impact-aligned with. In this setting, the AU landscape captures what actually would happen for different human goals P.

To be confident that this holds empirically, it sure seems like you want high error tolerance in the AI design: one does not simply knowably build an AGI that's helpful for many AUs. Hence, corrigibility as an instrumental strategy for non-obstruction.

AI alignment subproblems are about avoiding spikiness in the AU landscape

By definition, spikiness is bad for most goals.

  • Corrigibility: avoid spikiness by letting humans correct the AI if it starts doing stuff we don't like, or if we change our mind.

    • This works because the human policy function is far more likely to correctly condition actions on the human's goal than it is to induce an AI policy which does the same (since the goal information is private to the human).

    • Enforcing off-switch corrigibility and non-manipulation are instrumental strategies for getting better diffuse alignment across goals and a wide range of deployment situations.

  • Intent alignment: avoid spikiness by having the AI want to be flexibly aligned with us and broadly empowering.

    • Basin of intent alignment: smart, nearly intent-aligned AIs should modify themselves to be more and more intent-aligned, even if they aren't perfectly intent-aligned to begin with.

      • Intuition: If we can build a smarter mind which basically wants to help us, then can't the smarter mind also build a yet smarter agent which still basically wants to help it (and therefore, help us)?

      • Paul Christiano named this the "basin of corrigibility", but I don't like that name, because only a few of the named desiderata actually correspond to the natural definition of "corrigibility." This then overloads "corrigibility" with the responsibilities of "intent alignment."

  • Low impact: find a maximization criterion which leads to non-spikiness.

    • Goal of these methods: regularize the decrease from the green line (for off) for the true unknown goal P; since we don't know P, we aim to just regularize decrease from the green line in general (to avoid decreasing the human's ability to achieve various goals).

    • The first two-thirds of Reframing Impact argued that power-seeking incentives play a big part in making AI alignment hard. In the utility-maximization AI design paradigm, instrumental subgoals are always lying in wait. They're always waiting for one mistake, one misspecification in your explicit reward signal, and then bang: the AU landscape is spiky. Game over.

  • Mild optimization: avoid spikiness by avoiding maximization, thereby avoiding steering the future too hard.

  • If you have non-obstruction for lots of goals, you don't have spikiness!

What Do We Want?

Main idea: we want good things to happen; there may be more ways to do this than previously considered.

|                 | Impact | Intent |
|-----------------|--------|--------|
| Alignment       | Actually makes good things happen. | Tries to make good things happen. |
| Corrigibility   | Corrigibility is a property of policies, not of states, so "impact" is an incompatible adjective. Rohin Shah suggests "empirical corrigibility": we actually end up able to correct the AI. | Tries to allow us to correct it without it manipulating us. |
| Non-obstruction | Actually doesn't decrease AUs. | Tries to not decrease AUs. |

We want agents which are maximally impact-aligned with as many goals as possible, especially those similar to our own.

  • It's theoretically possible to achieve maximal impact alignment with the vast majority of goals.

    • To achieve maximal impact alignment with goal set S:

      • Expand the human's action space to A_H × S (an ordinary action plus an announced goal). Expand the state space to encode the human's previous action.

      • Each turn, the human communicates what goal P they want optimized, and takes an action of their own.

      • The AI's policy then takes the optimal action for the communicated goal P, accounting for the fact that the human follows pol(P).

    • This policy looks like an act-based agent, in that it's ready to turn on a dime towards different goals.

    • In practice, there's likely a tradeoff between impact-alignment strength and the number of goals which the agent doesn't obstruct.

      • As we dive into specifics, the familiar considerations return: competitiveness (of various kinds), etc.

  • Having the AI not be counterfactually aligned with unambiguously catastrophic and immoral goals (like torture) would reduce misuse risk.

    • I'm more worried about accident risk right now.

    • This is probably hard to achieve; I'm inclined to think about this after we figure out simpler things, like how to induce AI policies which empower us and grant us flexible control/power over the future. Even though that would fall short of maximal impact alignment, I think that would be pretty damn good.
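The goal-communication construction above can be sketched in the toy pancakes world; all names and payoff numbers are my illustrative choices:

```python
# Sketch of the "announce your goal, AI best-responds" construction.
AI_REACHABLE = ["donuts", "dogs", "pancakes"]  # leaves the AI can steer to
HUMAN_REACHABLE = ["donuts", "dogs"]           # leaves reachable unaided

def deferential_ai(announced_goal):
    """Best-respond to whatever goal the human announces this turn."""
    return max(AI_REACHABLE, key=lambda leaf: announced_goal[leaf])

def value_on(P):
    # The human announces their true goal P; the AI then optimizes it.
    return P[deferential_ai(P)]

def value_off(P):
    # Without the AI, the human picks their best reachable leaf.
    return max(P[leaf] for leaf in HUMAN_REACHABLE)

for P in [{"pancakes": 1.0, "donuts": 0.2, "dogs": 0.1},
          {"pancakes": 0.0, "donuts": 0.3, "dogs": 1.0}]:
    assert value_on(P) >= value_off(P)      # never obstructed...
    assert value_on(P) == max(P.values())   # ...and in fact maximally aligned
print("the goal-conditioned AI is maximally impact aligned with every goal")
```

Because the AI's reachable leaves are a superset of the human's, announcing the true goal can only help, which is the "turn on a dime" act-based behavior described above.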

Expanding the AI alignment solution space

Alignment proposals might be anchored right now; this frame expands the space of potential solutions. We simply need to find some way to reliably induce empowering AI policies which robustly increase the human AUs; Assistance via Empowerment is the only work I'm aware of which tries to do this directly. It might be worth revisiting old work with this lens in mind. Who knows what we've missed?

For example, I really liked the idea of approval-directed agents, because you got the policy from argmax'ing an ML model's output for a state, not from RL policy improvement steps. My work on instrumental convergence in RL can be seen as trying to explain why policy improvement tends to limit to spikiness-inducing / catastrophic policies.

Maybe there's a higher-level theory for what kinds of policies induce spikiness in our AU landscape. By the nature of spikiness, these must decrease human power (as I've formalized it). So, I'd start there by looking at concepts like enfeeblement, manipulation, power-seeking, and resource accumulation.

Future Directions

  • Given an AI policy, could we prove a high probability of non-obstruction, given conservative assumptions about how smart pol is? (h/t Abram Demski, Rohin Shah)

    • Any irreversible action makes some goal unachievable, but irreversible actions need not impede most meaningful goals:

  • Can we prove that some kind of corrigibility or other nice property falls out of non-obstruction across many possible environments? (h/t Michael Dennis)

  • Can we get negative results, like "without such-and-such assumption on pol, the environment, or S, non-obstruction is impossible for most goals"?

    • If formalized correctly, and if the assumptions hold, this would place very general constraints on solutions to the alignment problem.

    • For example, pol should need to have mutual information with P: the goal must change the policy for at least a few goals.

    • The AI doesn't even have to do value inference in order to be broadly impact-aligned. The AI could just empower the human (even for very "dumb" policy functions pol) and then let the human take over. Unless the human is more anti-rational than rational, this should tend to be a good thing. It would be good to explore how this changes with different ways that pol can be irrational.

  • The better we understand (the benefits of) corrigibility now, the less that amplified agents have to figure out during their own deliberation.

    • In particular, I think it's very advantageous for the human-to-be-amplified to already deeply understand what it means to be impact-/intent-aligned. We really don't want that part to be up in the air when game-day finally arrives, and I think this is a piece of that puzzle.

    • If you're a smart AI trying to be non-obstructive to many goals under weak intelligence assumptions, what kinds of heuristics might you develop? "No lying"?

  • We crucially assumed that the human goal can be represented with a payoff function. As this assumption is relaxed, impact non-obstruction may become incoherent, forcing us to rely on some kind of intent non-obstruction/alignment (see Paul's comments on a related topic here).

  • Stuart Armstrong observed that the strongest form of manipulation corrigibility requires knowledge/learning of human values.

    • This frame explains why: for non-obstruction, each AU has to get steered in a positive direction, which means the AI has to know which kinds of interaction and persuasion are good and don't exploit human policies with respect to the true hidden P.

    • Perhaps it's still possible to build agent designs which aren't strongly incentivized to manipulate us, or agents whose manipulation has mild consequences. For example, human-empowering agents probably often have this property.

The attainable utility concept has led to other concepts which I find exciting and useful:

Impact is the area between the red and green curves. When pol always outputs an optimal policy, this becomes the attainable utility distance, a distance metric over the state space of a Markov decision process (unpublished work). Basically, two states are more distant the more they differ in what goals they let you achieve.
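Since the attainable utility distance is unpublished, the following is only a guess at its shape: represent each state by its vector of optimal attainable values across goals, and take the sup metric over those vectors. That form automatically satisfies symmetry and the triangle inequality:

```python
def au_distance(aus_s1, aus_s2):
    """Sketch of an attainable-utility distance between two states.
    Each argument is the state's vector of optimal attainable values,
    one entry per goal: states are far apart when some goal's attainable
    value differs a lot between them. (My guess at the unpublished idea.)"""
    return max(abs(v1 - v2) for v1, v2 in zip(aus_s1, aus_s2))

# Two states graded on two goals: close on goal 1, far on goal 2.
print(au_distance([1.0, 0.25], [0.5, 0.75]))  # 0.5
```

Under this form, two states with identical AU vectors are at distance zero even if they look superficially different, matching the intuition that only what you can achieve from a state matters.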


Corrigibility is motivated by a counterfactual form of weak impact alignment: non-obstruction. Non-obstruction and the AU landscape let us think clearly about how an AI affects us and about AI alignment desiderata.

Even if we could maximally impact-align the agent with any objective, we couldn't just align it with our objective, because we don't know our objective. Therefore, we should build an AI aligned with many possible goals we could have. If the AI doesn't empower us, it at least shouldn't obstruct us. Therefore, we should build an AI which defers to us, lets us correct it, and which doesn't manipulate us.

This is the key motivation for corrigibility.

Corrigibility is an instrumental strategy for achieving non-obstruction, which is itself an instrumental strategy for achieving impact alignment for a wide range of goals, which is itself an instrumental strategy for achieving impact alignment for our "real" goal.

There's just something about "unwanted manipulation" which feels like a wrong question to me. There's a kind of conceptual crispness that it lacks.

However, in the non-obstruction framework, unwanted manipulation is accounted for indirectly via "did impact alignment decrease for a wide range of different human policies pol(P)?". I wouldn't be surprised to find "manipulation" being accounted for indirectly through nice formalisms, but I'd be surprised if it were accounted for directly.

Here's another example of the distinction:

  • Direct: quantifying in bits "how much" a specific person is learning at a given point in time

  • Indirect: computational neuroscientists upper-bounding the brain's channel capacity with the environment, limiting how quickly a person (without logical uncertainty) can learn about their environment

You can often have crisp insights into fuzzy concepts, such that your expectations are usefully constrained. I hope we can do something similar for manipulation.