Clarifying “AI Alignment”

When I say an AI A is aligned with an operator H, I mean:

A is trying to do what H wants it to do.

The “alignment problem” is the problem of building powerful AI systems that are aligned with their operators.

This is significantly narrower than some other definitions of the alignment problem, so it seems important to clarify what I mean.

In particular, this is the problem of getting your AI to try to do the right thing, not the problem of figuring out which thing is right. An aligned AI would try to figure out which thing is right, and like a human it may or may not succeed.

Analogy


Consider a human assistant who is trying their hardest to do what H wants.

I’d say this assistant is aligned with H. If we build an AI that has an analogous relationship to H, then I’d say we’ve solved the alignment problem.

“Aligned” doesn’t mean “perfect”:

  • They could misunderstand an instruction, or be wrong about what H wants at a particular moment in time.

  • They may not know everything about the world, and so fail to recognize that an action has a particular bad side effect.

  • They may not know everything about H’s preferences, and so fail to recognize that a particular side effect is bad.

  • They may build an unaligned AI (while attempting to build an aligned AI).

I use alignment as a statement about the motives of the assistant, not about their knowledge or ability. Improving their knowledge or ability will make them a better assistant (for example, an assistant who knows everything there is to know about H is less likely to be mistaken about what H wants), but it won’t make them more aligned.

(For very low capabilities it becomes hard to talk about alignment. For example, if the assistant can’t recognize or communicate with H, it may not be meaningful to ask whether they are aligned with H.)

Clarifications


  • The definition is intended de dicto rather than de re. An aligned A is trying to “do what H wants it to do.” Suppose A thinks that H likes apples, and so goes to the store to buy some apples, but H really prefers oranges. I’d call this behavior aligned because A is trying to do what H wants, even though the thing it is trying to do (“buy apples”) turns out not to be what H wants: the de re interpretation is false but the de dicto interpretation is true.

  • An aligned AI can make errors, including moral or psychological errors, and fixing those errors isn’t part of my definition of alignment except insofar as it’s part of getting the AI to “try to do what H wants” de dicto. This is a critical difference between my definition and some other common definitions. I think that using a broader definition (or the de re reading) would also be defensible, but I like it less because it includes many subproblems that I think (a) are much less urgent, and (b) are likely to involve totally different techniques than the urgent part of alignment.

  • An aligned AI would also be trying to do what H wants with respect to clarifying H’s preferences. For example, it should decide whether to ask if H prefers apples or oranges, based on its best guesses about how important the decision is to H, how confident it is in its current guess, how annoying it would be to ask, etc. Of course, it may also make a mistake at the meta level: for example, it may not understand when it is OK to interrupt H, and therefore avoid asking questions that it would have been better to ask.

  • This definition of “alignment” is extremely imprecise. I expect it to correspond to some more precise concept that cleaves reality at the joints. But that might not become clear, one way or the other, until we’ve made significant progress.

  • One reason the definition is imprecise is that it’s unclear how to apply the concepts of “intention,” “incentive,” or “motive” to an AI system. One naive approach would be to equate the incentives of an ML system with the objective it was optimized for, but this seems to be a mistake. For example, humans are optimized for reproductive fitness, but it is wrong to say that a human is incentivized to maximize reproductive fitness.

  • “What H wants” is even more problematic than “trying.” Clarifying what this expression means, and how to operationalize it in a way that could be used to inform an AI’s behavior, is part of the alignment problem. Without additional clarity on this concept, we will not be able to build an AI that tries to do what H wants it to do.
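
To make the clarification about asking questions more concrete, here is a toy sketch of the trade-off it describes: how important the decision is to H, how confident the assistant is in its current guess, and how annoying it would be to ask. Everything in it is an assumption made for illustration: the name `should_ask`, its three parameters, and the simple expected-loss comparison are hypothetical, and it assumes that asking fully resolves the uncertainty. It is not a proposal for how an aligned AI should actually decide.

```python
def should_ask(p_guess_correct: float, importance: float, ask_cost: float) -> bool:
    """Toy rule: ask H only if the expected loss from acting on the current
    guess exceeds the estimated annoyance of interrupting H.

    All three inputs are the assistant's own best guesses (hypothetical
    quantities), so the rule itself can be mistaken at the meta level.
    Assumes that asking H fully resolves the uncertainty.
    """
    if not 0.0 <= p_guess_correct <= 1.0:
        raise ValueError("p_guess_correct must be a probability in [0, 1]")
    # Expected loss if the assistant stays silent and acts on its guess.
    expected_loss_if_silent = (1.0 - p_guess_correct) * importance
    return expected_loss_if_silent > ask_cost


# Apples vs. oranges: a fairly confident guess about a low-stakes choice.
print(should_ask(p_guess_correct=0.9, importance=1.0, ask_cost=0.5))   # False
# Same confidence, but the decision matters much more to H.
print(should_ask(p_guess_correct=0.9, importance=10.0, ask_cost=0.5))  # True
```

Note that an assistant that overestimates `ask_cost` would stay silent when it should have asked. In the terms used above, that is a mistake about H, not a failure of alignment, as long as the assistant is still trying to do what H wants.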

Postscript on terminological history

I originally described this problem as part of “the AI control problem,” following Nick Bostrom’s usage in Superintelligence, and used “the alignment problem” to mean “understanding how to build AI systems that share human preferences/values” (which would include efforts to clarify human preferences/values).

I adopted the new terminology after some people expressed concern with “the control problem.” There is also a slight difference in meaning: the control problem is about coping with the possibility that an AI would have different preferences from its operator. Alignment is a particular approach to that problem, namely avoiding the preference divergence altogether (so excluding techniques like “put the AI in a really secure box so it can’t cause any trouble”). There currently seems to be a tentative consensus in favor of this approach to the control problem.

I don’t have a strong view about whether “alignment” should refer to this problem or to something different. I do think that some term needs to refer to this problem, to separate it from other problems like “understanding what humans want,” “solving philosophy,” etc.

This post was originally published on 7th April 2018.

The next post in this sequence will be posted on Saturday, and will be “An Unaligned Benchmark” by Paul Christiano.
