Inner Alignment: Explain like I’m 12 Edition

(This is an unofficial explanation of Inner Alignment based on the MIRI paper Risks from Learned Optimization in Advanced Machine Learning Systems (which is almost identical to the LW sequence) and the Future of Life podcast with Evan Hubinger (MIRI/LW). It’s meant for anyone who found the sequence too long/challenging/technical to read.)

Note that bold and italics mean “this is a new term I’m introducing,” whereas underline and italics are used for emphasis.

What is Inner Alignment?

Let’s start with an abridged guide to how Machine Learning works:

  1. Choose a problem

  2. Decide on a space of possible solutions

  3. Find a good solution from that space

If the problem is “find a tool that can look at any image and decide whether or not it contains a cat,” then each conceivable set of rules for answering this question (formally, each function from the set of all pixel grids to the set {yes, no}) defines one solution. We call each such solution a model. The space of possible models is depicted below.

Since that’s all possible models, most of them are utter nonsense.

Pick a random one, and you’re as likely to end up with a car-recognizer as a cat-recognizer – but far more likely with an algorithm that does nothing we can interpret. Note that even the examples I annotated aren’t typical – most models would be more complex while still doing nothing related to cats. Nonetheless, somewhere in there is a model that would do a decent job on our problem. In the above, that’s the one that says, “I look for cats.”

How does ML find such a model? One way that does not work is trying out all of them: the space is far too large to enumerate. Instead, there’s this thing called Stochastic Gradient Descent (SGD). Here’s how it works:
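
To get a feel for why enumeration is hopeless, here is a back-of-the-envelope count (my own toy numbers, not from the paper): even if “images” were just 2×2 grids of black-or-white pixels, every way of assigning yes/no to each possible image would be a distinct model.

```python
# Toy count of "the space of all possible models" (illustrative assumption:
# images are 2x2 binary pixel grids, and a model is ANY function from
# images to {yes, no}).

n_pixels = 2 * 2
n_images = 2 ** n_pixels   # each pixel is black or white -> 16 images
n_models = 2 ** n_images   # each image independently gets yes or no -> 65536

print(n_images, n_models)
```

Already 65,536 models for 2×2 images; for realistic image sizes the count grows double-exponentially, which is why SGD searches locally instead of enumerating.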

SGD begins with some (probably terrible) model and then proceeds in steps. In each step, it switches to another model that is “close” and hopefully a little better. Eventually, it stops and outputs the most recent model. Note that, in the example above, we don’t end up with the perfect cat-recognizer (the red box) but with something close to it – perhaps a model that looks for cats but has some unintended quirks. SGD does not, in general, guarantee optimality.
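
The loop just described can be sketched in a few lines. This is a deliberately crude stand-in (random local search rather than true gradient descent, with a single number standing in for a model), but it has the same shape: start with a probably-terrible model, repeatedly move to a nearby model if it scores better, and stop with something close to, but not exactly at, the optimum.

```python
import random

def loss(w):
    """Lower is better; the best possible 'model' here is w = 3.0."""
    return (w - 3.0) ** 2

random.seed(0)
w = 50.0                                      # some (probably terrible) model
for _ in range(2000):
    neighbor = w + random.uniform(-0.5, 0.5)  # a "close" model
    if loss(neighbor) < loss(w):              # keep it only if it's better
        w = neighbor

# w ends up near 3.0, but generally not exactly at the optimum.
```

Real SGD replaces the random neighbor with a step along the gradient of the loss, which is far more efficient, but the start-somewhere-and-improve-locally structure is the same.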

The speech bubbles where the models explain what they’re doing are annotations for the reader. From the perspective of the programmer, it looks like this:

The programmer has no idea what the models are doing. Each model is just a black box.

A necessary component for SGD is the ability to measure a model’s performance, but this happens while treating the models as black boxes. In the cat example, assume the programmer has a bunch of images that are accurately labeled as “contains cat” and “doesn’t contain cat.” (These images are called the training data, and the setting is called supervised learning.) SGD tests how well each model does on these images and, in each step, chooses one that does better. In other settings, performance might be measured in different ways, but the principle remains the same.

Now, suppose that the images we have happen to include only white cats. In this case, SGD might choose a model implementing the rule “output yes if there is something white and with four legs.” The programmer would not notice anything strange – all she sees is that the model output by SGD does well on the training data.

In this setting, there is thus only a problem if our way of obtaining feedback is flawed. If it is perfect – if the pictures with cats are perfectly representative of what images-with-cats are like, and the pictures without cats are perfectly representative of what images-without-cats are like – then there isn’t an issue. Conversely, if our images-with-cats are non-representative because all the cats are white, the model SGD outputs might not be doing precisely what the programmer wanted. In Machine Learning slang, we would say that the training distribution is different from the distribution in deployment.
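
Here is the white-cat story as a runnable toy (the data and the “white with four legs” rule are my own stand-ins for illustration). The flawed model looks perfect on the training distribution and only falls apart once the deployment distribution differs:

```python
def flawed_model(color, legs):
    # The rule SGD happened to settle on: "white thing with four legs."
    return color == "white" and legs == 4

# Training data: every cat in it happens to be white.
training = [("white", 4, True),    # white cat
            ("white", 4, True),    # another white cat
            ("brown", 4, False),   # dog
            ("green", 0, False)]   # plant

# Deployment data: cats come in other colors, and not every white
# four-legged thing is a cat.
deployment = [("black", 4, True),   # black cat -> model misses it
              ("white", 4, True),   # white cat -> fine
              ("white", 4, False)]  # white dog -> false positive

def accuracy(data):
    hits = sum(flawed_model(color, legs) == is_cat
               for color, legs, is_cat in data)
    return hits / len(data)

print(accuracy(training))    # perfect on the training distribution
print(accuracy(deployment))  # well below 1.0 under distributional shift
```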

Is this Inner Alignment? Not quite. This is about a property called distributional robustness, and it’s a well-known problem in Machine Learning. But it’s close.

To explain Inner Alignment itself, we have to switch to a different setting. Suppose that, instead of trying to classify whether images contain cats, we are trying to train a model that solves mazes. That is, we want an algorithm that, given an arbitrary solvable maze, outputs a route from the Maze Entry to the Maze Exit.

As before, our space of all possible models will consist primarily of nonsense solutions:

(If you don’t know what depth-first search means: as far as mazes are concerned, it’s simply the “always go left” rule.)

Note that the annotation “I perform depth-first search” is meant to suggest that the model contains a formal algorithm that implements depth-first search, and analogously for the other annotations.

As with the previous example, we might apply SGD to this problem. In this case, the feedback mechanism would come from evaluating the model on test mazes. Now, suppose that all of the test mazes have this form,

where the red areas represent doors. That is, all mazes are such that the shortest path leads through all of the red doors, and the exit is itself a red door.

Looking at this, you might hope that SGD finds the “depth-first” model. However, while that model would find the shortest path, it is not the best model. (Note that it first performs depth-first search and then, once it has found the right path, discards dead ends and outputs the shortest path only.) The alternative model with the annotation “perform breadth-first search to find the next red door, repeat forever” would perform better. (Breadth-first means exploring all possible paths in parallel.) Both models always find the shortest path, but the red-door model would find it more quickly. In the maze above, it would save time by finding the path from the first to the second door without wasting time exploring the lower-left part of the maze.

Note that breadth-first search only outperforms depth-first search because it can truncate the fruitless paths after having reached the red door. Otherwise, it wouldn’t know that the bottom-left part is fruitless until much later in the search.
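
To make the two search styles concrete, here is a minimal sketch on a plain grid maze of my own (without the red-door mechanic): depth-first search commits to one direction and backtracks, so the first route it completes can be much longer than necessary, while breadth-first search explores all paths in parallel, so the first route it completes is a shortest one.

```python
from collections import deque

MAZE = ["S....",
        ".###.",
        ".###.",
        "E...."]          # 'S' start, 'E' exit, '#' wall

def find(ch):
    return next((r, c) for r, row in enumerate(MAZE)
                for c, cell in enumerate(row) if cell == ch)

def neighbors(r, c):
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(MAZE) and 0 <= nc < len(MAZE[0]) \
                and MAZE[nr][nc] != "#":
            yield nr, nc

def search(breadth_first):
    start, goal = find("S"), find("E")
    frontier = deque([[start]])
    seen = {start}
    while frontier:
        # BFS extends the oldest partial path, DFS the newest one.
        path = frontier.popleft() if breadth_first else frontier.pop()
        if path[-1] == goal:
            return path
        for nxt in neighbors(*path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(path + [nxt])

bfs = search(breadth_first=True)    # shortest route: straight down the left
dfs = search(breadth_first=False)   # wanders the long right-hand corridor
print(len(bfs), len(dfs))
```

In this maze both searches reach the exit, but BFS’s first completed route is 4 cells while this DFS’s is 12. The post’s red-door model adds one more trick on top of BFS: restarting the search at each red door, so fruitless regions can be discarded early.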

As before, all the programmer will see is that the left model performs better on the training data (the test mazes).

The qualitative difference from the cat-picture example is that, in this case, we can talk about the model as running an optimization process. That is, the breadth-first search model does itself have an objective (go through red doors), and it tries to optimize for that in the sense that it searches for the shortest path that leads there. Similarly, the depth-first model is an optimization process with the objective “find exit of maze.”

This is enough to define Inner Alignment, but to make sure the definition is the same as the one you’ll read elsewhere, let’s first define two new terms.

  • The Base Objective is the objective we use to evaluate models found by SGD. In the first example, it was “classify pictures correctly” (i.e., say “contains cat” if the picture contains a cat and “doesn’t contain cat” otherwise). In the second example, it was “find [a shortest path that solves mazes] as quickly as possible.”

  • In the cases where the model is running an optimization process, we call the model a Mesa Optimizer, and we call its objective the Mesa Objective. (In the maze example, the mesa objective is “find shortest path through maze” for the depth-first model, and “repeatedly find shortest path to the next red door” for the breadth-first model.)

With that said,

Inner Alignment is the problem of aligning the Base Objective with the Mesa Objective.

Some clarifying points:

  • The red-door example is thoroughly contrived and would not happen in practice. It only aims to explain what Inner Alignment is, not why misalignment might be probable.

  • You might wonder what the space of all models looks like. The typical answer is that the possible models are sets of weights for a neural network. The problem exists insofar as some sets of weights implement specific search algorithms.

  • As before, the reason for the inner alignment failure was that our way of obtaining feedback was flawed (in ML language: because there was distributional shift). It is conceivable that misalignment can also arise for other reasons, but those are outside the scope of this post.

  • If the Base Objective and Mesa Objective are misaligned, this causes problems as soon as the model is deployed. In the second example, as soon as we take the model output by SGD and apply it to real mazes, it would still search for red doors. If those mazes don’t contain red doors, or the red doors aren’t always on paths to the exit, the model would perform poorly.

Here is the relevant Venn diagram. (Relative sizes don’t mean anything.)

Note that {What AI tries to do} = {Mesa Objective} by definition.

Most classical discussion of AI alignment, including most of the book Superintelligence, is about Outer Alignment. The classic example, where an AI optimized to cure cancer kills all humans so that no one can have cancer anymore, is about a misalignment of {What Programmers want} and the {Base Objective}. (The Base Objective is {minimize the number of people who have cancer}, and while it’s not clear what the programmers want, it’s certainly not that.)

The Analogy to Evolution

Arguments about Inner Alignment often make reference to evolution. The reason is that evolution is an optimization process – it optimizes for inclusive genetic fitness. The space of all models is the space of all possible organisms.

Humans are certainly not the best model in this space – I’ve added the description on the bottom right to indicate that there are better models that haven’t been found yet. However, humans are, undoubtedly, the best model that evolution has found so far.

As with the maze example, humans themselves run optimization processes. Thus, we can call them/us Mesa Optimizers, and we can compare the Base Objective (the one evolution maximizes) with the Mesa Objective (the one humans optimize for).

  • Base Objective: maximize inclusive genetic fitness

  • Mesa Objective: avoid pain, seek pleasure

(This is simplified – some humans optimize for other things, such as the well-being of all possible minds in the universe – but those are no closer to the Base Objective.)

We can see that humans are not aligned with the base objective of evolution. And it is easy to see why – the way Evan Hubinger put it is to imagine the counterfactual world where evolution did select inner-aligned models. In this world, a baby who stubs its toe has to compute how stubbing its toe affects its inclusive genetic fitness before knowing whether or not to repeat this behavior in the future. This would be computationally expensive, whereas the “avoid pain” objective immediately tells the baby not to do it again, which is much cheaper and almost always the correct answer. Thus, an unaligned model outperforms the hypothetical aligned model.

Another interesting aspect is that the size of the misalignment (the difference between the Base Objective and the Mesa Objective) has widened over the last few millennia. In the ancestral environment, the two were pretty close, but now they are so far apart that we need to pay people to donate their sperm, which, according to the Base Objective, ought to be the most desirable action imaginable.

Consequently, the analogy might be an argument for why Inner Misalignment is probable, since it has occurred “naturally” in the biggest non-human-caused optimization process we know. However, the big caveat here is that evolution does not implement Stochastic Gradient Descent. Evolution navigates the model space by performing random mutations and then evaluating performance, which is fundamentally different (and a billion times less efficient) from modifying the model according to the expected derivative of the loss function, which is what SGD does. Thus, while the analogy works in most ways, it stops working as soon as one makes arguments that rely on properties of SGD other than that it optimizes the Base Objective.

Deceptive Alignment

This is the abridged version of the fourth part of the sequence. I’m linking to it because this is probably the part where leaving out the technical details is the most problematic.

The concept

In this section, we make the following assumptions:

  • The learning task is hard, and therefore, models are very complex. Think of a question-answering system rather than an image classifier.

  • Instead of having a single learning process, we update a model over time.

  • The learning process will select a Mesa Optimizer.

  • The Base Objective is complicated, and the model won’t get it right immediately (i.e., the model starts out not being inner-aligned).

Since the model is sophisticated, we assume that it builds up a model of the world somehow. Think of GPT-3 (the language model that can write text): it clearly recognizes whether you’re prompting it about Lord of the Rings or about politics. This shows that it has an internal model of these things, however flawed or incomplete.

Thus, if we look inside the model (which, again, the programmers cannot do), we have the following two components:

Recall that the model is a Mesa Optimizer by assumption, hence we know it has an objective. This (the red blob) is the Mesa Objective.

As its model of the world improves, it might eventually include a model of the Base Objective. Recall that the Base Objective is what SGD optimizes for.

I’ve drawn the model of the Base Objective (the purple blob) larger than the Mesa Objective since we assume the Base Objective is fairly complex.

SGD tries to make the model better, and if [the thing that the model optimizes for] becomes more similar to the Base Objective, the model does become better. Therefore, we speculate that the model will change such that this happens. We further speculate that there are three different ways this could happen, which I’ll illustrate below.

The first is called Internalization. The model adopts the Base Objective outright.

The second is called Corrigibility (also referred to as proxy alignment). The Mesa Objective is no longer represented explicitly – instead, it is a pointer to the part of the world model that includes the Base Objective.

The third is called Deceptive Alignment. Here, the Mesa Objective remains different from the Base Objective, but the model chooses to optimize the Base Objective anyway because doing so makes the learning procedure think it is aligned. Thus, the dotted pointer does not indicate that the Mesa Objective is the Base Objective, nor that it points to the Base Objective, but that the model will optimize the Base Objective for instrumental reasons, i.e., as part of its strategy for optimizing the Mesa Objective.

For deceptive alignment to be possible, we require that (a) the model knows it is in a training process, and (b) it cares about its objective long-term. For deceptive alignment to become a problem, we further require that (c) the model expects to eventually leave the training process permanently. If all of these are given, the paper uses the following example to illustrate why a model might be deceptively aligned (i.e., might optimize the Base Objective even though that’s not what it terminally wants).

In this case, the Base Objective is “go to A” and the Mesa Objective is “go to B.” If the model that wants to go to B is smart, it will go to A during the training process, since that allows it to go to B during deployment, and there are three deployment episodes vs. only two training episodes. Thus, optimizing the Base Objective during training yields better results for the Mesa Objective (in the long term) than optimizing the Mesa Objective outright. This is deceptive alignment.
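
The arithmetic behind this example can be written out explicitly (my own sketch of the toy setting; the key assumption is that a model caught going to B during training gets modified by SGD and never reaches B again):

```python
TRAINING_EPISODES = 2
DEPLOYMENT_EPISODES = 3

# Honest strategy: pursue the Mesa Objective openly. The model reaches B
# during training, but SGD then modifies it away, so it collects at most
# the training episodes' worth of B-visits.
honest_b_visits = TRAINING_EPISODES

# Deceptive strategy: go to A throughout training (looking aligned),
# survive unmodified, then go to B in every deployment episode.
deceptive_b_visits = DEPLOYMENT_EPISODES

print(honest_b_visits, deceptive_b_visits)  # deception wins: 3 > 2
```

With more deployment episodes than training episodes, the deceptive strategy strictly dominates from the mesa optimizer’s point of view.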

Buck Shlegeris (MIRI/LW) has the following analogy illustrating the three cases. Suppose the Christian God wants to create humans that do what God wants (God’s utility function is the Base Objective). The three kinds of alignment have analogs in the kinds of humans who do what God wants.

  • For internalization, think of Jesus. He does what God wants because he has the same utility function as God.

  • For corrigibility (proxy alignment), think of Martin Luther. He does what God wants because his utility function is “figure out what the Bible wants and do that.”

  • For deceptive alignment, his example was Blaise Pascal; however, I don’t need that example because I used to have a friend in middle and high school who explicitly wanted to do what the Bible said because he wanted to go to heaven. He openly admitted (to me, anyway) to having terminal goals that contradict what the Bible says, but made genuine efforts to adhere to the Bible anyway. I’m very confident that this type of person is more common than Martin Luthers or Jesuses.

One purpose of this analogy is to illustrate why deceptive models might be more probable than intuition suggests. Deceptive alignment sounds like a weird, specific kind of thing, but the analogy suggests that there are, in fact, many more deceptively aligned models out there than proxy-aligned models or models aligned via internalization. Only one utility function is the same as that of the Christian God, but many utility functions place very high utility on “don’t go to hell.” Similarly, there is a range of possible Mesa Objectives that allow for deceptive alignment, and all of them will optimize the Base Objective during training.

The above is just an analogy, but there are other reasons to expect that deception is likely. Specifically, I’ll cover two further arguments for why the alternatives are “difficult” to achieve and hence might be less likely.

Why internalization might be difficult

If a large neural network processes input data, this leads it to modify its internal state. If we refer to any modification of that state as “learning,” then we can differentiate two ways learning takes place, or two “input channels”: one through SGD, and one through processing input data.

Given how much input data is used in modern machine learning, it is a safe bet that far more bits of information are relayed via the second channel (processing input data). Thus, if GPT-3 creates an internal model of “Lord of the Rings,” this most likely occurs through input data rather than through explicit modifications by SGD.

If we further assume that processing input data doesn’t directly modify the model’s objective (the Mesa Objective), or that its model of the Base Objective is created first, it follows that representing a complex objective via internalization is harder than via “modeling” (i.e., corrigibility or deception).

In the analogy to evolution (which is still just an analogy and might not work in this case), it seems intuitive that evolution programming a detailed model of “parents” into our DNA is harder than specifying attachment to parents via a reference to our world model.

Why corrigibility might be difficult

The argument in the previous subsection does not address the difference between deceptive alignment and corrigibility, as neither of them needs to represent a complex objective explicitly.

Instead, the argument made by the paper is one of robustness. It goes like so:

  • If a model is proxy-aligned, but the pointer is wrong, this can cause the model’s performance to degrade.

  • Even if the model’s world model (the gray blob) improves to the point where it recognizes this failure, it will not correct the pointer, because the pointer’s current target is the Mesa Objective.

    • SGD can update the pointer, but this isn’t easy to get right.

  • Conversely, a deceptive model will immediately adjust its behavior if it learns that the pointer was flawed, as doing so is helpful for its objective. (This is so because its strategy is “hide my true objective for now, so that it won’t get modified and I can pursue it in deployment.”)

Note that this does not exhaust all the arguments made in the paper.


It seems to me that one under-appreciated aspect of Inner Alignment is that, even if one had the one-true-utility-function-that-is-all-you-need-to-program-into-AI, this would not, in fact, solve the alignment problem, nor even the intent-alignment part. It would merely solve outer alignment (provided the utility function can be formalized).

Another interesting point is that the plausibility of internalization (i.e., of a model representing the Base Objective explicitly) does not solely depend on the complexity of the objective. For example, evolution’s objective of “maximize inclusive genetic fitness” is quite simple, but it is still not represented explicitly, because figuring out how actions affect the objective is computationally hard. Thus, {probability of a Mesa Optimizer adopting an objective} depends at least on {complexity of objective} as well as {difficulty of assessing how actions impact the objective}.

[1] In practice, one often runs SGD multiple times with different initializations and uses the best result. Also, the output of SGD may be a linear combination of all models along the way rather than just the final model.

[2] However, there are efforts to create transparency tools to look into models. Such tools might be helpful if they become really good. Some of the proposals for building safe advanced AI explicitly include transparency tools.

[3] Technically, I believe the space consists of DNA sequences, and human minds are determined by DNA + randomness. I’m not a biologist.

[4] I don’t know enough to discuss this assumption.