Following human norms

So far we have been talking about how to learn “values” or “instrumental goals”. This would be necessary if we want to figure out how to build an AI system that does exactly what we want it to do. However, we’re probably fine if we can keep learning and building better AI systems. This suggests that it’s sufficient to build AI systems that don’t screw up so badly that they end this process. If we accomplish that, then steady progress in AI will eventually get us to AI systems that do what we want.

So, it might be helpful to break down the problem of learning values into the subproblems of learning what to do, and learning what not to do. Standard AI research will continue to make progress on learning what to do; catastrophe happens when our AI system doesn’t know what not to do. This is the part that we need to make progress on.

This is a problem that humans have to solve as well. Children learn basic norms such as not to litter, not to take other people’s things, what not to say in public, etc. As argued in Incomplete Contracting and AI alignment, no contract between humans is ever explicitly spelled out in full; instead, it relies on an external unwritten normative structure under which the contract is interpreted. (Even if we don’t explicitly ask our cleaner not to break any vases, we still expect them not to intentionally do so.) We might hope to build AI systems that infer and follow these norms, and thereby avoid catastrophe.

It’s worth noting that this will probably not be an instance of narrow value learning, since there are several differences:

  • Narrow value learning requires that you learn what to do, unlike norm inference.

  • Norm following requires learning from a complex domain (human society), whereas narrow value learning can be applied in simpler domains as well.

  • Norms are a property of groups of agents, whereas narrow value learning can be applied in settings with a single agent.

Despite this, I have included it in this sequence because it is plausible to me that value learning techniques will be relevant to norm inference.

Paradise prospects

With a norm-following AI system, the success story is primarily around accelerating our rate of progress. Humans remain in charge of the overall trajectory of the future, and we use AI systems as tools that enable us to make better decisions and create better technologies, which looks like “superhuman intelligence” from our vantage point today.

If we still want an AI system that colonizes space and optimizes it according to our values without our supervision, we can figure out what our values are over a period of reflection, solve the alignment problem for goal-directed AI systems, and then create such an AI system.

This is quite similar to the success story in a world with Comprehensive AI Services.

Plausible proposals

As far as I can tell, there has not been very much work on learning what not to do. Existing approaches like impact measures and mild optimization aim to define what not to do rather than learn it.

One approach is to scale up techniques for narrow value learning. It seems plausible that in sufficiently complex environments, these techniques will learn what not to do, even though they are primarily focused on what to do in current benchmarks. For example, if I see that you have a clean carpet, I can infer that it is a norm not to walk over the carpet with muddy shoes. If you have an unbroken vase, I can infer that it is a norm to avoid knocking it over. This paper of mine shows how you can reach these sorts of conclusions with narrow value learning (specifically a variant of IRL).
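The vase example can be made concrete with a toy Bayesian calculation. This is only an illustrative sketch, not the algorithm from the paper: the idea is that an intact vase after many timesteps is strong evidence that the human avoids breaking it, because a human indifferent to the vase would likely have broken it by now. The hypotheses, probabilities, and function name here are all invented for illustration.

```python
# Toy sketch (not the paper's actual algorithm): infer a norm from the
# state of the environment. We compare two hypotheses about the human:
# they care about the vase (and so never break it), or they are
# indifferent (and break it with some small probability each timestep).

def posterior_cares(prior_cares=0.5, p_break_if_indifferent=0.05, timesteps=100):
    """P(human cares about the vase | vase still intact after `timesteps`)."""
    lik_cares = 1.0                                       # careful human: vase survives
    lik_indiff = (1 - p_break_if_indifferent) ** timesteps  # indifferent: unlikely to survive
    prior_indiff = 1 - prior_cares
    evidence = prior_cares * lik_cares + prior_indiff * lik_indiff
    return prior_cares * lik_cares / evidence

posterior_cares()  # ≈ 0.994: the intact vase makes "don't break the vase" a likely norm
```

With no observation time (`timesteps=0`) the posterior stays at the prior, which matches the intuition that the environment state only carries information once the human has had a chance to act in it.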

Another approach would be to scale up work on ad hoc teamwork. In ad hoc teamwork, an AI agent must learn to work in a team with a bunch of other agents, without any prior coordination. While current applications are very task-based (e.g. playing soccer as a team), it seems possible that as this is applied to more realistic environments, the resulting agents will need to infer the norms of the group they are introduced into. It’s particularly nice because it explicitly models the multiagent setting, which seems crucial for inferring norms. It can also be thought of as an alternative statement of the problem of AI safety: how do you “drop” an AI agent into a “team” of humans, and have the AI agent coordinate well with that “team”?
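A minimal caricature of the ad hoc teamwork idea, with all details invented for illustration: the dropped-in agent has no prior coordination with its teammates, so it observes their behavior and adopts whichever convention the group already follows.

```python
from collections import Counter

# Hypothetical sketch of ad hoc teamwork: a newly introduced agent
# infers the team's existing convention (here, which side to pass on)
# from observed teammate behavior, rather than from any prior agreement.

def infer_convention(observed_actions):
    """Return the action the teammates most commonly take."""
    return Counter(observed_actions).most_common(1)[0][0]

team_history = ["pass_left", "pass_left", "pass_right", "pass_left"]
my_action = infer_convention(team_history)  # adopt the majority convention: "pass_left"
```

Real ad hoc teamwork methods model teammates far more richly than a majority vote, but the structure is the same: the norm is read off the group, not specified in advance.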

Potential pros

Value learning is hard, not least because it’s hard to define what values are, and we don’t know our own values to the extent that they exist at all. However, we do seem to do a pretty good job of learning society’s norms. So perhaps this problem is significantly easier to solve. Note that this is an argument that norm-following is easier than ambitious value learning, not that it is easier than other approaches such as corrigibility.

It also feels easier to work on inferring norms right now. We have many examples of norms that we follow, so we can more easily evaluate whether current systems are good at following norms. In addition, ad hoc teamwork seems like a good start at formalizing the problem, which we still don’t really have for “values”.

This also more closely mirrors our tried-and-true techniques for solving the principal-agent problem for humans: there is a shared, external system of norms that everyone is expected to follow, and systems of law and punishment are interpreted with respect to these norms. For a much more thorough discussion, see Incomplete Contracting and AI alignment, particularly Section 5, which also argues that norm following will be necessary for value alignment (whereas I’m arguing that it is plausibly sufficient to avoid catastrophe).

One potential confusion: the paper says “We do not mean by this embedding into the AI the particular norms and values of a human community. We think this is as impossible a task as writing a complete contract.” I believe the meaning here is that we should not try to define the particular norms and values, not that we shouldn’t try to learn them. (In fact, later they say “Aligning AI with human values, then, will require figuring out how to build the technical tools that will allow a robot to replicate the human agent’s ability to read and predict the responses of human normative structure, whatever its content.”)

Perilous pitfalls

What additional things could go wrong with powerful norm-following AI systems? That is, what are some problems that might arise that wouldn’t arise with a successful approach to ambitious value learning?

  • Powerful AI likely leads to rapidly evolving technologies, which might require rapidly changing norms. Norm-following AI systems might not be able to help us develop good norms, or might not be able to adapt quickly enough to new norms. (One class of problems in this category: we would not be addressing human safety problems.)

  • Norm-following AI systems may be uncompetitive because the norms might overly restrict the possible actions available to the AI system, reducing novelty relative to more traditional goal-directed AI systems. (Move 37 would likely not have happened if AlphaGo were trained to “follow human norms” for Go.)

  • Norms are more like soft constraints on behavior, as opposed to goals that can be optimized. Current ML focuses a lot more on optimization than on constraints, and so it’s not clear if we could build a competitive norm-following AI system (though see e.g. Constrained Policy Optimization).

  • Relatedly, learning what not to do imposes a limitation on behavior. If an AI system is goal-directed, then given sufficient intelligence it will likely find a nearest unblocked strategy.
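The optimization-versus-constraints contrast above can be sketched in a few lines. This is a toy illustration with made-up actions and scores, not how Constrained Policy Optimization actually works (that method enforces expected-cost constraints during policy-gradient training): the point is just that a pure reward maximizer picks the norm-violating action, while a constrained chooser rules it out first.

```python
# Toy contrast: pure reward optimization vs. treating a norm as a
# constraint on which actions are even considered. All actions,
# rewards, and violation scores here are invented for illustration.

actions = {
    "mop_floor":        {"reward": 5, "norm_violation": 0.0},
    "shove_owner_away": {"reward": 9, "norm_violation": 1.0},  # faster, but violates a norm
}

def best_action(actions, max_violation=None):
    candidates = actions.items()
    if max_violation is not None:
        # Constraint view: filter out norm-violating actions before optimizing.
        candidates = [(name, info) for name, info in candidates
                      if info["norm_violation"] <= max_violation]
    return max(candidates, key=lambda item: item[1]["reward"])[0]

best_action(actions)                     # pure optimization -> "shove_owner_away"
best_action(actions, max_violation=0.1)  # constrained       -> "mop_floor"
```

The filter-then-optimize structure is the hard constraint limit; a softer version would subtract a penalty proportional to the violation, which is closer to how constraints are folded into current ML objectives.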


One promising approach to AI alignment is to teach AI systems to infer and follow human norms. While this by itself will not produce an AI system aligned with human values, it may be sufficient to avoid catastrophe. It seems more tractable than approaches that require us to infer values to a degree sufficient to avoid catastrophe, particularly because humans are proof that the problem is soluble.

However, there are still many conceptual problems. Most notably, norm following is not obviously expressible as an optimization problem, and so may be hard to integrate into current AI approaches.