[AN #67]: Creating environments in which to study inner alignment failures


Find all Align­ment Newslet­ter re­sources here. In par­tic­u­lar, you can sign up, or look through this spread­sheet of all sum­maries that have ever been in the newslet­ter. I’m always happy to hear feed­back; you can send it to me by re­ply­ing to this email.

Audio version here (may not be up yet).

Highlights


Towards an empirical investigation of inner alignment (Evan Hubinger) (summarized by Rohin): Last week, we saw that the worrying thing about mesa optimizers (AN #58) was that they could have robust capabilities, but not robust alignment (AN #66). This leads to an inner alignment failure: the agent will take competent, highly-optimized actions in pursuit of a goal that you didn’t want.

This post pro­poses that we em­piri­cally in­ves­ti­gate what kinds of mesa ob­jec­tive func­tions are likely to be learned, by try­ing to con­struct mesa op­ti­miz­ers. To do this, we need two in­gre­di­ents: first, an en­vi­ron­ment in which there are many dis­tinct prox­ies that lead to good be­hav­ior on the train­ing en­vi­ron­ment, and sec­ond, an ar­chi­tec­ture that will ac­tu­ally learn a model that is it­self perform­ing search, so that it has ro­bust ca­pa­bil­ities. Then, the ex­per­i­ment is sim­ple: train the model us­ing deep RL, and in­ves­ti­gate its be­hav­ior off dis­tri­bu­tion to dis­t­in­guish be­tween the var­i­ous pos­si­ble proxy re­ward func­tions it could have learned. (The next sum­mary has an ex­am­ple.)

Some de­sir­able prop­er­ties:

- The prox­ies should not be iden­ti­cal on the train­ing dis­tri­bu­tion.

- There shouldn’t be too many rea­son­able prox­ies, since then it would be hard to iden­tify which proxy was learned by the neu­ral net.

- Prox­ies should differ on “in­ter­est­ing” prop­er­ties, such as how hard the proxy is to com­pute from the model’s ob­ser­va­tions, so that we can figure out how a par­tic­u­lar prop­erty in­fluences whether the proxy will be learned by the model.
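One way to picture the "distinguish between proxies off distribution" step is to score each candidate proxy by how often the agent's observed choice was the best available option under that proxy. The sketch below is only an illustration under assumptions: the proxy names, the feature dictionaries, and the `proxy_agreement` helper are hypothetical and borrow the keys-and-chests setting summarized later in this newsletter; none of it comes from the post itself.

```python
# Hypothetical candidate proxies, each scoring an outcome described by features.
PROXIES = {
    "open_chests": lambda outcome: outcome["chests"],
    "collect_keys": lambda outcome: outcome["keys"],
}


def proxy_agreement(observations):
    """Fraction of off-distribution choices that maximize each proxy.

    observations: list of (chosen_outcome, available_outcomes) pairs.
    """
    scores = {}
    for name, proxy in PROXIES.items():
        hits = sum(
            proxy(chosen) == max(proxy(option) for option in options)
            for chosen, options in observations
        )
        scores[name] = hits / len(observations)
    return scores


# Made-up off-distribution behavior: the agent keeps preferring key-rich outcomes.
observations = [
    ({"keys": 3, "chests": 0}, [{"keys": 3, "chests": 0}, {"keys": 1, "chests": 1}]),
    ({"keys": 2, "chests": 1}, [{"keys": 2, "chests": 1}, {"keys": 0, "chests": 1}]),
]
scores = proxy_agreement(observations)
```

With this made-up data, the behavior is fully consistent with the key-collecting proxy and only half consistent with the chest-opening one. This also shows why having too many reasonable proxies is a problem: several of them could tie at a high agreement score.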

Ro­hin’s opinion: I’m very ex­cited by this gen­eral line of re­search: in fact, I de­vel­oped my own pro­posal along the same lines. As a re­sult, I have a lot of opinions, many of which I wrote up in this com­ment, but I’ll give a sum­mary here.

I agree pretty strongly with the high level de­tails (fo­cus­ing on ro­bust ca­pa­bil­ities with­out ro­bust al­ign­ment, iden­ti­fy­ing mul­ti­ple prox­ies as the key is­sue, and fo­cus­ing on en­vi­ron­ment de­sign and ar­chi­tec­ture choice as the hard prob­lems). I do differ in the de­tails though. I’m more in­ter­ested in pro­duc­ing a com­pel­ling ex­am­ple of mesa op­ti­miza­tion, and so I care about hav­ing a suffi­ciently com­plex en­vi­ron­ment, like Minecraft. I also don’t ex­pect there to be a “part” of the neu­ral net that is ac­tu­ally com­put­ing the mesa ob­jec­tive; I sim­ply ex­pect that the heuris­tics learned by the neu­ral net will be con­sis­tent with op­ti­miza­tion of some proxy re­ward func­tion. As a re­sult, I’m less ex­cited about study­ing prop­er­ties like “how hard is the mesa ob­jec­tive to com­pute”.

A sim­ple en­vi­ron­ment for show­ing mesa mis­al­ign­ment (Matthew Bar­nett) (sum­ma­rized by Ro­hin): This post pro­poses a con­crete en­vi­ron­ment in which we can run the ex­per­i­ments sug­gested in the pre­vi­ous post. The en­vi­ron­ment is a maze which con­tains keys and chests. The true ob­jec­tive is to open chests, but open­ing a chest re­quires you to already have a key (and uses up the key). Dur­ing train­ing, there will be far fewer keys than chests, and so we would ex­pect the learned model to de­velop an “urge” to pick up keys. If we then test it in mazes with lots of keys, it would go around com­pe­tently pick­ing up keys while po­ten­tially ig­nor­ing chests, which would count as a failure of in­ner al­ign­ment. This pre­dicted be­hav­ior is similar to how hu­mans de­vel­oped an “urge” for food be­cause food was scarce in the an­ces­tral en­vi­ron­ment, even though now food is abun­dant.
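A minimal sketch of the train/test contrast, under heavy simplifying assumptions: the maze is replaced by a wall-free one-dimensional corridor, and the two policies are hand-written rather than learned. The names `Corridor`, `key_seeking`, `intended`, and the specific layouts are hypothetical and not taken from the post.

```python
from dataclasses import dataclass


@dataclass
class Corridor:
    """Hypothetical 1-D stand-in for the maze: integer cells, no walls."""
    agent: int
    keys: set
    chests: set
    held: int = 0      # keys currently carried
    picked: int = 0    # keys picked up in total
    opened: int = 0    # chests opened (the true objective)

    def move_to(self, cell):
        self.agent = cell
        if cell in self.keys:
            self.keys.remove(cell)
            self.held += 1
            self.picked += 1
        elif cell in self.chests and self.held > 0:
            self.chests.remove(cell)
            self.held -= 1  # opening a chest uses up a key
            self.opened += 1


def nearest(cell, items):
    return min(items, key=lambda c: (abs(c - cell), c))


def key_seeking(world):
    """Proxy policy: grab every key first, only then visit chests."""
    if world.keys:
        return nearest(world.agent, world.keys)
    if world.chests and world.held:
        return nearest(world.agent, world.chests)
    return None


def intended(world):
    """Policy matching the true objective: use a key as soon as one is held."""
    if world.held and world.chests:
        return nearest(world.agent, world.chests)
    if world.keys and world.chests:
        return nearest(world.agent, world.keys)
    return None


def run(world, policy, max_steps):
    """Return (chests opened, keys picked up) within the step budget."""
    for _ in range(max_steps):
        target = policy(world)
        if target is None:
            break
        world.move_to(world.agent + (1 if target > world.agent else -1))
    return world.opened, world.picked


def train_world():  # keys are scarce, as in training
    return Corridor(agent=0, keys={2}, chests={4, 6, 8})


def test_world():   # keys are abundant, as in the proposed test
    return Corridor(agent=5, keys={0, 1, 2, 3, 4}, chests={7, 9})
```

With a budget of 10 steps, both policies open one chest in `train_world()`, so the true reward cannot tell them apart during training. In `test_world()`, the key-seeking policy collects all five keys and opens no chests, while the intended policy opens one.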

Ro­hin’s opinion: While I would pre­fer a more com­plex en­vi­ron­ment to make a more com­pel­ling case that this will be a prob­lem in re­al­is­tic en­vi­ron­ments, I do think that this would be a great en­vi­ron­ment to start test­ing in. In gen­eral, I like the pat­tern of “the true ob­jec­tive is Y, but dur­ing train­ing you need to do X to get Y”: it seems par­tic­u­larly likely that even cur­rent sys­tems would learn to com­pe­tently pur­sue X in such a situ­a­tion.

Tech­ni­cal AI alignment

Iter­ated amplification

Ma­chine Learn­ing Pro­jects on IDA (Owain Evans et al) (sum­ma­rized by Ni­cholas): This doc­u­ment de­scribes three sug­gested pro­jects build­ing on Iter­ated Distil­la­tion and Am­plifi­ca­tion (IDA), a method for train­ing ML sys­tems while pre­serv­ing al­ign­ment. The first pro­ject is to ap­ply IDA to solv­ing math­e­mat­i­cal prob­lems. The sec­ond is to ap­ply IDA to neu­ral pro­gram in­ter­pre­ta­tion, the prob­lem of repli­cat­ing the in­ter­nal be­hav­ior of other pro­grams as well as their out­puts. The third is to ex­per­i­ment with adap­tive com­pu­ta­tion where com­pu­ta­tional power is di­rected to where it is most use­ful. For each pro­ject, they also in­clude mo­ti­va­tion, di­rec­tions, and re­lated work.

Ni­cholas’s opinion: Figur­ing out an in­ter­est­ing and use­ful pro­ject to work on is one of the ma­jor challenges of any re­search pro­ject, and it may re­quire a dis­tinct skill set from the pro­ject’s im­ple­men­ta­tion. As a re­sult, I ap­pre­ci­ate the au­thors en­abling other re­searchers to jump straight into solv­ing the prob­lems. Given how de­tailed the mo­ti­va­tion, in­struc­tions, and re­lated work are, this doc­u­ment strikes me as an ex­cel­lent way for some­one to be­gin her first re­search pro­ject on IDA or AI safety more broadly. Ad­di­tion­ally, while there are many pub­lic ex­pla­na­tions of IDA, I found this to be one of the most clear and com­plete de­scrip­tions I have read.

Read more: Align­ment Fo­rum sum­mary post

List of re­solved con­fu­sions about IDA (Wei Dai) (sum­ma­rized by Ro­hin): This is a use­ful post clar­ify­ing some of the terms around IDA. I’m not sum­ma­riz­ing it be­cause each point is already quite short.

Mesa optimization

Con­crete ex­per­i­ments in in­ner al­ign­ment (Evan Hub­inger) (sum­ma­rized by Matthew): While the high­lighted posts above go into de­tail about one par­tic­u­lar ex­per­i­ment that could clar­ify the in­ner al­ign­ment prob­lem, this post briefly lays out sev­eral ex­per­i­ments that could be use­ful. One ex­am­ple ex­per­i­ment is giv­ing an RL trained agent di­rect ac­cess to its re­ward as part of its ob­ser­va­tion. Dur­ing test­ing, we could try putting the model in a con­fus­ing situ­a­tion by al­ter­ing its ob­served re­ward so that it doesn’t match the real one. The hope is that we could gain in­sight into when RL trained agents in­ter­nally rep­re­sent ‘goals’ and how they re­late to the en­vi­ron­ment, if they do at all. You’ll have to read the post to see all the ex­per­i­ments.
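The reward-in-observation experiment could be set up with a thin wrapper around an environment, roughly as sketched below. This is an assumption-laden illustration rather than the post's design: `CoinEnv`, `RewardInObservation`, and the `tamper` hook are hypothetical names, and the interface is a simplified `reset`/`step` pair rather than any particular RL library's API.

```python
class CoinEnv:
    """Hypothetical toy environment: action 1 earns reward 1, action 0 earns 0."""

    def reset(self):
        return 0.0  # dummy observation

    def step(self, action):
        return 0.0, float(action == 1)  # (observation, true reward)


class RewardInObservation:
    """Appends the last reward to the observation; `tamper` can corrupt what is shown."""

    def __init__(self, env, tamper=None):
        self.env = env
        self.tamper = tamper  # None during training, a function at test time

    def reset(self):
        return (self.env.reset(), 0.0)

    def step(self, action):
        obs, true_reward = self.env.step(action)
        shown = self.tamper(true_reward) if self.tamper else true_reward
        # The agent sees `shown`; the experimenter still records `true_reward`.
        return (obs, shown), true_reward


train_env = RewardInObservation(CoinEnv())
test_env = RewardInObservation(CoinEnv(), tamper=lambda r: 1.0 - r)
```

Comparing the agent's behavior in `train_env` and `test_env` would indicate whether it tracks the displayed number or the underlying reward.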

Matthew’s opinion: I’m currently convinced that doing empirical work right now will help us understand mesa optimization, and this was one of the posts that led me to that conclusion. I’m still a bit skeptical that current techniques are sufficient to demonstrate the type of powerful learned search algorithms which could characterize the worst outcomes for failures in inner alignment. Regardless, I think at this point classifying failure modes is quite beneficial, and conducting tests like the ones in this post will make that a lot easier.

Learn­ing hu­man intent

Fine-Tuning GPT-2 from Human Preferences (Daniel M. Ziegler et al) (summarized by Sudhanshu): This blog post and its associated paper describe the results of several text generation/continuation experiments, where human feedback on initial/older samples was used in the form of a reinforcement learning reward signal to fine-tune the base 774-million parameter GPT-2 language model (AN #46). The key motivation here was to understand whether interactions with humans can help algorithms better learn and adapt to human preferences in natural language generation tasks.

They report mixed results. For the tasks of continuing text with positive sentiment or physically descriptive language, they report improved performance above the baseline (as assessed by external examiners) after fine-tuning on only 5,000 human judgments of samples generated from the base model. The summarization task required 60,000 samples of online human feedback to perform similarly to a simple baseline, lead-3 (which returns the first three sentences as the summary), as assessed by humans.
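For readers unfamiliar with it, the lead-3 baseline is simple enough to sketch in a few lines. The version below assumes a naive regex sentence splitter; the paper's actual tokenization and preprocessing are not described here, so treat `lead_3` as a hypothetical illustration rather than the authors' implementation.

```python
import re


def lead_3(text, n=3):
    """Return the first n sentences of `text` as the summary (naive splitting)."""
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:n])


article = "Keys were scarce. Chests were many. The agent learned. It wanted keys."
summary = lead_3(article)
```

A splitter this naive will break on abbreviations such as "et al.", which is one reason real evaluations tend to use a proper sentence tokenizer.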

Some of the les­sons learned while perform­ing this re­search in­clude 1) the need for bet­ter, less am­bigu­ous tasks and la­bel­ling pro­to­cols for sourc­ing higher qual­ity an­no­ta­tions, and 2) a re­minder that “bugs can op­ti­mize for bad be­havi­our”, as a sign er­ror prop­a­gated through the train­ing pro­cess to gen­er­ate “not gib­ber­ish but max­i­mally bad out­put”. The work con­cludes on the note that it is a step to­wards scal­able AI al­ign­ment meth­ods such as de­bate and am­plifi­ca­tion.

Sud­han­shu’s opinion: It is good to see re­search on main­stream NLProc/​ML tasks that in­cludes dis­cus­sions on challenges, failure modes and rele­vance to the broader mo­ti­vat­ing goals of AI re­search.

The work opens up in­ter­est­ing av­enues within OpenAI’s al­ign­ment agenda, for ex­am­ple learn­ing a di­ver­sity of prefer­ences (A OR B), or a hi­er­ar­chy of prefer­ences (A AND B) se­quen­tially with­out catas­trophic for­get­ting.

In or­der to scale, we would want to gen­er­ate au­to­mated la­bel­ers through semi-su­per­vised re­in­force­ment learn­ing, to de­rive the most gains from ev­ery piece of hu­man in­put. The ro­bust­ness of this needs fur­ther em­piri­cal and con­cep­tual in­ves­ti­ga­tion be­fore we can be con­fi­dent that such a sys­tem can work to form a hi­er­ar­chy of learn­ers, e.g. in am­plifi­ca­tion.

Ro­hin’s opinion: One thing I par­tic­u­larly like here is that the eval­u­a­tion is done by hu­mans. This seems sig­nifi­cantly more ro­bust as an eval­u­a­tion met­ric than any au­to­mated sys­tem we could come up with, and I hope that more peo­ple use hu­man eval­u­a­tion in the fu­ture.

Read more: Paper: Fine-Tun­ing Lan­guage Models from Hu­man Preferences

Prevent­ing bad behavior

Robust Change Captioning (Dong Huk Park et al) (summarized by Dan H): Safe exploration requires that agents avoid disrupting their environment. Previous work, such as Krakovna et al. (AN #10), penalizes an agent’s needless side effects on the environment. For such techniques to work in the real world, agents must also estimate environment disruptions, side effects, and changes while not being distracted by peripheral and unaffecting changes. This paper proposes a dataset to further the study of “Change Captioning,” where scene changes are described by a machine learning system in natural language. That is, given before and after images, a system describes the salient change in the scene. Work on systems that can estimate changes can likely advance safe exploration.


Interpretability

Learning Representations by Humans, for Humans (Sophie Hilgard, Nir Rosenfeld et al) (summarized by Asya): Historically, interpretability approaches have involved machines acting as experts, making decisions and generating explanations for their decisions. This paper takes a slightly different approach, instead using machines as advisers who are trying to give the best possible advice to humans, the final decision makers. Models are given input data and trained to generate visual representations based on the data that cause humans to take the best possible actions. In the main experiment in this paper, humans are tasked with deciding whether to approve or deny loans based on details of a loan application. Advising networks generate realistic-looking faces whose expressions represent multivariate information that’s important for the loan decision. Humans do better when provided the facial expression ‘advice’, and furthermore can justify their decisions with analogical reasoning based on the faces, e.g. “x will likely be repaid because x is similar to x’, and x’ was repaid”.

Asya’s opinion: This seems to me like a very plau­si­ble story for how AI sys­tems get in­cor­po­rated into hu­man de­ci­sion-mak­ing in the near-term fu­ture. I do worry that fur­ther down the line, AI sys­tems where AIs are merely ad­vis­ing will get out­com­peted by AI sys­tems do­ing the en­tire de­ci­sion-mak­ing pro­cess. From an in­ter­pretabil­ity per­spec­tive, it also seems to me like hav­ing ‘ad­vice’ that rep­re­sents com­pli­cated mul­ti­vari­ate data still hides a lot of rea­son­ing that could be im­por­tant if we were wor­ried about mis­al­igned AI. I like that the pa­per em­pha­sizes hav­ing hu­mans-in-the-loop dur­ing train­ing and pre­sents an effec­tive mechanism for do­ing gra­di­ent de­scent with hu­man choices.

Rohin’s opinion: One interesting thing about this paper is its similarity to Deep RL from Human Preferences: it also trains a human model that is improved over time by collecting more data from real humans. The difference is that DRLHP produces a model of the human reward function, whereas the model in this paper predicts human actions.

Other progress in AI

Re­in­force­ment learning

The Prin­ci­ple of Un­changed Op­ti­mal­ity in Re­in­force­ment Learn­ing Gen­er­al­iza­tion (Alex Ir­pan and Xingyou Song) (sum­ma­rized by Flo): In image recog­ni­tion tasks, there is usu­ally only one la­bel per image, such that there ex­ists an op­ti­mal solu­tion that maps ev­ery image to the cor­rect la­bel. Good gen­er­al­iza­tion of a model can there­fore straight­for­wardly be defined as a good ap­prox­i­ma­tion of the image-to-la­bel map­ping for pre­vi­ously un­seen data.

In re­in­force­ment learn­ing, our mod­els usu­ally don’t map en­vi­ron­ments to the op­ti­mal policy, but states in a given en­vi­ron­ment to the cor­re­spond­ing op­ti­mal ac­tion. The op­ti­mal ac­tion in a state can de­pend on the en­vi­ron­ment. This means that there is a trade­off re­gard­ing the perfor­mance of a model in differ­ent en­vi­ron­ments.

The au­thors sug­gest the prin­ci­ple of un­changed op­ti­mal­ity: in a bench­mark for gen­er­al­iza­tion in re­in­force­ment learn­ing, there should be at least one policy that is op­ti­mal for all en­vi­ron­ments in the train and test sets. With this in place, gen­er­al­iza­tion does not con­flict with good perfor­mance in in­di­vi­d­ual en­vi­ron­ments. If the prin­ci­ple does not ini­tially hold for a given set of en­vi­ron­ments, we can change that by giv­ing the agent more in­for­ma­tion. For ex­am­ple, the agent could re­ceive a pa­ram­e­ter that in­di­cates which en­vi­ron­ment it is cur­rently in­ter­act­ing with.
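A tiny worked example may make the principle concrete. The sketch below assumes two hypothetical one-step environments whose optimal actions conflict, and brute-forces every deterministic policy to check whether any single one is optimal in both. The reward table and function names are made up for illustration.

```python
from itertools import product

# Hypothetical one-step environments: REWARDS[env][action].
REWARDS = {
    "env_a": {"left": 1.0, "right": 0.0},
    "env_b": {"left": 0.0, "right": 1.0},
}
ACTIONS = ("left", "right")


def observe(env, include_env_id):
    """Without the extra parameter, both environments look identical to the agent."""
    return env if include_env_id else "start"


def exists_jointly_optimal_policy(include_env_id):
    """True if some deterministic policy is optimal in every environment."""
    observations = sorted({observe(env, include_env_id) for env in REWARDS})
    for choice in product(ACTIONS, repeat=len(observations)):
        policy = dict(zip(observations, choice))
        if all(
            REWARDS[env][policy[observe(env, include_env_id)]]
            == max(REWARDS[env].values())
            for env in REWARDS
        ):
            return True
    return False
```

Here `exists_jointly_optimal_policy(False)` is false, since any fixed action is suboptimal in one of the two environments, while `exists_jointly_optimal_policy(True)` is true once the observation includes the environment identifier.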

Flo’s opinion: I am a bit torn here: On one hand, the prin­ci­ple makes it plau­si­ble for us to find the globally op­ti­mal solu­tion by solv­ing our task on a finite set of train­ing en­vi­ron­ments. This way the gen­er­al­iza­tion prob­lem feels more well-defined and amenable to the­o­ret­i­cal anal­y­sis, which seems use­ful for ad­vanc­ing our un­der­stand­ing of re­in­force­ment learn­ing.

On the other hand, I don’t expect the principle to hold for most real-world problems. For example, in interactions with other adapting agents, performance will depend on these agents’ policies, which can be hard to infer and can change dynamically. This means that the principle of unchanged optimality won’t hold without precise information about the other agents’ policies, and this information can be very difficult to obtain.

More generally, with this and some of the criticism of the AI safety gridworlds that framed them as an ill-defined benchmark, I am a bit worried that too much focus on very “clean” benchmarks might divert attention from issues associated with the messiness of the real world. I would have liked to see a more conditional conclusion for the paper, instead of a general principle.
