[AN #120]: Tracing the intellectual roots of AI and AI alignment

Link post

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).


The Alignment Problem (Brian Christian) (summarized by Rohin): This book starts off with an explanation of machine learning and problems that we can currently see with it, including detailed stories and analysis of:

- The gorilla misclassification incident

- The faulty reward in CoastRunners

- The gender bias in language models

- The failure of facial recognition models on minorities

- The COMPAS controversy (leading up to impossibility results in fairness)

- The neural net that thought asthma reduced the risk of pneumonia

It then moves on to agency and reinforcement learning, covering from a more historical and academic perspective how we arrived at ideas such as temporal difference learning, reward shaping, curriculum design, and curiosity, across the fields of machine learning, behavioral psychology, and neuroscience. While the connections aren't always explicit, a knowledgeable reader can connect the academic examples given in these chapters to the ideas of specification gaming (AN #97) and mesa optimization (AN #58) that we talk about frequently in this newsletter. Chapter 5 especially highlights that agent design is not just a matter of specifying a reward: often, rewards will do ~nothing, and the main requirement to get a competent agent is to provide good shaping rewards or a good curriculum. Just as in the previous part, Brian traces the intellectual history of these ideas, providing detailed stories of (for example):

- B. F. Skinner's experiments in training pigeons

- The invention of the perceptron

- The success of TD-Gammon, and later AlphaGo Zero

The final part, titled "Normativity", delves much more deeply into the alignment problem. While the previous two parts are partially organized around AI capabilities (how to get AI systems that optimize for their objectives), this last one tackles head-on the problem that we want AI systems that optimize for our (often unknown) objectives, covering such topics as imitation learning, inverse reinforcement learning, learning from preferences, iterated amplification, impact regularization, calibrated uncertainty estimates, and moral uncertainty.

Ro­hin’s opinion: I re­ally en­joyed this book, pri­mar­ily be­cause of the trac­ing of the in­tel­lec­tual his­tory of var­i­ous ideas. While I knew of most of these ideas, and some­times also who ini­tially came up with the ideas, it’s much more en­gag­ing to read the de­tailed sto­ries of how that per­son came to de­velop the idea; Brian’s book de­liv­ers this again and again, func­tion­ing like a well-or­ga­nized liter­a­ture sur­vey that is also fun to read be­cause of its great sto­ry­tel­ling. I strug­gled a fair amount in writ­ing this sum­mary, be­cause I kept want­ing to some­how com­mu­ni­cate the writ­ing style; in the end I de­cided not to do it and to in­stead give a few ex­am­ples of pas­sages from the book in this post.



Clar­ify­ing “What failure looks like” (part 1) (Sam Clarke) (sum­ma­rized by Ro­hin): The first sce­nario out­lined in What failure looks like (AN #50) stems from a failure to spec­ify what we ac­tu­ally want, so that we in­stead build AI sys­tems that pur­sue prox­ies of what we want in­stead. As AI sys­tems be­come re­spon­si­ble for more of the econ­omy, hu­man val­ues be­come less in­fluen­tial rel­a­tive to the proxy ob­jec­tives the AI sys­tems pur­sue, and as a re­sult we lose con­trol over the fu­ture. This post aims to clar­ify whether such a sce­nario leads to lock in, where we are stuck with the state of af­fairs and can­not cor­rect it to get “back on course”. It iden­ti­fies five fac­tors which make this more likely:

1. Collective action problems: Many human institutions will face competitive (short-term) pressures to deploy AI systems with bad proxies, even if it isn't in humanity's long-term interest.

2. Regulatory capture: Influential people (such as CEOs of AI companies) may benefit from AI systems that optimize proxies, and so oppose measures to fix the issue (e.g. by banning such AI systems).

3. Ambiguity: There may be genuine ambiguity about whether it is better to have these AI systems that optimize for proxies, even from a long-term perspective, especially because all clear and easy-to-define metrics will likely be going up (since those can be turned into proxy objectives).

4. Dependency: AI systems may become so embedded in society that society can no longer function without them.

5. Opposition: The AI systems themselves may oppose any fixes we propose.

We can also look at historical precedents. Factors 1-3 have played an important role in climate change, though if it does lead to lock-in, this will be "because of physics", unlike the case with AI. The agricultural revolution, which arguably made human life significantly worse, still persisted thanks to its productivity gains (factor 1) and the loss of hunter-gathering skills (factor 4). When the British colonized New Zealand, the Maori people lost significant control over their future, because each individual chief needed guns (factor 1), trading with the British genuinely made them better off initially (factor 3), and eventually the British turned to manipulation, confiscation and conflict (factor 5).

With AI in particular, we might expect that an increase in misinformation and echo chambers exacerbates ambiguity (factor 3), and that due to its general-purpose nature, dependency (factor 4) may be more of a risk.

The post also suggests some future directions for estimating the severity of lock-in for this failure mode.

Ro­hin’s opinion: I think this topic is im­por­tant and the post did it jus­tice. I feel like fac­tors 4 and 5 (de­pen­dency and op­po­si­tion) cap­ture the rea­sons I ex­pect lock in, with fac­tors 1-3 as less im­por­tant but still rele­vant mechanisms. I also re­ally liked the anal­ogy with the Bri­tish coloniza­tion of New Zealand—it felt like it was in fact quite analo­gous to how I’d ex­pect this sort of failure to hap­pen.

“Unsupervised” translation as an (intent) alignment problem (Paul Christiano) (summarized by Rohin): We have previously seen that a major challenge for alignment is that our models may learn inaccessible information (AN #104) that we cannot extract from them, because we do not know how to provide a learning signal to train them to output such information. This post proposes unsupervised translation as a particular concrete problem to ground this out.

Suppose we have lots of English text, and lots of Klingon text, but no translations from English to Klingon (or vice versa), and no bilingual speakers. If we train GPT on the text, it will probably develop a good understanding of both English and Klingon, such that it "should" have the ability to translate between the two (at least approximately). How can we get it to actually (try to) do so? Existing methods (both in unsupervised translation and in AI alignment) do not seem to meet this bar.

One vague hope is that we could train a helper agent such that a human can perform next-word prediction on Klingon with the assistance of the helper agent, using a method like the one in Learning the prior (AN #109).


Dynamical Distance Learning for Semi-Supervised and Unsupervised Skill Discovery (Kristian Hartikainen et al) (summarized by Robert): In reinforcement learning (RL), reward function specification is a central problem in training a successful policy. For a large class of tasks, we can frame the problem as goal-directed RL: giving a policy a representation of a goal (for example, coordinates on a map, or a picture of a location) and training the policy to reach this goal. In this setting, the naive reward function would give a reward of 1 when the policy reaches the goal state (or very close to it), and a reward of 0 otherwise. However, this makes it difficult to train the correct policy, as it will need to explore randomly for a long time before finding the true reward. Instead, if we had a notion of distance within the environment, we could use the negative distance from the goal state as the reward function; this would give the policy good information about which direction it should be moving in, even if it hasn't yet found the reward.
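The contrast between the sparse reward and the distance-shaped reward can be sketched as follows. This is a minimal illustration, not the paper's implementation; the Euclidean `euclid` function merely stands in for whatever distance notion is available:

```python
import numpy as np

def sparse_reward(state, goal, eps=1e-6):
    # Reward 1 only when the agent is (essentially) at the goal, 0 elsewhere.
    return 1.0 if np.linalg.norm(state - goal) < eps else 0.0

def shaped_reward(state, goal, distance):
    # Negative distance to the goal: increases as the agent gets closer,
    # so every step toward the goal produces an informative signal.
    return -distance(state, goal)

# Euclidean distance stands in for a learned or environment-given distance.
euclid = lambda s, g: float(np.linalg.norm(s - g))
goal = np.array([5.0, 5.0])
far, near = np.array([0.0, 0.0]), np.array([4.0, 5.0])

# The sparse reward is identical everywhere except at the goal...
assert sparse_reward(far, goal) == sparse_reward(near, goal) == 0.0
# ...while the shaped reward already distinguishes "closer" from "farther".
assert shaped_reward(near, goal, euclid) > shaped_reward(far, goal, euclid)
```

Under the sparse reward, a random explorer gets no feedback until it stumbles onto the goal; under the shaped reward, every state carries gradient-like information.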

This paper is about how to learn a distance function in an unsupervised manner, such that it's useful for shaping the reward of an RL policy. Given an environment without a reward function, and starting with a random goal-directed policy, they alternate between (1) choosing a state s* to train the policy to reach, and (2) training a distance function d(s*, s') which measures the minimum number of environment steps it takes for the policy to reach a state s* from a different state s'. This distance function is trained with supervised learning using data collected by the policy acting in the environment, and is called the Dynamical Distance, as it measures the distance with respect to the environment dynamics and policy behaviour.
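A minimal sketch of how supervised targets for such a distance function could be built from a single rollout. This is hypothetical code, not the authors' implementation; it uses the simplest instance of the idea, namely that the distance from state s_i to state s_j along a rollout is the j - i environment steps the policy took between them:

```python
def dynamical_distance_targets(trajectory):
    """Given one rollout [s_0, ..., s_T] of the current policy, build
    supervised regression pairs (start_state, goal_state, steps): the
    dynamical distance from s_i to s_j (for i < j) is simply the number
    of environment steps j - i the policy took between visiting them."""
    pairs = []
    for i in range(len(trajectory)):
        for j in range(i + 1, len(trajectory)):
            pairs.append((trajectory[i], trajectory[j], float(j - i)))
    return pairs

# Hypothetical 1-D rollout: states visited at successive timesteps.
data = dynamical_distance_targets([0.0, 1.0, 1.5, 3.0])
assert (0.0, 3.0, 3.0) in data   # s_0 to s_3 took 3 environment steps
assert len(data) == 6            # T*(T-1)/2 pairs for T = 4 states
```

A regression model fit to these (start, goal, steps) triples would then serve as d(s*, s') for reward shaping, and would be refit as the policy improves.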

The key choice in implementing this algorithm is how states are chosen to train the policy (step 1). In the first implementation, the authors choose the state which is farthest from the current state or the starting state, to encourage better long-term planning and skills in the policy and better generalisation in the agent. In the second (and more relevant) implementation, the state is chosen from a selection of random states by a human who is trying to express a preference for a given goal state. This effectively trains the policy to be able to reach states which match human preferences. This second method outperforms Deep RL from Human Preferences in terms of sample efficiency of human queries in learning human preferences across a range of locomotion tasks.

Robert’s opinion: What’s most in­ter­est­ing about this pa­per (from an al­ign­ment per­spec­tive) is the in­creased sam­ple effi­ciency of the learn­ing of hu­man prefer­ences, by limit­ing the type of prefer­ences that can be ex­pressed to prefer­ences over goal states in a goal-di­rected set­ting. While not all prefer­ences could be cap­tured this way, I think a large amount of them in a large num­ber of set­tings could be—it might come down to cre­at­ing a clever en­cod­ing of the task as goal-di­rected in a way an RL policy could learn.

Aligning Superhuman AI and Human Behavior: Chess as a Model System (Reid McIlroy-Young et al) (summarized by Rohin) (H/T Dylan Hadfield-Menell): Current AI systems are usually focused on some well-defined performance metric. However, as AI systems become more intelligent, we would presumably want to have humans learn from and collaborate with such systems. This is currently challenging since our superintelligent AI systems are quite hard to understand and don't act in human-like ways.

The authors aim to study this general issue within chess, where we have access both to superintelligent AI systems and lots of human-generated data. (Note: I'll talk about "ratings" below; these are not necessarily Elo ratings and should just be thought of as some "score" that functions similarly to Elo.) The authors are interested in whether AI systems play in a human-like way and can be used as a way of understanding human gameplay. One particularly notable aspect of human gameplay is that there is a wide range in skill: as a result we would like an AI system that can make predictions conditioned on varying skill levels.

For existing algorithms, the authors analyze the traditional Stockfish engine and the newer Leela (an open-source version of AlphaZero (AN #36)). They can get varying skill levels by changing the depth of the tree search (in Stockfish) or changing the amount of training (in Leela).

For Stockfish, they find that regardless of search depth, Stockfish action distributions monotonically increase in accuracy as the skill of the human goes up, even when the depth of the search leads to a Stockfish agent with a skill rating similar to that of an amateur human. (In other words, if you take a low-Elo Stockfish agent and treat it as a predictive model of human players, it is never a great predictive model, but it is best at predicting human experts, not human amateurs.) This demonstrates that Stockfish plays very differently than humans.

Leela on the other hand is somewhat more human-like: when its rating is under 2700, its accuracy is highest on amateur humans; at a rating of 2700 its accuracy is about constant across humans; and above 2700 its accuracy is highest on expert humans. However, its accuracy is still low, and the most competent Leela model is always the best predictor of human play (rather than the Leela model with the skill level most similar to the human whose actions are being predicted).

The authors then develop their own method, Maia. They talk about it as a "modification of the AlphaZero architecture", but as far as I can tell it is simply behavior cloning using the neural net architecture used by Leela. As you might expect, this does significantly better, and finally satisfies the property we would intuitively want: the best predictive model for a human of some skill level is the one that was trained on the data from humans at that skill level.
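Behavior cloning here just means maximizing the likelihood of the moves humans actually played. A minimal numpy sketch of that objective (hypothetical code illustrating the technique, not Maia's actual training pipeline):

```python
import numpy as np

def behavior_cloning_loss(logits, human_moves):
    """Cross-entropy between the network's move distribution and the move
    the human actually played: the standard behavior-cloning objective.
    logits has shape (batch, num_moves); human_moves holds move indices."""
    # Numerically stabilized log-softmax over the candidate-move logits.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    # Negative mean log-likelihood of the human's chosen moves.
    return -log_probs[np.arange(len(human_moves)), human_moves].mean()

# Two positions, three candidate moves; the human played move 0, then move 2.
logits = np.array([[2.0, 0.1, 0.1],
                   [0.1, 0.1, 2.0]])
loss_good = behavior_cloning_loss(logits, np.array([0, 2]))
loss_bad = behavior_cloning_loss(logits, np.array([1, 1]))
assert loss_good < loss_bad  # loss is lower when predictions match the human
```

Training separate models on games from each rating band, under this objective, is what yields the skill-conditioned predictors described above.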

They also investigate a bunch of other scenarios, such as decisions in which there is a clear best action and decisions where humans tend to make mistakes, and find that the models behave as you'd expect (for example, when there's a clear best action, model accuracy increases across the board).

Ro­hin’s opinion: While I found the mo­ti­va­tion and de­scrip­tion of this pa­per some­what un­clear or mis­lead­ing (Maia seems to me to be iden­ti­cal to be­hav­ior clon­ing, in which case it would not just be a “con­nec­tion”), the ex­per­i­ments they run are pretty cool and it was in­ter­est­ing to see the pretty stark differ­ences be­tween mod­els trained on a perfor­mance met­ric and mod­els trained to imi­tate hu­mans.



Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems (Sergey Levine et al) (summarized by Zach): The authors give an overview of offline reinforcement learning with the aim that readers gain enough familiarity to start thinking about how to make contributions in this area. The utility of a fully offline RL framework is significant: just as supervised learning methods have been able to utilize data for generalizable and powerful pattern recognition, offline RL methods could enable data to be funneled into decision-making machines for applications such as healthcare, robotics, and recommender systems. The article is organized into a section on formulation and another on benchmarks, followed by a section on applications and a general discussion.

In the formulation portion of the review, the authors give an overview of the offline learning problem and then discuss a number of approaches. Broadly speaking, the biggest challenge is the need for counterfactual reasoning, because the agent must learn from data collected by another agent. Thus, the agent is forced to reason about what would happen if a different decision were taken. Importance sampling, approximate dynamic programming, and offline model-based approaches are discussed as possible approaches to this counterfactual reasoning problem. In the benchmarks section, the authors review evaluation techniques for offline RL methods. While the authors find that there are many domain-specific evaluations, general benchmarking is less well established. A major issue in creating benchmarks is deciding whether to use diverse trajectories/replay buffer data, or only the final expert policy.
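As one concrete instance of this counterfactual reasoning, ordinary importance sampling reweights the returns of the behavior policy's trajectories by the likelihood ratio of the new policy. A minimal sketch (hypothetical code, with `pi_new` and `pi_behavior` as stand-in action-probability functions):

```python
import numpy as np

def importance_sampled_return(trajectories, pi_new, pi_behavior):
    """Ordinary importance-sampling estimate of a new policy's expected
    return, using only trajectories collected by a different behavior
    policy. Each trajectory is a list of (state, action, reward) tuples;
    pi_new and pi_behavior map (state, action) to action probabilities."""
    estimates = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for s, a, r in traj:
            # Cumulative likelihood ratio corrects for the fact that the
            # data came from pi_behavior rather than pi_new.
            weight *= pi_new(s, a) / pi_behavior(s, a)
            ret += r
        estimates.append(weight * ret)
    return float(np.mean(estimates))

# Toy check: behavior policy uniform over 2 actions; the new policy always
# picks action 1, so a trajectory that took action 1 is upweighted by 2x.
pi_b = lambda s, a: 0.5
pi_new = lambda s, a: 1.0 if a == 1 else 0.0
assert importance_sampled_return([[(0, 1, 1.0)]], pi_new, pi_b) == 2.0
```

The product of per-step ratios makes the variance of this estimator grow quickly with horizon length, which is one reason the survey treats long-horizon, high-dimensional tasks as a weak point for importance sampling.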

In the discussion, the authors argue that while importance sampling and dynamic programming work on low-dimensional and short-horizon tasks, they struggle to integrate well with function approximators. On the other hand, the authors see approaches that constrain the space of policies to be near the dataset as a promising direction to mitigate the effects of distributional shift. However, the authors acknowledge that it may ultimately take more systematic datasets to push the field forward.

Zach’s opinion: This was a great overview of the state of the field. A re­cur­ring theme that the au­thors high­light is that offline RL re­quires coun­ter­fac­tual rea­son­ing which may be fun­da­men­tally difficult to achieve be­cause of dis­tri­bu­tional shift. Some re­sults shown in the pa­per sug­gest that offline RL may just be fun­da­men­tally hard. How­ever, I find my­self shar­ing op­ti­mism with the au­thors on the sub­ject of policy con­straint tech­niques and the in­evitable im­por­tance of bet­ter datasets.


State of AI Report 2020 (Nathan Benaich and Ian Hogarth) (summarized by Rohin): The third State of AI (AN #15) report is out! I won't go into details here since there is really quite a lot of information, but I recommend scrolling through the presentation to get a sense of what's been going on. I was particularly interested in their 8 predictions for the next year: most of them seemed like they were going out on a limb, predicting something that isn't just "the default continues". On last year's 6 predictions, 4 were correct, 1 was wrong, and 1 was technically wrong but quite close to being correct; even this 67% accuracy would be pretty impressive on this year's 8 predictions. (It does seem to me that last year's predictions were more run-of-the-mill, but that might just be hindsight bias.)


Hiring engineers and researchers to help align GPT-3 (Paul Christiano) (summarized by Rohin): The Reflection team at OpenAI is hiring ML engineers and ML researchers to push forward work on aligning GPT-3. Their most recent results are described in Learning to Summarize with Human Feedback (AN #116).


I’m always happy to hear feed­back; you can send it to me, Ro­hin Shah, by re­ply­ing to this email.


An audio podcast version of the Alignment Newsletter is available, recorded by Robert Miles.