[AN #72]: Alignment, robustness, methodology, and system building as research priorities for AI safety

Link post

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter. I’m always happy to hear feedback; you can send it to me by replying to this email.

Audio version here (may not be up yet).

Highlights

AI Alignment Research Overview (Jacob Steinhardt) (summarized by Dan H): It has been over three years since Concrete Problems in AI Safety. Since that time we have learned more about the structure of the safety problem. This document represents an updated taxonomy of problems relevant for AI alignment. Jacob Steinhardt decomposes the remaining technical work into “technical alignment (the overcoming of conceptual or engineering issues needed to create aligned AI), detecting failures (the development of tools for proactively assessing the safety/alignment of a system or approach), methodological understanding (best practices backed up by experience), and system-building (how to tie together the three preceding categories in the context of many engineers working on a large system).”

The first topic under “technical alignment” is “Out-of-Distribution Robustness,” which receives more emphasis than it did in Concrete Problems. Out-of-Distribution Robustness is in part motivated by the fact that transformative AI will lead to substantial changes to the real world, and we would like our systems to perform well even under these large and possibly rapid data shifts. Specific subproblems include work on adversarial examples and out-of-distribution detection. Next, the problem of Reward Learning is described. Its challenges include learning human values and ensuring that those lossily represented human values remain aligned under extreme optimization. While we have attained more conceptual clarity about reward learning since Concrete Problems, reward learning still remains largely “uncharted,” and it is still not clear “how to approach the problem.” The next section, on Scalable Reward Generation, points out that, in the future, labeling meaning or providing human oversight will prove increasingly difficult. Next, he proposes that we ought to study how to make systems “act conservatively,” such as endowing systems with the ability to activate a conservative fallback routine when they are uncertain. The final topic under technical alignment is Counterfactual Reasoning. Here one possible direction is building a family of simulated environments in which to generate counterfactuals.
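
To make the “act conservatively” idea concrete, here is a minimal sketch (my own illustration, not from Steinhardt’s document) of a policy wrapper that falls back to a safe routine whenever the model’s confidence drops below a threshold; the max-softmax confidence measure, the 0.9 threshold, and the `safe_fallback_action` placeholder are all illustrative assumptions.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def act_conservatively(logits, safe_fallback_action, threshold=0.9):
    """Take the model's preferred action only when it is confident.

    Both the max-softmax confidence measure and the 0.9 threshold are
    illustrative choices; `safe_fallback_action` stands in for any
    pre-specified conservative routine (e.g. stop and ask a human).
    """
    probs = softmax(np.asarray(logits, dtype=float))
    if probs.max() < threshold:       # low confidence: possibly out-of-distribution
        return safe_fallback_action
    return int(np.argmax(probs))      # high confidence: act normally

print(act_conservatively([5.0, 0.1, 0.2], safe_fallback_action=-1))  # 0 (confident)
print(act_conservatively([1.0, 0.9, 1.1], safe_fallback_action=-1))  # -1 (falls back)
```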

The “technical alignment” section makes up the majority of this document. Later sections such as “Detecting Failures in Advance” highlight the importance of deep neural network visualization and recent model stress-test datasets. “Methodological Understanding” suggests that we are more likely to build aligned AI systems if we improve our best practices for building and evaluating models, and “System Building” speculates about how to do this for future multi-faceted ML systems.

Dan H’s opinion: This is a welcome update to Concrete Problems since it is slightly more concrete and current, and it discusses improving safety in both deep learning and RL rather than mostly RL. While the document mentions many problems, the set of problems retains precision and fortunately does not include every capabilities concern that may possibly one day impact safety. A takeaway is that value learning and model transparency still need groundwork, but fortunately other problems, including out-of-distribution robustness, are more concretized and mostly need time and continued effort.

Rohin’s opinion: One thing I particularly like about this agenda is that the connection to AI alignment is significantly clearer than in Concrete Problems.

Technical AI alignment

Iterated amplification

Ought Progress Update October 2019 (Jungwon Byun and Andreas Stuhlmüller) (summarized by Rohin): While this update provides details about Ought as a whole, I will focus only on the research they’ve done. As a reminder, Ought’s goal is to understand how we can delegate hard questions to machine and human experts. They initially focused on Factored Cognition (AN #36), where each question was recursively decomposed into subquestions that would help find the answer. They now call this “Factored Generation”, and use “Factored Cognition” as the umbrella term for any approach that helps answer hard questions using some kind of decomposition.

While Ought has run some ML experiments aimed at automating decompositions, they have primarily worked on human experiments for Factored Evaluation (AN #62), where the decompositions are used to help evaluate answers. They are focusing on evaluating answers instead of finding answers for a few reasons. First, there may already exist experts whose answers we can use: this can be directly plugged into Factored Evaluation, but not Factored Generation. Second, evaluation is often easier than generation. Third, factored evaluation is more amenable to time-saving tricks like only focusing on a single branch of the tree (as in AI safety via debate (AN #5)). Finally, when training ML systems, if a model is unable to exactly imitate the experts, we need something that can distinguish between better and worse behavior, which is possible with Factored Evaluation but not Factored Generation. (See also Against Mimicry.)

The experiments involve a judge trying to decide which of two answers to a question is best, where one answer is provided by an honest expert and one by a malicious expert. (Subquestions may also get different answers from the two experts.) The goal is to find a setup where the judge reliably picks the honest answer, regardless of how powerful the malicious expert is: if this could be established, it would be strong evidence that the setup incentivizes honest answers. Of course, this requires us to quantify over all malicious strategies, but humans may not find the best malicious strategies. To mitigate this, the experiments provide the malicious expert with lots of advantages that the honest one doesn’t have, to simulate very powerful malicious experts.
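
To make the setup more concrete, here is a toy sketch (my own rendering, not Ought’s actual protocol) of single-branch factored evaluation: the judge recursively expands one disputed subquestion at a time and only judges the leaf claims directly. In the real experiments `judge` and `pick_branch` would be humans; here they are placeholder callables.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Claim:
    """One expert's answer to a question, optionally backed by answers to subquestions."""
    question: str
    answer: str
    subclaims: List["Claim"] = field(default_factory=list)

def factored_evaluation(honest: Claim, malicious: Claim,
                        judge: Callable[[Claim, Claim], int],
                        pick_branch: Callable[[Claim, Claim], int],
                        depth: int = 3) -> int:
    """Return 0 if the judge ends up favouring the honest expert, 1 otherwise.

    Assumes both experts answer the same list of subquestions. Rather than
    reading whole argument trees, only one disputed branch is expanded
    (the time-saving trick mentioned above), and the leaves are judged directly.
    """
    if depth == 0 or not honest.subclaims or not malicious.subclaims:
        return judge(honest, malicious)
    i = pick_branch(honest, malicious)  # choose which subquestion to drill into
    return factored_evaluation(honest.subclaims[i], malicious.subclaims[i],
                               judge, pick_branch, depth - 1)
```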

They have already learned some lessons from their experiments. Initially, they hoped participants would develop good meta-level strategies for determining the truth. (Meta-level here means that the strategies would generalize to other domains, e.g. a heuristic of always splitting an answer into separate claims and asking for the evidence for each claim separately.) They found that these strategies don’t emerge organically, and so they are planning to spend concentrated staff time on finding good strategies. They also found that malicious experts sometimes won due to avoidable mistakes, and they are hoping to eliminate this by ensembling work from multiple people for increased robustness.

Rohin’s opinion: This is distinct progress since the last update, though I think the experiments are still exploratory enough that it’s hard to have any big takeaways. The difficulty of generating good strategies suggests that it’s particularly important that we train our human overseers (as suggested in AI Safety Needs Social Scientists (AN #47)) to provide the right kind of feedback, for example if we would like them to reward only corrigible reasoning (AN #35). I’m particularly excited for the next update, where we could see experiments powerful enough to come to more solid conclusions.

Learning human intent

Norms, Rewards, and the Intentional Stance: Comparing Machine Learning Approaches to Ethical Training (Daniel Kasenberg et al) (summarized by Asya) (H/T Xuan Tan): This paper argues that norm inference is a plausible alternative to inverse reinforcement learning (IRL) for teaching a system what people want. Existing IRL algorithms rely on the Markov assumption: that the next state of the world depends only on the previous state of the world and the action that the agent takes from that state, rather than on the agent’s entire history. In cases where information about the past matters, IRL will either fail to infer the right reward function, or will be forced to make challenging guesses about what past information to encode in each state. By contrast, norm inference tries to infer what (potentially temporal) propositions encode the reward of the system, keeping around only the past information that is relevant to evaluating potential propositions. The paper argues that norm inference results in more interpretable systems that generalize better than IRL: systems that use norm inference can successfully model reward-driven agents, but systems that use IRL do poorly at learning temporal norms.
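
As a rough illustration of why history-dependence matters here (my own toy example, not from the paper), compare a Markov reward, which can only look at the current state, with a temporal norm in the spirit of the paper’s Linear Temporal Logic propositions, which has to be evaluated over the whole trajectory; the state names are invented for the example.

```python
def markov_reward(state, action):
    """A standard Markov reward: depends only on the current state and action."""
    return 1.0 if state == "at_goal" else 0.0

def temporal_norm_reward(trajectory):
    """A history-dependent 'norm' in the spirit of an LTL proposition such as
    G(picked_up_package -> F delivered_package): whenever a package is picked
    up, it must eventually be delivered. This cannot be scored from a single
    (state, action) pair, only from the history.
    """
    holding = False
    for state in trajectory:
        if state == "pickup":
            holding = True
        elif state == "deliver":
            holding = False
    return 1.0 if not holding else -1.0  # violated if still holding at the end

print(temporal_norm_reward(["start", "pickup", "move", "deliver"]))  # 1.0 (satisfied)
print(temporal_norm_reward(["start", "pickup", "move", "move"]))     # -1.0 (violated)
```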

Asya’s opinion: This paper presents an interesting novel alternative to inverse reinforcement learning and does a good job of acknowledging potential objections. Deciding whether and how to store information about the past seems like an important problem that inverse reinforcement learning has to reckon with. My main concern with norm inference, which the paper mentions, is that optimizing over all possible propositions is extremely slow in practice. I don’t anticipate that norm inference will be a computationally tractable strategy unless a lot of computing power is available.

Rohin’s opinion: The idea of “norms” used here is very different from what I usually imagine, as in e.g. Following human norms (AN #42). Usually, I think of norms as imposing a constraint upon policies rather than defining an optimal policy, (often) specifying what not to do rather than what to do, and being a property of groups of agents rather than of a single agent. (See also this comment.) The “norms” in this paper don’t satisfy any of these properties: I would describe their norm inference as performing IRL with history-dependent reward functions, with a strong inductive bias towards “logical” reward functions (which comes from their use of Linear Temporal Logic). Note that some inductive bias is necessary, since without it history-dependent reward functions are far too expressive, and nothing could reasonably be learned. I think that despite how it’s written, the paper should be taken not as a denouncement of IRL-the-paradigm, but as a proposal for better IRL algorithms that are quite different from the ones we currently have.

Improving Deep Reinforcement Learning in Minecraft with Action Advice (Spencer Frazier et al) (summarized by Asya): This paper uses maze traversal in Minecraft to look at the extent to which human advice can help with aliasing in 3D environments, the problem where many states share nearly identical visual features. The paper compares two advice-giving algorithms, both relying on neural nets that are trained to explore and predict the utilities of possible actions while sometimes accepting human advice. The two algorithms differ primarily in whether they provide advice for the current action only, or provide advice that persists for several actions.

Experimental results suggest that both algorithms, but especially the one that applies advice to multiple actions, help with the problem of 3D aliasing, potentially because the system can rely on the movement advice it got in previous timesteps rather than having to discern tricky visual features in the moment. The paper also varies the frequency and accuracy of the advice given, and finds that receiving more advice significantly improves performance, even if that advice is only 50% accurate.
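
The persistence idea can be sketched roughly as follows (my own simplification, not the paper’s exact algorithm): remembered advice biases action selection for a few timesteps, after which the agent falls back to its learned utilities. The follow probability and persistence window are illustrative.

```python
import random

def act_with_persistent_advice(utilities, advice_action, advice_steps_left,
                               follow_prob=0.8):
    """Choose an action from learned utilities and (possibly stale) human advice.

    `utilities` maps actions to the agent's learned utility estimates;
    `advice_action` is the most recent human advice, followed with probability
    `follow_prob` while `advice_steps_left` > 0. Returns the chosen action and
    the remaining persistence budget.
    """
    if advice_steps_left > 0 and random.random() < follow_prob:
        action = advice_action                      # lean on remembered advice
    else:
        action = max(utilities, key=utilities.get)  # fall back to learned utilities
    return action, max(advice_steps_left - 1, 0)

# Advice to go "left" persists for 3 steps, useful when visual features are aliased.
utilities = {"left": 0.51, "right": 0.49}
action, remaining = act_with_persistent_advice(utilities, "left", advice_steps_left=3)
```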

Asya’s opinion: I like this paper, largely because learning from advice hasn’t been applied much to 3D worlds, and this is a compelling proof of concept. I think it’s also a noteworthy, though expected, result that advice that sticks temporally helps a lot when the ground-truth visual evidence is difficult to interpret.

Forecasting

Two explanations for variation in human abilities (Matthew Barnett) (summarized by Flo): How quickly might AI exceed human capabilities? One piece of evidence is the variation of intelligence within humans: if there isn’t much variation, we might expect AI not to stay at human-level intelligence for long. It has been argued that variation in human cognitive abilities is small compared to such variation for arbitrary agents. However, the variation of human ability in games like chess seems to be quite pronounced, and it took chess computers more than forty years to transition from beginner level to beating the best humans. The blog post presents two arguments to reconcile these perspectives:

First, similar minds could have large variation in learning ability: if we break a random part of a complex machine, it might perform worse or stop working altogether, even if the broken machine is very similar to the unbroken one. Variation in human learning ability might be mostly explainable by lots of small “broken parts” like harmful mutations.

Second, small variation in learning ability can be consistent with large variation in competence, if the latter is explained by variation in another factor like practice time. For example, a chess match is not very useful for determining who’s smarter if one of the players has played a lot more games than the other. This perspective also reframes AlphaGo’s superhumanity: the version that beat Lee Sedol had played around 2000 times as many games as he had.

Flo’s opinion: I liked this post and am glad it highlighted the distinction between learning ability and competence, a distinction that often seems to be ignored in debates about AI progress. I would be excited to see some further exploration of the “broken parts” model and its implications for the differing variances in cognitive abilities between humans and arbitrary intelligences.

Miscellaneous (Alignment)

Chris Olah’s views on AGI safety (Evan Hubinger) (summarized by Matthew): This post is Evan’s best attempt to summarize Chris Olah’s views on how transparency is a vital component of building safe artificial intelligence, which he divides into four separate approaches:

First, we can apply interpretability to audit our neural networks, or in other words, catch problematic reasoning in our models. Second, transparency can help safety by allowing researchers to deliberately structure their models in ways that systematically work, rather than using machine learning as a black box. Third, understanding transparency allows us to directly incentivize transparency in model design and decisions, similar to how we grade humans on their reasoning (not just the correct answer) by having them show their work. Fourth, transparency might allow us to reorient the field of AI towards microscope AI: AI that gives us new ways of understanding the world, enabling us to be more capable, without itself taking autonomous actions.

Chris expects that his main disagreement with others is over whether good transparency is possible as models become more complex. He hypothesizes that as models become more advanced, they will counterintuitively become more interpretable, as they will begin using crisper, more human-relatable abstractions. Finally, Chris recognizes that his view implies that we might have to re-align the ML community, but he remains optimistic because he believes there’s a lot of low-hanging fruit, research into interpretability allows low-budget labs to remain competitive, and interpretability is aligned with the scientific virtue of understanding our tools.

Matthew’s opinion: Developing transparency tools is currently my best guess for how we can avoid deception and catastrophic planning in our AI systems. I’m most excited about applying transparency techniques via the first and third routes, which primarily help us audit our models. I’m more pessimistic about the fourth approach because it predictably involves restructuring the incentives for machine learning as a field, which is quite difficult. My opinion might be different if we could somehow coordinate the development of these technologies.

Misconceptions about continuous takeoff (Matthew Barnett) (summarized by Flo): This post attempts to clarify the author’s notion of continuous AI takeoff, defined as the growth of future AI capabilities being in line with extrapolation from current trends. In particular, that means that no AI project is going to bring sudden large gains in capabilities compared to its predecessors.

Such a continuous takeoff does not necessarily have to be slow. For example, generative adversarial networks have become better quite rapidly during the last five years, but progress has still been piecemeal. Furthermore, exponential gains, for example due to recursive self-improvement, can be consistent with a continuous takeoff, as long as the gains from one iteration of the improvement process are modest. However, this means that a continuous takeoff does not preclude large power differentials from arising: slight advantages can compound over time, and actors might use their lead in AI development to their strategic advantage even absent discontinuous progress, much like Western Europe used its technological advantage to conquer most of the world.
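
A quick back-of-the-envelope illustration of how modest per-iteration gains can still compound into a large lead (the numbers are arbitrary, purely for illustration):

```python
def capability_after(initial, gain_per_iteration, iterations):
    """Each improvement iteration gives a modest multiplicative gain:
    no single step is discontinuous, but the trajectory is exponential."""
    return initial * (1 + gain_per_iteration) ** iterations

# Two actors improving continuously; one has a slightly larger per-step gain.
leader = capability_after(1.0, 0.05, 100)    # ~131.5
follower = capability_after(1.0, 0.04, 100)  # ~50.5
print(leader / follower)                     # ~2.6x lead, with no discontinuity anywhere
```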

Knowing whether or not AI takeoff happens continuously is important for alignment research: a continuous takeoff would allow for more of an attitude of “dealing with things as they come up”, and we should shift our focus to the specific aspects that are hard to deal with as they come up. If the takeoff is not continuous, an agent might rapidly gain capabilities relative to the rest of civilization, and it becomes important to rule out problems long before they come up.

Flo’s opinion: I believe that it is quite important to be aware of the implications that different forms of takeoff have for our prioritization, and I am glad that the article highlights this. However, I am a bit worried that this very broad definition of continuous progress limits the usefulness of the concept. For example, it seems plausible that a recursively self-improving agent that is very hard to deal with once deployed still improves its capabilities slowly enough to fit the definition, especially if its developer has a significant lead over others.

AI strategy and policy

Special Report: AI Policy and China – Realities of State-Led Development

Other progress in AI

Reinforcement learning

Let’s Discuss OpenAI’s Rubik’s Cube Result (Alex Irpan) (summarized by Rohin): This post makes many points about OpenAI’s Rubik’s cube result (AN #70), but I’m only going to focus on two. First, the result is a major success for OpenAI’s focus on design decisions that encourage long-term research success. In particular, it relied heavily on the engineering-heavy model surgery and policy distillation capabilities that allow them to modify e.g. the architecture in the middle of a training run (which we’ve seen with OpenAI Five (AN #19)). Second, the domain randomization doesn’t help as much as you might think: OpenAI needed to put a significant amount of effort into improving the simulation to get these results, tripling the number of successes on a face rotation task. Intuitively, we still need to put a lot of effort into getting the simulation to be “near” reality, and then domain randomization can take care of the last little bit needed to robustly transfer to reality. Given that domain randomization isn’t doing that much, it’s not clear if the paradigm of zero-shot sim-to-real transfer is the right one to pursue. To quote the post’s conclusion: “I see two endgames here. In one, robot learning reduces to building rich simulators that are well-instrumented for randomization, then using ludicrous amounts of compute across those simulators. In the other, randomization is never good enough to be more than a bootstrapping step before real robot data, no matter what the compute situation looks like. Both seem plausible to me, and we’ll see how things shake out.”
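
For readers less familiar with domain randomization, here is a minimal sketch of the general idea (the parameter names and ranges are invented for illustration and are not OpenAI’s actual randomization scheme): each training episode runs in a simulator whose physical parameters are perturbed around a carefully calibrated base simulation, so the learned policy cannot overfit to any one set of simulation parameters.

```python
import random

def randomized_sim_params(base):
    """Sample one perturbed simulator configuration around the calibrated base.
    Parameter names and ranges are illustrative only."""
    return {
        "friction":   base["friction"]   * random.uniform(0.7, 1.3),
        "cube_size":  base["cube_size"]  * random.uniform(0.95, 1.05),
        "motor_gain": base["motor_gain"] * random.uniform(0.8, 1.2),
        "latency_ms": base["latency_ms"] + random.uniform(0.0, 40.0),
    }

base_sim = {"friction": 1.0, "cube_size": 0.057, "motor_gain": 1.0, "latency_ms": 10.0}

for episode in range(3):
    params = randomized_sim_params(base_sim)
    # env = make_hand_env(**params)   # hypothetical environment constructor
    # run_episode(policy, env)        # train in a differently perturbed simulator each episode
```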

Rohin’s opinion: As usual, Alex’s analysis is spot on, and I have nothing to add beyond strong agreement.