Alignment Newsletter #33

Find all Align­ment Newslet­ter re­sources here. In par­tic­u­lar, you can sign up, or look through the database of all sum­maries.

One cor­rec­tion to last week’s newslet­ter: the ti­tle Is Ro­bust­ness at the Cost of Ac­cu­racy should have been Is Ro­bust­ness the Cost of Ac­cu­racy.


Re­ward learn­ing from hu­man prefer­ences and demon­stra­tions in Atari (Borja Ibarz et al): We have had lots of work on learn­ing from prefer­ences, demon­stra­tions, proxy re­wards, nat­u­ral lan­guage, rank­ings etc. How­ever, most such work fo­cuses on one of these modes of learn­ing, some­times com­bined with an ex­plicit re­ward func­tion. This work learns to play Atari games us­ing both prefer­ence and demon­stra­tion in­for­ma­tion. They start out with a set of ex­pert demon­stra­tions which are used to ini­tial­ize a policy us­ing be­hav­ioral clon­ing. They also use the demon­stra­tions to train a re­ward model us­ing the DQfD al­gorithm. They then con­tinue train­ing the re­ward and policy si­mul­ta­neously, where the policy is trained on re­wards from the re­ward model, while the re­ward model is trained us­ing prefer­ence in­for­ma­tion (col­lected and used in the same way as Deep RL from Hu­man Prefer­ences) and the ex­pert demon­stra­tions. They then pre­sent a lot of ex­per­i­men­tal re­sults. The main thing I got out of the ex­per­i­ments is that when demon­stra­tions are good (near op­ti­mal), they con­vey a lot of in­for­ma­tion about how to perform the task, lead­ing to high re­ward, but when they are not good, they will ac­tively hurt perfor­mance, since the al­gorithm as­sumes that the demon­stra­tions are high qual­ity and the demon­stra­tions “over­ride” the more ac­cu­rate in­for­ma­tion col­lected via prefer­ences. They also show re­sults on effi­ciency, the qual­ity of the re­ward model, and the re­ward hack­ing that can oc­cur if you don’t con­tinue train­ing the re­ward model alongside the policy.

Ro­hin’s opinion: I’m ex­cited to see work that com­bines in­for­ma­tion from mul­ti­ple sources! In gen­eral with mul­ti­ple sources you have the prob­lem of figur­ing out what to do when the sources of in­for­ma­tion con­flict, and this is no ex­cep­tion. Their ap­proach tends to pri­ori­tize demon­stra­tions over prefer­ences when the two con­flict, and so in cases where the prefer­ences are bet­ter (as in En­duro) their ap­proach performs poorly. I’m some­what sur­prised that they pri­ori­tize de­mos over prefer­ences, since it seems hu­mans would be more re­li­able at pro­vid­ing prefer­ences than de­mos, but per­haps they needed to give de­mos more in­fluence over the policy in or­der to have the policy learn rea­son­ably quickly. I’d be in­ter­ested in see­ing work that tries to use the de­mos as much as pos­si­ble, but de­tect when con­flicts hap­pen and pri­ori­tize the prefer­ences in that situ­a­tion—my guess is that this would let you get good perfor­mance across most Atari games.
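To make the preference half of the training loop concrete, here is a minimal sketch of a pairwise preference loss in the style of Deep RL from Human Preferences: the reward model's summed predictions over two clips are turned into a probability that the human prefers the first clip, and the model is penalized with cross-entropy when it disagrees with the human. The function names and the toy numbers are hypothetical, and this is not the paper's code.

```python
import math

def preference_probability(rewards_a, rewards_b):
    """P(clip A is preferred) under a Bradley-Terry model on summed rewards."""
    diff = sum(rewards_a) - sum(rewards_b)
    return 1.0 / (1.0 + math.exp(-diff))

def preference_loss(rewards_a, rewards_b, human_prefers_a):
    """Cross-entropy between the predicted preference and the human's label."""
    p_a = preference_probability(rewards_a, rewards_b)
    p_label = p_a if human_prefers_a else 1.0 - p_a
    return -math.log(p_label)

# Toy per-step rewards predicted by a reward model for two clips (made-up numbers).
clip_a = [0.5, 0.2, 0.3]
clip_b = [0.1, 0.0, 0.4]

loss_agree = preference_loss(clip_a, clip_b, human_prefers_a=True)
loss_disagree = preference_loss(clip_a, clip_b, human_prefers_a=False)
```

The loss is lower when the human label agrees with the ordering the reward model already predicts, so gradient steps on this loss push the predicted rewards toward the human's ranking.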

Tech­ni­cal AI alignment

Embed­ded agency sequence

Embed­ded Agency (full-text ver­sion) (Scott Garrabrant and Abram Dem­ski): This is the text ver­sion of all of the pre­vi­ous posts in the se­quence.

Iter­ated am­plifi­ca­tion sequence

The Steer­ing Prob­lem (Paul Chris­ti­ano): The steer­ing prob­lem refers to the prob­lem of writ­ing a pro­gram that uses black-box hu­man-level cog­ni­tive abil­ities to be as use­ful as a well-mo­ti­vated hu­man Hugh (that is, a hu­man who is “try­ing” to be helpful). This is a con­cep­tual prob­lem—we don’t have black-box ac­cess to hu­man-level cog­ni­tive abil­ities yet. How­ever, we can build suit­able for­mal­iza­tions and solve the steer­ing prob­lem within those for­mal­iza­tions, from which we can learn gen­er­al­iz­able in­sights that we can ap­ply to the prob­lem we will ac­tu­ally face once we have strong AI ca­pa­bil­ities. For ex­am­ple, we could for­mal­ize “hu­man-level cog­ni­tive abil­ities” as Hugh-level perfor­mance on ques­tion-an­swer­ing (yes-no ques­tions in nat­u­ral lan­guage), on­line learn­ing (given a se­quence of la­beled data points, pre­dict the la­bel of the next data point), or em­bod­ied re­in­force­ment learn­ing. A pro­gram P is more use­ful than Hugh for X if, for ev­ery pro­ject us­ing a simu­la­tion of Hugh to ac­com­plish X, we can effi­ciently trans­form it into a new pro­ject which uses P to ac­com­plish X.

Ro­hin’s opinion: This is an in­ter­est­ing per­spec­tive on the AI safety prob­lem. I re­ally like the ethos of this post, where there isn’t a huge op­po­si­tion be­tween AI ca­pa­bil­ities and AI safety, but in­stead we are sim­ply try­ing to figure out how to use the (helpful!) ca­pa­bil­ities de­vel­oped by AI re­searchers to do use­ful things.

If I think about this from the per­spec­tive of re­duc­ing ex­is­ten­tial risk, it seems like you also need to make the ar­gu­ment that AI sys­tems are un­likely to pose an ex­is­ten­tial threat be­fore they are hu­man-level (a claim I mostly agree with), or that the solu­tions will gen­er­al­ize to sub-hu­man-level AI sys­tems.

Clar­ify­ing “AI Align­ment” (Paul Chris­ti­ano): I pre­vi­ously sum­ma­rized this in AN #2, but I’ll con­sider it in more de­tail now. As Paul uses the term, “AI al­ign­ment” refers only to the prob­lem of figur­ing out how to build an AI that is try­ing to do what hu­mans want. In par­tic­u­lar, an AI can be al­igned but still make mis­takes be­cause of in­com­pe­tence. This is not a for­mal defi­ni­tion, since we don’t have a good way of talk­ing about the “mo­ti­va­tion” of an AI sys­tem, or about “what hu­mans want”, but Paul ex­pects that it will cor­re­spond to some pre­cise no­tion af­ter we make more progress.

Rohin’s opinion: Ultimately, our goal is to build AI systems that reliably do what we want them to do. One way of decomposing this is first to define the behavior that we want from an AI system, and then to figure out how to obtain that behavior, which we might call the definition-optimization decomposition. Ambitious value learning aims to solve the definition subproblem. I interpret this post as proposing a different decomposition of the overall problem. One subproblem is how to build an AI system that is trying to do what we want, and the second subproblem is how to make the AI competent enough that it actually does what we want. I like this motivation-competence decomposition for a few reasons, which I’ve laid out in a long comment that I strongly encourage you to read. The summary of that comment is: motivation-competence isolates the urgent part in a single subproblem (motivation), humans are an existence proof that the motivation subproblem can be solved, it is possible to apply the motivation framework to systems with lower capabilities, the safety guarantees degrade slowly and smoothly, the definition-optimization decomposition as exemplified by expected utility maximizers has generated primarily negative results, and motivation-competence allows for interaction between the AI system and humans. The major con is that the motivation-competence decomposition is informal, imprecise, and may be intractable to work on.

An un­al­igned bench­mark (Paul Chris­ti­ano): I pre­vi­ously sum­ma­rized this in Re­con #5, but I’ll con­sider it in more de­tail now. The post ar­gues that we could get a very pow­er­ful AI sys­tem us­ing model-based RL with MCTS. Speci­fi­cally, we learn a gen­er­a­tive model of dy­nam­ics (sam­ple a se­quence of ob­ser­va­tions given ac­tions), a re­ward model, and a policy. The policy is trained us­ing MCTS, which uses the dy­nam­ics model and re­ward model to cre­ate and score rol­louts. The dy­nam­ics model is trained us­ing the ac­tual ob­ser­va­tions and ac­tions from the en­vi­ron­ment. The re­ward is trained us­ing prefer­ences or rank­ings (think some­thing like Deep RL from Hu­man Prefer­ences). This is a sys­tem we could pro­gram now, and with suffi­ciently pow­er­ful neu­ral nets, it could out­perform hu­mans.

How­ever, this sys­tem would not be al­igned. There could be speci­fi­ca­tion failures: the AI sys­tem would be op­ti­miz­ing for mak­ing hu­mans think that good out­comes are hap­pen­ing, which may or may not hap­pen by ac­tu­ally hav­ing good out­comes. (There are a few ar­gu­ments sug­gest­ing that this is likely to hap­pen.) There could also be ro­bust­ness failures: as the AI ex­erts more con­trol over the en­vi­ron­ment, there is a dis­tri­bu­tional shift. This may lead to the MCTS find­ing pre­vi­ously un­ex­plored states where the re­ward model ac­ci­den­tally as­signs high re­ward, even though it would be a bad out­come, caus­ing a failure. This may push the en­vi­ron­ment even more out of dis­tri­bu­tion, trig­ger­ing other AI sys­tems to fail as well.

Paul uses this and other potential AI algorithms as benchmarks to beat: we need to build aligned AI algorithms that achieve results similar to these benchmarks. The further we are from hitting the same metrics, the larger the incentive to use the unaligned AI algorithm.

Iter­ated am­plifi­ca­tion could po­ten­tially solve the is­sues with this al­gorithm. The key idea is to always be able to cash out the learned dy­nam­ics and re­ward mod­els as the re­sult of (a large num­ber of) hu­man de­ci­sions. In ad­di­tion, the mod­els need to be made ro­bust to worst case in­puts, pos­si­bly by us­ing these tech­niques. In or­der to make this work, we need to make progress on ro­bust­ness, am­plifi­ca­tion, and an un­der­stand­ing of what bad be­hav­ior is (so that we can ar­gue that it is easy to avoid, and iter­ated am­plifi­ca­tion does avoid it).

Ro­hin’s opinion: I of­ten think that the hard part of AI al­ign­ment is ac­tu­ally the strate­gic side of it—even if we figure out how to build an al­igned AI sys­tem, it doesn’t help us un­less the ac­tors who ac­tu­ally build pow­er­ful AI sys­tems use our pro­posal. From that per­spec­tive, it’s very im­por­tant for any al­igned sys­tems we build to be com­pet­i­tive with un­al­igned ones, and so keep­ing these sorts of bench­marks in mind seems like a re­ally good idea. This par­tic­u­lar bench­mark seems good—it’s es­sen­tially the AlphaGo al­gorithm, ex­cept with learned dy­nam­ics (since we don’t know the dy­nam­ics of the real world) and re­wards (since we want to be able to spec­ify ar­bi­trary tasks), which seems like a good con­tender for “pow­er­ful AI sys­tem”.

Fixed point sequence

Fixed Point Ex­er­cises (Scott Garrabrant): Scott’s ad­vice to peo­ple who want to learn math in or­der to work on agent foun­da­tions is to learn all of the fixed-point the­o­rems across the differ­ent ar­eas of math. This se­quence will pre­sent a se­ries of ex­er­cises de­signed to teach fixed-point the­o­rems, and will then talk about core ideas in the the­o­rems and how the the­o­rems re­late to al­ign­ment re­search.

Rohin’s opinion: I’m not an expert on agent foundations, so I don’t have an opinion worth sharing here. I’m not going to cover the posts with exercises in the newsletter; visit the Alignment Forum for that. I probably will cover the posts about how the theorems relate to agent foundations research.

Agent foundations

Di­men­sional re­gret with­out re­sets (Vadim Kosoy)

Learn­ing hu­man intent

Re­ward learn­ing from hu­man prefer­ences and demon­stra­tions in Atari (Borja Ibarz et al): Sum­ma­rized in the high­lights!

Ac­knowl­edg­ing Hu­man Prefer­ence Types to Sup­port Value Learn­ing (Nandi, Sab­rina, and Erin): Hu­mans of­ten have mul­ti­ple “types” of prefer­ences, which any value learn­ing al­gorithm will need to han­dle. This post con­cen­trates on one par­tic­u­lar frame­work—lik­ing, want­ing and ap­prov­ing. Lik­ing cor­re­sponds to the ex­pe­rience of plea­sure, want­ing cor­re­sponds to the mo­ti­va­tion that causes you to take ac­tion, and ap­prov­ing cor­re­sponds to your con­scious eval­u­a­tion of how good the par­tic­u­lar ac­tion is. Th­ese cor­re­spond to differ­ent data sources, such as fa­cial ex­pres­sions, demon­stra­tions, and rank­ings re­spec­tively. Now sup­pose we ex­tract three differ­ent re­ward func­tions and need to use them to choose ac­tions—how should we ag­gre­gate the re­ward func­tions? They choose some desider­ata on the ag­gre­ga­tion mechanism, in­spired by so­cial choice the­ory, and de­velop a few ag­gre­ga­tion rules that meet some of the desider­ata.

Ro­hin’s opinion: I’m ex­cited to see work on deal­ing with con­flict­ing prefer­ence in­for­ma­tion, par­tic­u­larly from mul­ti­ple data sources. To my knowl­edge, there isn’t any work on this—while there is work on mul­ti­modal in­put, usu­ally those in­puts don’t con­flict, whereas this post ex­plic­itly has sev­eral ex­am­ples of con­flict­ing prefer­ences, which seems like an im­por­tant prob­lem to solve. How­ever, I would aim for a solu­tion that is less fixed (i.e. not one spe­cific ag­gre­ga­tion rule), for ex­am­ple by an ac­tive ap­proach that pre­sents the con­flict to the hu­man and asks how it should be re­solved, and learn­ing an ag­gre­ga­tion rule based on that. I’d be sur­prised if we ended up us­ing a par­tic­u­lar math­e­mat­i­cal equa­tion pre­sented here as an ag­gre­ga­tion mechanism—I’m much more in­ter­ested in what prob­lems arise when we try to ag­gre­gate things, what crite­ria we might want to satisfy, etc.
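To illustrate why the choice of aggregation rule matters, here is a small sketch with three reward functions over the same actions, combined by two simple rules: a sum of min-max normalized scores, and a plurality vote over each source's favourite action. Both rules, the action names, and the numbers are hypothetical illustrations and are not the rules developed in the post.

```python
def normalize(scores):
    """Rescale one source's scores to [0, 1] so no source dominates by scale."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {action: 0.0 for action in scores}
    return {action: (s - lo) / (hi - lo) for action, s in scores.items()}

def aggregate_by_sum(sources):
    """Pick the action with the highest total normalized score."""
    normalized = [normalize(scores) for scores in sources.values()]
    actions = list(next(iter(sources.values())))
    totals = {a: sum(n[a] for n in normalized) for a in actions}
    return max(totals, key=totals.get)

def aggregate_by_vote(sources):
    """Pick the action that is the favourite of the most sources."""
    votes = {}
    for scores in sources.values():
        favourite = max(scores, key=scores.get)
        votes[favourite] = votes.get(favourite, 0) + 1
    return max(votes, key=votes.get)

# Made-up reward functions, one per preference type.
sources = {
    "liking":    {"cake": 9.0, "salad": 6.0, "gym": 1.0},
    "wanting":   {"cake": 8.0, "salad": 6.0, "gym": 2.0},
    "approving": {"cake": 1.0, "salad": 8.0, "gym": 9.0},
}

choice_by_sum = aggregate_by_sum(sources)
choice_by_vote = aggregate_by_vote(sources)
```

On this toy data the two rules disagree: the vote picks the action that two sources rank first, while the normalized sum picks the compromise action that every source rates reasonably well. That kind of disagreement is the reason to think about desiderata rather than committing to one formula.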


Towards Govern­ing Agent’s Effi­cacy: Ac­tion-Con­di­tional β-VAE for Deep Trans­par­ent Re­in­force­ment Learn­ing (John Yang et al)


Evaluating Robustness of Neural Networks with Mixed Integer Programming (Anonymous): I’ve only read the abstract so far, but this paper claims to find the exact adversarial accuracy of an MNIST classifier within an L∞ norm ball of radius 0.1, which would be a big step forward in the state of the art for verification.
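For readers who want the quantity spelled out, one common way to write exact adversarial accuracy is below. The notation is an assumption on my part rather than the paper's: f is the classifier, (x_i, y_i) are the n test points, and ε is the radius (0.1 here).

```latex
\mathrm{AdvAcc}_{\epsilon}(f)
  = \frac{1}{n} \sum_{i=1}^{n}
    \mathbf{1}\!\left[\, f(x') = y_i
      \ \text{ for all } x' \text{ with } \|x' - x_i\|_{\infty} \le \epsilon \,\right]
```

A test point counts as correct only if every input in the ball around it is classified correctly, which is why computing this exactly is a verification problem and not just a matter of running an attack.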

On a For­mal Model of Safe and Scal­able Self-driv­ing Cars (Shai Shalev-Shwartz et al)


ImageNet-trained CNNs are bi­ased to­wards tex­ture; in­creas­ing shape bias im­proves ac­cu­racy and ro­bust­ness (Anony­mous) (sum­ma­rized by Dan H): This pa­per em­piri­cally demon­strates the out­sized in­fluence of tex­tures in clas­sifi­ca­tion. To ad­dress this, they ap­ply style trans­fer to ImageNet images and train with this dataset. Although train­ing net­works on a spe­cific cor­rup­tion tends to provide ro­bust­ness only to that spe­cific cor­rup­tion, stylized ImageNet images sup­pos­edly lead to gen­er­al­iza­tion to new cor­rup­tion types such as uniform noise and high-pass filters (but not blurs).

Learn­ing Ro­bust Rep­re­sen­ta­tions by Pro­ject­ing Su­perfi­cial Statis­tics Out (Anony­mous)

AI strat­egy and policy

AI development incentive gradients are not uniformly terrible (rk): This post considers a model of AI development somewhat similar to the one in the Racing to the precipice paper. It notes that under this model, assuming perfect information, the utility curves for each player are discontinuous. Specifically, the models predict deterministically that the player that spent the most on something (typically AI capabilities) is the one that “wins” the race (i.e. builds AGI), and so there is a discontinuity at the point where the players are spending equal amounts of money. This results in players fighting as hard as possible to be on the right side of the discontinuity, which suggests that they will skimp on safety. However, in practice, there will be some uncertainty about which player wins, even if you know exactly how much each is spending, and this removes the discontinuity. The resulting model predicts more investment in safety, since buying expected utility through safety now looks better than increasing the probability of winning the race (whereas before, it was compared against changing from definitely losing the race to definitely winning the race).

Rohin’s opinion: The model in Racing to the precipice had the unintuitive conclusion that if teams have more information (i.e. they know their own or others’ capabilities), then we become less safe, which puzzled me for a while. Their explanation is that with maximal information, the top team takes as much risk as necessary in order to guarantee that they beat the second team, which can be quite a lot of risk if the two teams are close. While this is true, the explanation from this post is more satisfying: since the model has a discontinuity that rewards taking on risk, anything that removes the discontinuity and makes it more continuous will likely improve the prospects for safety, such as not having full information. I claim that in reality these discontinuities mostly don’t exist, since (1) we’re uncertain about who will win and (2) we will probably have a multipolar scenario where even if you aren’t first-to-market you can still capture a lot of value. This suggests that it likely isn’t a problem for teams to have more information about each other on the margin.

That said, these mod­els are still very sim­plis­tic, and I mainly try to de­rive qual­i­ta­tive con­clu­sions from them that my in­tu­ition agrees with in hind­sight.

Prerequisites: Racing to the precipice: a model of artificial intelligence development
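The discontinuity argument can be seen in a few lines. The sketch below compares how much a small extra spend on capabilities raises your chance of winning under a deterministic "biggest spender wins" rule versus a noisy rule. The logistic form of the noisy rule and the parameter names are my own assumptions for illustration; the post itself may use a different functional form.

```python
import math

def win_probability_deterministic(my_spend, rival_spend):
    """Perfect information: whoever spends more wins for certain."""
    if my_spend > rival_spend:
        return 1.0
    if my_spend < rival_spend:
        return 0.0
    return 0.5

def win_probability_noisy(my_spend, rival_spend, noise=1.0):
    """Assumed logistic model: a larger `noise` means a less predictable winner."""
    return 1.0 / (1.0 + math.exp(-(my_spend - rival_spend) / noise))

def marginal_gain(win_probability, my_spend, rival_spend, step=0.01):
    """Change in win probability from moving `step` below to `step` above my_spend."""
    return (win_probability(my_spend + step, rival_spend)
            - win_probability(my_spend - step, rival_spend))

# Two players spending the same amount, right at the discontinuity.
gain_deterministic = marginal_gain(win_probability_deterministic, 1.0, 1.0)
gain_noisy = marginal_gain(win_probability_noisy, 1.0, 1.0)
```

At equal spending, a tiny increase flips the deterministic outcome from certain loss to certain win, while under the noisy rule the same increase barely moves the win probability. In the second case, spending that marginal unit on safety is much easier to justify.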

Other progress in AI

Re­in­force­ment learning

Learn­ing La­tent Dy­nam­ics for Plan­ning from Pix­els (Dani­jar Hafner et al) (sum­ma­rized by Richard): The au­thors in­tro­duce PlaNet, an agent that learns an en­vi­ron­ment’s dy­nam­ics from pix­els and then chooses ac­tions by plan­ning in la­tent space. At each step, it searches for the best ac­tion se­quence un­der its Re­cur­rent State Space dy­nam­ics model, then ex­e­cutes the first ac­tion and re­plans. The au­thors note that hav­ing a model with both de­ter­minis­tic and stochas­tic tran­si­tions is crit­i­cal to learn­ing a good policy. They also use a tech­nique called vari­a­tional over­shoot­ing to train the model on multi-step pre­dic­tions, by gen­er­al­is­ing the stan­dard vari­a­tional bound for one-step pre­dic­tions. PlaNet ap­proaches the perfor­mance of top model-free al­gorithms even when trained on 50x fewer epi­sodes.

Richard’s opinion: This pa­per seems like a step for­ward in ad­dress­ing the in­sta­bil­ity of us­ing learned mod­els in RL. How­ever, the ex­tent to which it’s in­tro­duc­ing new con­tri­bu­tions, as op­posed to com­bin­ing ex­ist­ing ideas, is a lit­tle un­clear.
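The "plan, execute the first action, replan" loop described above can be sketched as follows. This is a toy illustration under loud assumptions: the dynamics and reward functions are hand-written stand-ins for learned models, the state is a single number and not a latent vector, and the search is plain random shooting used as a stand-in for the paper's search procedure. All names are hypothetical.

```python
import random

def model_step(state, action):
    """Hypothetical stand-in for a learned dynamics model: a 1-D point that moves."""
    return state + action

def model_reward(state):
    """Hypothetical stand-in for a learned reward model: stay close to 0."""
    return -abs(state)

def plan(state, rng, horizon=5, candidates=200):
    """Random-shooting search: return the first action of the best sampled sequence."""
    best_return, best_first_action = float("-inf"), 0.0
    for _ in range(candidates):
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        s, total = state, 0.0
        for a in actions:
            s = model_step(s, a)
            total += model_reward(s)
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action

def run_episode(start_state, steps=10, seed=0):
    """Plan a full sequence at every step, but execute only its first action."""
    rng = random.Random(seed)
    state = start_state
    for _ in range(steps):
        action = plan(state, rng)
        # For simplicity the "real" environment here is the same toy model.
        state = model_step(state, action)
    return state

final_state = run_episode(start_state=3.0)
```

Replanning at every step means errors in any one plan only affect a single action, which is part of why this style of control can tolerate an imperfect model.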

Mo­du­lar Ar­chi­tec­ture for StarCraft II with Deep Re­in­force­ment Learn­ing (Den­nis Lee, Hao­ran Tang et al)

Deep learning

Ap­prox­i­mat­ing CNNs with Bag-of-lo­cal-Fea­tures mod­els works sur­pris­ingly well on ImageNet (Anony­mous) (sum­ma­rized by Dan H): This pa­per pro­poses a bag-of-fea­tures model us­ing patches as fea­tures, and they show that this can ob­tain ac­cu­racy similar to VGGNet ar­chi­tec­tures. They clas­sify each patch and pro­duce the fi­nal clas­sifi­ca­tion by a ma­jor­ity vote; Figure 1 of the pa­per tells all. In some ways this model is more in­ter­pretable than other deep ar­chi­tec­tures, as it is clear which re­gions ac­ti­vated which class. They at­tempt to show that, like their model, VGGNet does not use global shape in­for­ma­tion but in­stead uses lo­cal­ized fea­tures.
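As described in the summary, the classification step amounts to classifying patches independently and voting. Here is a minimal sketch of that idea on a tiny grid of brightness values. The patch classifier is a hypothetical threshold rule standing in for the paper's learned network, and the non-overlapping patches are a simplification I chose for brevity.

```python
from collections import Counter

def extract_patches(image, patch_size):
    """Split a 2-D grid (list of rows) into non-overlapping square patches."""
    patches = []
    for top in range(0, len(image) - patch_size + 1, patch_size):
        for left in range(0, len(image[0]) - patch_size + 1, patch_size):
            patch = [row[left:left + patch_size]
                     for row in image[top:top + patch_size]]
            patches.append(patch)
    return patches

def classify_patch(patch):
    """Hypothetical local classifier: label a patch by its mean brightness."""
    pixels = [p for row in patch for p in row]
    return "bright" if sum(pixels) / len(pixels) > 0.5 else "dark"

def classify_image(image, patch_size=2):
    """Classify every patch on its own, then take a majority vote."""
    votes = Counter(classify_patch(p) for p in extract_patches(image, patch_size))
    label, _ = votes.most_common(1)[0]
    return label, votes

# Made-up 4x4 "image": three bright quadrants and one dark quadrant.
image = [
    [0.9, 0.8, 0.1, 0.2],
    [0.7, 0.9, 0.0, 0.1],
    [0.8, 0.9, 0.9, 0.7],
    [0.6, 0.8, 0.8, 0.9],
]
label, votes = classify_image(image)
```

Because each vote comes from a known patch, you can read off which regions supported which class, which is the interpretability benefit mentioned above. Nothing in the vote depends on where the patches sit relative to each other, so global shape plays no role.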

Ma­chine learning

For­mal Limi­ta­tions on The Mea­sure­ment of Mu­tual In­for­ma­tion (David McAllester and Karl Stratos)