Alignment Newsletter #29


Deep Imi­ta­tive Models for Flex­ible In­fer­ence, Plan­ning, and Con­trol (Ni­cholas Rhine­hart et al): It’s hard to ap­ply deep RL tech­niques to au­tonomous driv­ing, be­cause we can’t sim­ply col­lect a large amount of ex­pe­rience with col­li­sions in or­der to learn. How­ever, imi­ta­tion learn­ing is also hard, be­cause as soon as your car de­vi­ates from the ex­pert tra­jec­to­ries that you are imi­tat­ing, you are out of dis­tri­bu­tion, and you could make more mis­takes, lead­ing to ac­cu­mu­lat­ing er­rors un­til you crash. In­stead, we can model the ex­pert’s be­hav­ior, so that we can tell when we are mov­ing out of dis­tri­bu­tion, and take cor­rec­tive ac­tion.

They split up the prob­lem into three differ­ent stages. First, they gen­er­ate a set of way­points along the path to be fol­lowed, which are about 20m away from each other, by us­ing A* search on a map. Next, they use model-based plan­ning us­ing an imi­ta­tive model to gen­er­ate a plan (se­quence of states) that would take the car to the next way­point. Fi­nally, they use a sim­ple PID con­trol­ler to choose low-level ac­tions that keep the car on tar­get to­wards the next state in the plan.

The key technical contribution is the imitative model, which is a probabilistic model P(s_{1:T}, G, φ), where φ is the current observation (e.g. LIDAR), s_{1:T} is the planned trajectory, and G is a goal. We can learn P(s_{1:T} | φ) from expert demonstrations. The goal G can be anything for which you can write down a specification P(G | s_{1:T}, φ). For example, if you simply want to reach a waypoint, you can use the normal distribution on the distance between the final state s_T and the waypoint. You can also incorporate a hand-designed cost on each state. Planning then amounts to finding the trajectory that maximizes the posterior P(s_{1:T} | G, φ), which by Bayes’ rule is proportional to P(s_{1:T} | φ) P(G | s_{1:T}, φ).
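To make the planning objective concrete, here is a toy sketch of my own (not the paper's code). It plans a 1D trajectory by gradient ascent on log P(s_{1:T} | φ) + log P(G | s_{1:T}, φ). The hand-written Gaussian "expert prior" (the expert advances about one unit per step) is an assumption standing in for the learned deep imitative model, and all function names and constants are hypothetical.

```python
def log_prior(traj, start, step=1.0, sigma=0.5):
    """Stand-in for log P(s_{1:T} | phi): the 'expert' advances ~`step` per timestep."""
    total, prev = 0.0, start
    for s in traj:
        total += -((s - prev - step) ** 2) / (2 * sigma ** 2)
        prev = s
    return total


def log_goal(traj, waypoint, sigma_goal=0.5):
    """log P(G | s_{1:T}, phi): normal distribution on distance from s_T to the waypoint."""
    return -((traj[-1] - waypoint) ** 2) / (2 * sigma_goal ** 2)


def objective(traj, start, waypoint):
    """Unnormalized log posterior over trajectories."""
    return log_prior(traj, start) + log_goal(traj, waypoint)


def plan(start, waypoint, horizon=5, iters=3000, lr=0.01, eps=1e-5):
    """Gradient ascent on the log posterior, using central finite differences."""
    traj = [start] * horizon
    for _ in range(iters):
        grad = []
        for i in range(horizon):
            up = traj[:i] + [traj[i] + eps] + traj[i + 1:]
            down = traj[:i] + [traj[i] - eps] + traj[i + 1:]
            diff = objective(up, start, waypoint) - objective(down, start, waypoint)
            grad.append(diff / (2 * eps))
        traj = [s + lr * g for s, g in zip(traj, grad)]
    return traj


if __name__ == "__main__":
    # A waypoint the expert would not quite reach in 5 steps: the plan compromises.
    print(plan(start=0.0, waypoint=8.0))
```

With the waypoint at 8, pure imitation would end near 5 and pure goal-seeking at 8. The plan ends in between, which is the sense in which the imitation term keeps the car close to expert-like behavior while the goal term steers it.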

They eval­u­ate in simu­la­tion on a static world (so no pedes­tri­ans, for ex­am­ple). They show de­cent trans­fer from one map to a sec­ond map, and also that they can avoid ar­tifi­cially in­tro­duced pot­holes at test time (de­spite not see­ing them at train­ing time), sim­ply by adding a cost on states over a pot­hole (which they can take into ac­count be­cause they are perform­ing model-based plan­ning).

Rohin’s opinion: I really like this paper: it showcases the benefits of both model-based planning and imitation learning. Since the problem has been decomposed into a predictive model, a goal G, and a planner, we can edit G directly to get new behavior at test time without any retraining (as they demonstrate with the pothole experiment). At the same time, they can get away with not specifying a full reward function, as many features of good driving, like passenger comfort and staying in the correct lane, are learned simply by imitating an expert.

That said, they ini­tially state that one of their goals is to learn from offline data, even though offline data typ­i­cally has no ex­am­ples of crashes, and “A model ig­no­rant to the pos­si­bil­ity of a crash can­not know how to pre­vent it”. I think the idea is that you never get into a situ­a­tion where you could get in a crash, be­cause you never de­vi­ate from ex­pert be­hav­ior since that would have low P(s_{1:T} | φ). This is bet­ter than model-based plan­ning on offline data, which would con­sider ac­tions that lead to a crash and have no idea what would hap­pen, out­putting garbage. How­ever, it still seems that situ­a­tions could arise where a crash is im­mi­nent, which don’t arise much (if at all) in the train­ing data, and the car fails to swerve or brake hard, be­cause it hasn’t seen enough data.

In­ter­pretabil­ity and Post-Ra­tion­al­iza­tion (Vin­cent Van­houcke): Neu­ro­science sug­gests that most ex­pla­na­tions that we hu­mans give for a de­ci­sion are post-hoc ra­tio­nal­iza­tions, and don’t re­flect the messy un­der­ly­ing true rea­sons for the de­ci­sion. It turns out that de­ci­sion mak­ing, per­cep­tion, and all the other tasks we’re hop­ing to out­source to neu­ral nets are in­her­ently com­plex and difficult, and are not amenable to easy ex­pla­na­tion. We can aim for “from-with­out” ex­pla­na­tions, which post-hoc ra­tio­nal­ize the de­ci­sions a neu­ral net makes, but “from-within” ex­pla­na­tions, which aim for a mechanis­tic un­der­stand­ing, are in­tractable. We could try to de­sign mod­els that are more in­ter­pretable (in the “from-within” sense), but this would lead to worse perfor­mance on the ac­tual task, which would hurt ev­ery­one, in­clud­ing the peo­ple call­ing for more ac­countabil­ity.

Rohin’s opinion: I take a pretty different view from this post. I’ve highlighted it because I think this is an important disagreement that’s relevant for alignment. In particular, it’s not clear to me that “from-within” interpretability is doomed. While I agree that humans basically only do “from-without” rationalizations, we also aren’t able to inspect a human brain in the same way that we can inspect a neural net. For example, we can’t see the output of each individual neuron, we can’t tell what input each neuron would respond maximally to, and we can’t pose counterfactuals with slightly different inputs to see what changes. In fact, I think that “from-within” interpretability techniques, such as Building Blocks of Interpretability, have already seen successes in identifying biases that image classifiers suffer from, that we wouldn’t have known about otherwise.

We could also consider whether post-hoc rationalization is sufficient for alignment. Consider a thought experiment where a superintelligent AI is about to take a treacherous turn, but there is an explainer AI system that post-hoc rationalizes the output of the AI that could warn us in advance. If the explainer AI only gets access to the output of the superintelligent AI, I’m very worried: it seems way too easy to come up with some arbitrary rationalization for an action that makes it seem good, so you’d need a much more powerful explainer AI to have a hope. On the other hand, if the explainer AI gets access to all of the weights and activations that led to the output, it seems more likely that this could work. As an analogy, I think a teenager could tell if I was going to betray them, if they could constantly eavesdrop on my thoughts.

Tech­ni­cal AI alignment

Learn­ing hu­man intent

Deep Imi­ta­tive Models for Flex­ible In­fer­ence, Plan­ning, and Con­trol (Ni­cholas Rhine­hart et al): Sum­ma­rized in the high­lights!

Addressing Sample Inefficiency and Reward Bias in Inverse Reinforcement Learning (Ilya Kostrikov et al): Deep IRL algorithms typically work by training a discriminator that distinguishes states and actions from the expert from those of the learned policy, and extracting a reward function from the discriminator. In any environment where the episode can end after a variable number of timesteps, this assumes that the reward is zero after the episode ends. The reward function from the discriminator often takes a form where it must always be positive, inducing a survival incentive, or a form where it must always be negative, inducing a living cost. For example, GAIL’s reward is always positive, giving a survival incentive. As a result, without any reward learning at all GAIL does better on Hopper than behavioral cloning, and fails to learn on a reaching or pushing task (where you want to do the task as quickly as possible, so you want the living cost). To solve this, they learn an “absorbing state reward”, which is a reward given after the episode ends. This allows the algorithm to learn for itself whether it should have a survival incentive or living cost.
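A small numerical illustration of the bias may help. This is a sketch with made-up numbers, not the paper's setup: `episode_return` is a hypothetical helper that pays a constant per-step reward until the episode terminates, and then an absorbing-state reward for the rest of a fixed horizon.

```python
def episode_return(length, per_step_reward, absorbing_reward=0.0,
                   horizon=100, gamma=0.99):
    """Discounted return when the episode ends after `length` steps and the agent
    then sits in an absorbing state until `horizon`."""
    total = 0.0
    for t in range(horizon):
        reward = per_step_reward if t < length else absorbing_reward
        total += (gamma ** t) * reward
    return total


if __name__ == "__main__":
    # Always-positive reward, zero after termination: surviving longer is better,
    # whatever the task actually is.
    print(episode_return(50, 1.0) > episode_return(10, 1.0))
    # Always-negative reward, zero after termination: ending sooner is better.
    print(episode_return(10, -1.0) > episode_return(50, -1.0))
    # With a non-zero absorbing reward the sign of the incentive is no longer
    # fixed by the form of the reward function.
    print(episode_return(10, 1.0, absorbing_reward=2.0)
          > episode_return(50, 1.0, absorbing_reward=2.0))
```

All three comparisons print True. The third shows how a learned absorbing reward can turn a positive-only reward into an incentive to finish quickly.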

They also in­tro­duce a ver­sion that keeps a re­play buffer of ex­pe­rience and uses an off-policy al­gorithm to learn from the re­play buffer in or­der to im­prove sam­ple effi­ciency.

Ro­hin’s opinion: The key in­sight that re­wards are not in­var­i­ant to ad­di­tions of a con­stant when you have vari­able-length epi­sodes is use­ful and I’m glad that it’s been pointed out, and a solu­tion pro­posed. How­ever, the ex­per­i­ments are re­ally strange—in one case (Figure 4, HalfChee­tah) their al­gorithm out­performs the ex­pert (which has ac­cess to the true re­ward), and in an­other (Figure 5, right) the blue line im­plies that us­ing a uniformly zero re­ward lets you achieve around a third of ex­pert perfor­mance (!!).


In­ter­pretabil­ity and Post-Ra­tion­al­iza­tion (Vin­cent Van­houcke): Sum­ma­rized in the high­lights!

San­ity Checks for Saliency Maps (Julius Ade­bayo et al)

Ad­ver­sar­ial examples

Spa­tially Trans­formed Ad­ver­sar­ial Ex­am­ples (Chaowei Xiao et al) (sum­ma­rized by Dan H): Many ad­ver­sar­ial at­tacks per­turb pixel val­ues, but the at­tack in this pa­per per­turbs the pixel lo­ca­tions in­stead. This is ac­com­plished with a smooth image de­for­ma­tion which has sub­tle effects for large images. For MNIST images, how­ever, the at­tack is more ob­vi­ous and not nec­es­sar­ily con­tent-pre­serv­ing (see Figure 2 of the pa­per).

Characterizing Adversarial Examples Based on Spatial Consistency Information for Semantic Segmentation (Chaowei Xiao et al) (summarized by Dan H): This paper considers adversarial attacks on segmentation systems. They find that segmentation systems behave inconsistently on adversarial images, and they use this inconsistency to detect adversarial inputs. Specifically, they take overlapping crops of the image and segment each crop. For overlapping crops of an adversarial image, they find that the segmentations are more inconsistent. They defend against one adaptive attack.
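As a rough sketch of the consistency check (my own simplification, not the paper's code): given the label maps a segmenter produced for two overlapping crops, measure how often they agree on the shared pixels. Plain pixel agreement is used here as an assumption for simplicity, and the label maps below are hand-written stand-ins for real segmenter outputs.

```python
def overlap_agreement(seg_a, origin_a, seg_b, origin_b):
    """Fraction of pixels in the overlap of two crops that receive the same label.

    seg_*: 2D list of class labels for one crop.
    origin_*: (row, col) of the crop's top-left corner in the full image.
    """
    ra, ca = origin_a
    rb, cb = origin_b
    top, left = max(ra, rb), max(ca, cb)
    bottom = min(ra + len(seg_a), rb + len(seg_b))
    right = min(ca + len(seg_a[0]), cb + len(seg_b[0]))
    if bottom <= top or right <= left:
        raise ValueError("crops do not overlap")
    agree = total = 0
    for r in range(top, bottom):
        for c in range(left, right):
            total += 1
            agree += seg_a[r - ra][c - ca] == seg_b[r - rb][c - cb]
    return agree / total


# Crop A covers image columns 0-2, crop B covers columns 1-3 (same rows).
crop_a = [[0, 1, 1], [0, 1, 1], [0, 1, 1]]
crop_b_clean = [[1, 1, 2], [1, 1, 2], [1, 1, 2]]  # agrees with A on the overlap
crop_b_adv = [[1, 0, 2], [2, 1, 2], [1, 0, 2]]    # disagrees on half of it

if __name__ == "__main__":
    print(overlap_agreement(crop_a, (0, 0), crop_b_clean, (0, 1)))
    print(overlap_agreement(crop_a, (0, 0), crop_b_adv, (0, 1)))
```

A detector along these lines would flag an input when the agreement across many random crop pairs falls below some threshold chosen on clean data.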


On Cal­ibra­tion of Modern Neu­ral Net­works (Chuan Guo et al.) (sum­ma­rized by Dan H): Models should not be un­duly con­fi­dent, es­pe­cially when said con­fi­dence is used for de­ci­sion mak­ing or down­stream tasks. This work pro­vides a sim­ple method to make mod­els more cal­ibrated so that the con­fi­dence es­ti­mates are closer to the true cor­rect­ness like­li­hood. (For ex­am­ple, if a cal­ibrated model pre­dicts “tou­can” with 60% con­fi­dence, then 60% of the time the in­put was ac­tu­ally a tou­can.) Be­fore pre­sent­ing their method, they ob­serve that batch nor­mal­iza­tion can make mod­els less cal­ibrated, while un­usu­ally large weight de­cay reg­u­lariza­tion can in­crease cal­ibra­tion. How­ever, their pro­posed ap­proach to in­crease cal­ibra­tion does not im­pact ac­cu­racy or re­quire sub­stan­tive model changes. They sim­ply ad­just the tem­per­a­ture of the soft­max to make the model’s “con­fi­dence” (here the max­i­mum soft­max prob­a­bil­ity) more cal­ibrated. Speci­fi­cally, af­ter train­ing they tune the soft­max tem­per­a­ture to min­i­mize the cross en­tropy (nega­tive av­er­age log-like­li­hood) on val­i­da­tion data. They then mea­sure model cal­ibra­tion with a mea­sure which is re­lated to the Brier score, but with ab­solute val­ues rather than squares.
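Here is a minimal sketch of temperature scaling in pure Python, under some assumptions: it finds the temperature by a simple grid search over validation negative log-likelihood, and the `expected_calibration_error` helper uses equal-width confidence bins. The function names and toy logits are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logits_list, labels, temperature):
    """Average negative log-likelihood (cross entropy) at a given temperature."""
    losses = [-math.log(softmax(l, temperature)[y])
              for l, y in zip(logits_list, labels)]
    return sum(losses) / len(losses)


def fit_temperature(logits_list, labels, grid=None):
    """Pick the temperature on a grid (0.1 to 10.0) that minimizes validation NLL."""
    grid = grid or [0.1 * k for k in range(1, 101)]
    return min(grid, key=lambda t: nll(logits_list, labels, t))


def expected_calibration_error(logits_list, labels, temperature=1.0, n_bins=10):
    """Weighted average of |accuracy - confidence| over equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for l, y in zip(logits_list, labels):
        probs = softmax(l, temperature)
        conf = max(probs)
        correct = 1.0 if probs.index(conf) == y else 0.0
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, correct))
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(a for _, a in b) / len(b)
            ece += len(b) / len(labels) * abs(avg_conf - acc)
    return ece


# Toy validation set: an overconfident model (~98% confident) that is right 70% of the time.
val_logits = [[4.0, 0.0]] * 10
val_labels = [0] * 7 + [1] * 3

if __name__ == "__main__":
    t = fit_temperature(val_logits, val_labels)
    print("temperature:", t)
    print("ECE before:", expected_calibration_error(val_logits, val_labels, 1.0))
    print("ECE after:", expected_calibration_error(val_logits, val_labels, t))
```

Dividing the logits by a positive temperature never changes the argmax, which is why accuracy is unaffected. Only the confidence attached to each prediction moves.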

Dan H’s opinion: Previous calibration work in machine learning conferences would often focus on calibrating regression models, but this work has renewed interest in calibrating classifiers. For that reason I rate this paper highly. That said, this paper’s evaluation measure, the “Expected Calibration Error”, is not a proper scoring rule, so optimizing it does not necessarily lead to calibration. In their approximation of the ECE, they use equally-wide bins when there is reason to use adaptively sized bins. Consequently I think Nguyen and O’Connor Sections 2 and 3 provide a better calibration explanation, better calibration measure, and better estimation procedure. Guo et al. also suggest using a convex optimization library to find the softmax temperature, but at least libraries such as CVXPY require far more time and memory than a simple softmax temperature grid search. Finally, an understandable limitation of this work is that it assumes test-time inputs are in-distribution, but when inputs are out-of-distribution this method hardly improves calibration.

Mis­cel­la­neous (Align­ment)

AI Align­ment Pod­cast: On Be­com­ing a Mo­ral Real­ist with Peter Singer (Peter Singer and Lu­cas Perry): There’s a fair amount of com­plex­ity in this pod­cast, and I’m not an ex­pert on moral philos­o­phy, but here’s an over­sim­plified sum­mary any­way. First, in the same way that we can reach math­e­mat­i­cal truths through rea­son, we can also ar­rive at moral truths through rea­son, which sug­gests that they are true facts about the uni­verse (a moral re­al­ist view). Se­cond, prefer­ence util­i­tar­i­anism has the prob­lem of figur­ing out which prefer­ences you want to re­spect, which isn’t a prob­lem with he­do­nic util­i­tar­i­anism. Be­fore and af­ter the in­ter­view, Lu­cas ar­gues that moral philos­o­phy is im­por­tant for AI al­ign­ment. Any strate­gic re­search “smug­gles” in some val­ues, and many tech­ni­cal safety prob­lems, such as prefer­ence ag­gre­ga­tion, would benefit from a knowl­edge of moral philos­o­phy. Most im­por­tantly, given our cur­rent lack of con­sen­sus on moral philos­o­phy, we should be very wary of lock­ing in our val­ues when we build pow­er­ful AI.

Ro­hin’s opinion: I’m not con­vinced that we should be think­ing a lot more about moral philos­o­phy. While I agree that lock­ing in a set of val­ues would likely be quite bad, I think this means that re­searchers should not hard­code a set of val­ues, or cre­ate an AI that in­fers some val­ues and then can never change them. It’s not clear to me why study­ing more moral philos­o­phy helps us with this goal. For the other points, it seems not too im­por­tant to get prefer­ence ag­gre­ga­tion or par­tic­u­lar strate­gic ap­proaches ex­actly perfect as long as we don’t lock in val­ues—as an anal­ogy, we typ­i­cally don’t ar­gue that poli­ti­ci­ans should be ex­perts on moral philos­o­phy, even though they ag­gre­gate prefer­ences and have large im­pacts on so­ciety.

Near-term concerns

Fair­ness and bias

A new course to teach peo­ple about fair­ness in ma­chine learn­ing (San­ders Kle­in­feld): Google has added a short sec­tion on fair­ness to their Ma­chine Learn­ing Crash Course (MLCC).

Pri­vacy and security

Se­cure Deep Learn­ing Eng­ineer­ing: A Soft­ware Qual­ity As­surance Per­spec­tive (Lei Ma et al)

Other progress in AI

Re­in­force­ment learning

Open sourc­ing TRFL: a library of re­in­force­ment learn­ing build­ing blocks (Mat­teo Hes­sel et al) (sum­ma­rized by Richard): Deep­Mind is open-sourc­ing a Ten­sorflow library of “key al­gorith­mic com­po­nents” used in their RL agents. They hope that this will al­low less buggy RL code.

Richard’s opinion: This con­tinues the trend of be­ing able to eas­ily im­ple­ment deep learn­ing at higher and higher lev­els of ab­strac­tion. I’m look­ing for­ward to us­ing it.

CURIOUS: Intrinsically Motivated Multi-Task, Multi-Goal Reinforcement Learning (Cédric Colas et al) (summarized by Richard): This paper presents an intrinsically-motivated algorithm (an extension of Universal Value Function Approximators) which learns to complete multiple tasks, each parameterised by multiple “goals” (e.g. the locations of targets). It prioritises replays of tasks which are neither too easy nor too hard, but instead allow maximal learning progress; this also helps prevent catastrophic forgetting by refocusing on tasks which it begins to forget.
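To illustrate the prioritisation idea, here is a hedged sketch; the windowing, the epsilon mixing and all names are my assumptions rather than details taken from the paper. Tasks are sampled in proportion to the absolute change in recent success rate, so both improving and deteriorating tasks get attention, with a small uniform term so that no task is starved.

```python
def learning_progress(history, window=3):
    """Absolute change in mean success rate between the two most recent windows."""
    recent = history[-window:]
    older = history[-2 * window:-window]
    return abs(sum(recent) / len(recent) - sum(older) / len(older))


def task_probabilities(histories, epsilon=0.2):
    """Mix uniform sampling (weight epsilon) with sampling proportional to progress."""
    lps = {task: learning_progress(h) for task, h in histories.items()}
    total = sum(lps.values())
    n = len(lps)
    if total == 0:
        return {task: 1.0 / n for task in lps}
    return {task: epsilon / n + (1 - epsilon) * lp / total
            for task, lp in lps.items()}


# Hypothetical per-task success rates over the last six evaluation rounds.
histories = {
    "too_easy": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    "too_hard": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "improving": [0.2, 0.3, 0.4, 0.6, 0.7, 0.8],
    "forgetting": [0.9, 0.9, 0.9, 0.7, 0.6, 0.5],
}

if __name__ == "__main__":
    for task, p in task_probabilities(histories).items():
        print(task, round(p, 3))
```

Taking the absolute value is what produces the anti-forgetting behaviour: a task whose success rate is dropping has high progress in magnitude, so it is replayed more often.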

Richard’s opinion: While I don’t think this pa­per is par­tic­u­larly novel, it use­fully com­bines sev­eral ideas and pro­vides eas­ily-in­ter­pretable re­sults.

Deep learning

Discrim­i­na­tor Re­jec­tion Sam­pling (Sa­maneh Azadi et al): Un­der sim­plify­ing as­sump­tions, GAN train­ing should con­verge to the gen­er­a­tor mod­el­ling the true data dis­tri­bu­tion while the dis­crim­i­na­tor always out­puts 0.5. In prac­tice, at the end of train­ing the dis­crim­i­na­tor can still dis­t­in­guish be­tween images from the gen­er­a­tor and images from the dataset. This sug­gests that we can im­prove the gen­er­ated images by only choos­ing the ones that the dis­crim­i­na­tor thinks are from the dataset. How­ever, if we use a thresh­old (re­ject­ing all images where the dis­crim­i­na­tor is at least X% sure it comes from the gen­er­a­tor), then we no longer model the true un­der­ly­ing dis­tri­bu­tion, since some low prob­a­bil­ity images could never be gen­er­ated. They in­stead pro­pose a re­jec­tion sam­pling al­gorithm that still re­cov­ers the data dis­tri­bu­tion un­der strict as­sump­tions, and then re­lax those as­sump­tions to get a prac­ti­cal al­gorithm, and show that it im­proves perfor­mance.
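The idealized version of the idea can be sketched on a toy discrete distribution. This assumes a perfect discriminator, which is exactly the strict assumption the practical algorithm relaxes, and the three-item distributions are made up. For an optimal discriminator D, the ratio D / (1 - D) equals p_data / p_gen, so accepting each generated sample with probability proportional to that ratio recovers the data distribution.

```python
import random

P_GEN = [0.5, 0.3, 0.2]   # hypothetical generator distribution over 3 "images"
P_DATA = [0.2, 0.3, 0.5]  # hypothetical true data distribution


def discriminator(x):
    """Ideal discriminator: probability that x came from the dataset."""
    return P_DATA[x] / (P_DATA[x] + P_GEN[x])


def density_ratio(x):
    """D / (1 - D), which equals p_data(x) / p_gen(x) for the ideal discriminator."""
    d = discriminator(x)
    return d / (1.0 - d)


def rejection_sample(n_draws, rng):
    """Draw from the generator and accept with probability ratio / max_ratio."""
    items = range(len(P_GEN))
    max_ratio = max(density_ratio(x) for x in items)
    accepted = []
    for _ in range(n_draws):
        x = rng.choices(items, weights=P_GEN)[0]
        if rng.random() < density_ratio(x) / max_ratio:
            accepted.append(x)
    return accepted


if __name__ == "__main__":
    samples = rejection_sample(100000, random.Random(0))
    print([round(samples.count(x) / len(samples), 3) for x in range(3)])
```

By contrast, a hard threshold such as "reject whenever the discriminator is at least 50% sure the sample is generated" would never emit item 0 here, even though the data distribution gives it probability 0.2.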

Meta learning

Meta-Learn­ing: A Sur­vey (Joaquin Van­schoren) (sum­ma­rized by Richard): This tax­on­omy of meta-learn­ing clas­sifies ap­proaches by the main type of meta-data they learn from:

1. Eval­u­a­tions of other mod­els on re­lated tasks

2. Char­ac­ter­i­sa­tions of the tasks at hand (and a similar­ity met­ric be­tween them)

3. The struc­tures and pa­ram­e­ters of re­lated models

Van­schoren ex­plores a num­ber of differ­ent ap­proaches in each cat­e­gory.

Cri­tiques (AI)

The 30-Year Cy­cle In The AI De­bate (Jean-Marie Chau­vet)


In­tro­duc­ing Stan­ford’s Hu­man-Cen­tered AI Ini­ti­a­tive (Fei-Fei Li et al): Stan­ford will house the Hu­man-cen­tered AI Ini­ti­a­tive (HAI), which will take a mul­ti­dis­ci­plinary ap­proach to un­der­stand how to de­velop and de­ploy AI so that it is ro­bustly benefi­cial to hu­man­ity.

Ro­hin’s opinion: It’s always hard to tell from these an­nounce­ments what ex­actly the ini­ti­a­tive will do, but it seems to be fo­cused on mak­ing sure that AI does not make hu­mans ob­so­lete. In­stead, AI should al­low us to fo­cus more on the cre­ative, emo­tional work that we are bet­ter at. Given this, it’s prob­a­bly not go­ing to fo­cus on AI al­ign­ment, un­like the similarly named Cen­ter for Hu­man-Com­pat­i­ble AI (CHAI) at Berkeley. My main ques­tion for the au­thor would be what she would do if we could de­velop AI sys­tems that could re­place all hu­man la­bor (in­clud­ing cre­ative and emo­tional work). Should we not de­velop such AI sys­tems? Is it never go­ing to hap­pen?

Read more: How to Make A.I. That’s Good for People
