Alignment Newsletter #29


Deep Imitative Models for Flexible Inference, Planning, and Control (Nicholas Rhinehart et al): It's hard to apply deep RL techniques to autonomous driving, because we can't simply collect a large amount of experience with collisions in order to learn. However, imitation learning is also hard, because as soon as your car deviates from the expert trajectories that you are imitating, you are out of distribution, and you could make more mistakes, leading to accumulating errors until you crash. Instead, we can model the expert's behavior, so that we can tell when we are moving out of distribution, and take corrective action.

They split the problem into three stages. First, they generate a set of waypoints along the path to be followed, which are about 20m away from each other, by using A* search on a map. Next, they use model-based planning with an imitative model to generate a plan (a sequence of states) that would take the car to the next waypoint. Finally, they use a simple PID controller to choose low-level actions that keep the car on target towards the next state in the plan.

The key technical contribution is the imitative model, a probabilistic model P(s_{1:T}, G, φ), where φ is the current observation (e.g. LIDAR), s_{1:T} is the planned trajectory, and G is a goal. We can learn P(s_{1:T} | φ) from expert demonstrations. The goal G can be anything for which you can write down a specification P(G | s_{1:T}, φ). For example, if you simply want to reach a waypoint, you can use a normal distribution on the distance between the final state s_T and the waypoint. You can also incorporate a hand-designed cost on each state.
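As a rough illustration of how the pieces fit together, here is a minimal sketch (all names and the toy imitation prior are hypothetical stand-ins, not the paper's code) of scoring candidate plans by log P(s_{1:T} | φ) plus a Gaussian waypoint goal likelihood, with an optional per-state cost added at test time:

```python
import numpy as np

def log_goal_likelihood(traj, waypoint, sigma=1.0):
    """Goal likelihood P(G | s_{1:T}): a Gaussian on the distance
    between the final state s_T and the waypoint, up to a constant."""
    d = np.linalg.norm(traj[-1] - waypoint)
    return -0.5 * (d / sigma) ** 2

def plan_score(traj, waypoint, log_imitation_prior, state_cost=None):
    """Score a candidate plan: log P(s_{1:T} | phi) + log P(G | s_{1:T}).
    An optional per-state cost (e.g. for potholes) can be added at
    test time without any retraining."""
    score = log_imitation_prior(traj) + log_goal_likelihood(traj, waypoint)
    if state_cost is not None:
        score -= sum(state_cost(s) for s in traj)
    return score

# Toy imitation prior that favors short, smooth steps.
prior = lambda traj: -np.sum(np.linalg.norm(np.diff(traj, axis=0), axis=1))
waypoint = np.array([2.0, 0.0])
straight = np.array([[0, 0], [1, 0], [2, 0]], dtype=float)
detour = np.array([[0, 0], [1, 2], [2, 0]], dtype=float)
best = max([straight, detour], key=lambda t: plan_score(t, waypoint, prior))
```

In the real system the prior is a learned deep model and the planner optimizes over trajectories rather than picking from a fixed set; the decomposition into prior, goal likelihood, and cost is the point.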

They evaluate in simulation on a static world (so no pedestrians, for example). They show decent transfer from one map to a second map, and also that they can avoid artificially introduced potholes at test time (despite not seeing them at training time), simply by adding a cost on states over a pothole (which they can take into account because they are performing model-based planning).

Rohin's opinion: I really like this paper: it showcases the benefits of both model-based planning and imitation learning. Since the problem has been decomposed into a predictive model, a goal G, and a planner, we can edit G directly to get new behavior at test time without any retraining (as they demonstrate with the pothole experiment). At the same time, they can get away with not specifying a full reward function, as many features of good driving, like passenger comfort and staying in the correct lane, are learned simply by imitating an expert.

That said, they initially state that one of their goals is to learn from offline data, even though offline data typically has no examples of crashes, and "A model ignorant to the possibility of a crash cannot know how to prevent it". I think the idea is that you never get into a situation where you could crash, because you never deviate from expert behavior, since deviating would have low P(s_{1:T} | φ). This is better than model-based planning on offline data, which would consider actions that lead to a crash, have no idea what would happen, and output garbage. However, it still seems that situations could arise where a crash is imminent, which don't arise much (if at all) in the training data, and the car fails to swerve or brake hard because it hasn't seen enough data.

Interpretability and Post-Rationalization (Vincent Vanhoucke): Neuroscience suggests that most explanations that we humans give for a decision are post-hoc rationalizations, and don't reflect the messy underlying true reasons for the decision. It turns out that decision making, perception, and all the other tasks we're hoping to outsource to neural nets are inherently complex and difficult, and are not amenable to easy explanation. We can aim for "from-without" explanations, which post-hoc rationalize the decisions a neural net makes, but "from-within" explanations, which aim for a mechanistic understanding, are intractable. We could try to design models that are more interpretable (in the "from-within" sense), but this would lead to worse performance on the actual task, which would hurt everyone, including the people calling for more accountability.

Rohin's opinion: I take a pretty different view from this post—I've highlighted it because I think this is an important disagreement that's relevant for alignment. In particular, it's not clear to me that "from-within" interpretability is doomed—while I agree that humans basically only do "from-without" rationalizations, we also aren't able to inspect a human brain in the same way that we can inspect a neural net. For example, we can't see the output of each individual neuron, we can't tell what input each neuron would respond maximally to, and we can't pose counterfactuals with slightly different inputs to see what changes. In fact, I think that "from-within" interpretability techniques, such as The Building Blocks of Interpretability, have already seen successes in identifying biases that image classifiers suffer from, which we wouldn't have known about otherwise.

We could also consider whether post-hoc rationalization is sufficient for alignment. Consider a thought experiment where a superintelligent AI is about to take a treacherous turn, but there is an explainer AI system that post-hoc rationalizes the output of the AI and could warn us in advance. If the explainer AI only gets access to the output of the superintelligent AI, I'm very worried—it seems way too easy to come up with some arbitrary rationalization for an action that makes it seem good; you'd have to have a much more powerful explainer AI to have a hope. On the other hand, if the explainer AI gets access to all of the weights and activations that led to the output, it seems more likely that this could work—as an analogy, I think a teenager could tell if I was going to betray them if they could constantly eavesdrop on my thoughts.

Technical AI alignment

Learning human intent

Deep Imitative Models for Flexible Inference, Planning, and Control (Nicholas Rhinehart et al): Summarized in the highlights!

Addressing Sample Inefficiency and Reward Bias in Inverse Reinforcement Learning (Ilya Kostrikov et al): Deep IRL algorithms typically work by training a discriminator that distinguishes between states and actions from the expert and states and actions from the learned policy, and extracting a reward function from the discriminator. In any environment where the episode can end after a variable number of timesteps, this assumes that the reward is zero after the episode ends. The reward function from the discriminator often takes a form where it must always be positive, inducing a survival incentive, or a form where it must always be negative, inducing a living cost. For example, GAIL's reward is always positive, giving a survival incentive. As a result, without any reward learning at all GAIL does better on Hopper than behavioral cloning, and fails to learn on a reaching or pushing task (where you want to do the task as quickly as possible, so you want the living cost). To solve this, they learn an "absorbing state reward", which is a reward given after the episode ends—this allows the algorithm to learn for itself whether it should have a survival incentive or living cost.
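The sign bias is easy to see concretely. A minimal sketch (hypothetical function names, not the authors' code): a reward of the form -log(1 - D) is positive for any discriminator output, log(D) is negative, and the proposed fix pads variable-length episodes with a learned absorbing-state reward instead of an implicit zero:

```python
import math

def survival_reward(d):
    """GAIL-style reward -log(1 - D): positive for any D in (0, 1),
    so the agent is rewarded merely for keeping the episode going."""
    return -math.log(1.0 - d)

def living_cost_reward(d):
    """Alternative form log(D): negative for any D in (0, 1),
    so the agent is pushed to end the episode as quickly as possible."""
    return math.log(d)

def episode_return(step_rewards, r_absorbing, horizon, gamma=0.99):
    """Return for a variable-length episode: after the episode ends,
    pad the remaining timesteps with a learned absorbing-state reward
    r_absorbing rather than the implicit zero."""
    ret = sum(gamma ** t * r for t, r in enumerate(step_rewards))
    for t in range(len(step_rewards), horizon):
        ret += gamma ** t * r_absorbing
    return ret
```

With r_absorbing learned rather than fixed at zero, the algorithm can discover on its own whether ending the episode early should be penalized (survival incentive) or rewarded (living cost).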

They also introduce a version that keeps a replay buffer of experience and uses an off-policy algorithm to learn from the replay buffer in order to improve sample efficiency.

Rohin's opinion: The key insight—that rewards are not invariant to the addition of a constant when you have variable-length episodes—is useful, and I'm glad that it's been pointed out and a solution proposed. However, the experiments are really strange—in one case (Figure 4, HalfCheetah) their algorithm outperforms the expert (which has access to the true reward), and in another (Figure 5, right) the blue line implies that using a uniformly zero reward lets you achieve around a third of expert performance (!!).


Interpretability and Post-Rationalization (Vincent Vanhoucke): Summarized in the highlights!

Sanity Checks for Saliency Maps (Julius Adebayo et al)

Adversarial examples

Spatially Transformed Adversarial Examples (Chaowei Xiao et al) (summarized by Dan H): Many adversarial attacks perturb pixel values, but the attack in this paper perturbs the pixel locations instead. This is accomplished with a smooth image deformation which has subtle effects for large images. For MNIST images, however, the attack is more obvious and not necessarily content-preserving (see Figure 2 of the paper).

Characterizing Adversarial Examples Based on Spatial Consistency Information for Semantic Segmentation (Chaowei Xiao et al) (summarized by Dan H): This paper considers adversarial attacks on segmentation systems. They find that segmentation systems behave inconsistently on adversarial images, and they use this inconsistency to detect adversarial inputs. Specifically, they take overlapping crops of the image and segment each crop. For overlapping crops of an adversarial image, they find that the segmentations are more inconsistent. They defend against one adaptive attack.
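A toy sketch of the detection signal (hypothetical names and a stand-in "segmenter"; the paper uses a real segmentation network and mIoU over many crops): segment two overlapping crops and measure agreement on the shared region.

```python
import numpy as np

def crop_consistency(segment, image, size=8, stride=4):
    """Segment two horizontally overlapping crops and return the
    fraction of pixels in the shared region where predictions agree.
    Low consistency flags a likely adversarial input."""
    a = segment(image[:, 0:size])
    b = segment(image[:, stride:stride + size])
    # Shared columns: [stride, size) of crop a == [0, size - stride) of crop b.
    return (a[:, stride:] == b[:, :size - stride]).mean()

rng = np.random.default_rng(0)
image = rng.random((8, 12))

# A per-pixel thresholding "segmenter" is perfectly consistent across crops.
clean_segment = lambda crop: (crop > 0.5).astype(int)
clean_score = crop_consistency(clean_segment, image)

# A segmenter whose output varies with the crop framing (as happens on
# adversarial images, where the perturbation is tuned to one framing)
# scores lower.
adv_segment = lambda crop: ((crop + rng.random(crop.shape)) > 1.0).astype(int)
adv_score = crop_consistency(adv_segment, image)
```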


On Calibration of Modern Neural Networks (Chuan Guo et al.) (summarized by Dan H): Models should not be unduly confident, especially when said confidence is used for decision making or downstream tasks. This work provides a simple method to make models more calibrated, so that the confidence estimates are closer to the true correctness likelihood. (For example, if a calibrated model predicts "toucan" with 60% confidence, then 60% of the time the input was actually a toucan.) Before presenting their method, they observe that batch normalization can make models less calibrated, while unusually large weight decay regularization can increase calibration. However, their proposed approach to increase calibration does not impact accuracy or require substantive model changes. They simply adjust the temperature of the softmax to make the model's "confidence" (here the maximum softmax probability) more calibrated. Specifically, after training they tune the softmax temperature to minimize the cross entropy (negative average log-likelihood) on validation data. They then measure model calibration with a measure that is related to the Brier score, but with absolute values rather than squares.
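Temperature scaling itself fits in a few lines. A minimal sketch on toy data (the tuning here is a simple grid search over T, which Dan H's opinion below notes is usually sufficient, rather than the paper's optimizer):

```python
import numpy as np

def nll(logits, labels, T):
    """Average negative log-likelihood of the temperature-scaled softmax."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def temperature_scale(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    """Post-hoc calibration: pick the single temperature T that minimizes
    validation NLL. Accuracy is unchanged, since argmax is invariant to
    dividing all logits by the same T > 0."""
    return min(grid, key=lambda T: nll(logits, labels, T))

# Toy "validation set" with deliberately peaked logits.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=500)
logits = np.eye(3)[labels] * 5.0 + rng.normal(0.0, 2.0, size=(500, 3))
T = temperature_scale(logits, labels)
```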

Dan H’s opin­ion: Pre­vi­ous cal­ib­ra­tion work in ma­chine learn­ing con­fer­ences would of­ten to fo­cus on cal­ib­rat­ing re­gres­sion mod­els, but this work has re­newed in­terest in cal­ib­rat­ing clas­si­fi­ers. For that reason I view this pa­per highly. That said, this pa­per’s eval­u­ation meas­ure, the “Ex­pec­ted Cal­ib­ra­tion Er­ror” is not a proper scor­ing rule, so op­tim­iz­ing this does not ne­ces­sar­ily lead to cal­ib­ra­tion. In their ap­prox­im­a­tion of the ECE, they use equally-wide bins when there is reason to use ad­apt­ively sized bins. Con­sequently I think Nguyen and O’Con­nor Sec­tions 2 and 3 provide a bet­ter cal­ib­ra­tion ex­plan­a­tion, bet­ter cal­ib­ra­tion meas­ure, and bet­ter es­tim­a­tion pro­ced­ure. They also sug­gest us­ing a con­vex op­tim­iz­a­tion lib­rary to find the soft­max tem­per­at­ure, but at least lib­rar­ies such as CVXPY re­quire far more time and memory than a simple soft­max tem­per­at­ure grid search. Fin­ally, an un­der­stand­able lim­it­a­tion of this work is that it as­sumes test-time in­puts are in-dis­tri­bu­tion, but when in­puts are out-of-dis­tri­bu­tion this method hardly im­proves cal­ib­ra­tion.

Miscellaneous (Alignment)

AI Alignment Podcast: On Becoming a Moral Realist with Peter Singer (Peter Singer and Lucas Perry): There's a fair amount of complexity in this podcast, and I'm not an expert on moral philosophy, but here's an oversimplified summary anyway. First, in the same way that we can reach mathematical truths through reason, we can also arrive at moral truths through reason, which suggests that they are true facts about the universe (a moral realist view). Second, preference utilitarianism has the problem of figuring out which preferences you want to respect, which isn't a problem with hedonic utilitarianism. Before and after the interview, Lucas argues that moral philosophy is important for AI alignment. Any strategic research "smuggles" in some values, and many technical safety problems, such as preference aggregation, would benefit from a knowledge of moral philosophy. Most importantly, given our current lack of consensus on moral philosophy, we should be very wary of locking in our values when we build powerful AI.

Rohin's opinion: I'm not convinced that we should be thinking a lot more about moral philosophy. While I agree that locking in a set of values would likely be quite bad, I think this means that researchers should not hardcode a set of values, or create an AI that infers some values and then can never change them. It's not clear to me why studying more moral philosophy helps us with this goal. For the other points, it seems not too important to get preference aggregation or particular strategic approaches exactly perfect as long as we don't lock in values—as an analogy, we typically don't argue that politicians should be experts on moral philosophy, even though they aggregate preferences and have large impacts on society.

Near-term concerns

Fairness and bias

A new course to teach people about fairness in machine learning (Sanders Kleinfeld): Google has added a short section on fairness to their Machine Learning Crash Course (MLCC).

Privacy and security

Secure Deep Learning Engineering: A Software Quality Assurance Perspective (Lei Ma et al)

Other progress in AI

Reinforcement learning

Open sourcing TRFL: a library of reinforcement learning building blocks (Matteo Hessel et al) (summarized by Richard): DeepMind is open-sourcing a TensorFlow library of "key algorithmic components" used in their RL agents. They hope that this will allow less buggy RL code.

Richard's opinion: This continues the trend of being able to easily implement deep learning at higher and higher levels of abstraction. I'm looking forward to using it.

CURIOUS: Intrinsically Motivated Multi-Task, Multi-Goal Reinforcement Learning (Cédric Colas et al) (summarized by Richard): This paper presents an intrinsically-motivated algorithm (an extension of Universal Value Function Approximators) which learns to complete multiple tasks, each parameterised by multiple "goals" (e.g. the locations of targets). It prioritises replays of tasks which are neither too easy nor too hard, but instead allow maximal learning progress; this also helps prevent catastrophic forgetting, by refocusing on tasks which the agent begins to forget.
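The replay-prioritization idea can be sketched in a few lines (hypothetical names, a simplified stand-in for the paper's mechanism): sample tasks in proportion to absolute learning progress, so both improving and regressing (being-forgotten) tasks get attention, while mastered or impossible tasks do not.

```python
import numpy as np

def task_probs(progress, eps=1e-6):
    """Replay probabilities proportional to absolute learning progress
    (recent change in success rate per task). Mastered and impossible
    tasks have progress near zero, so they are rarely replayed; a task
    being forgotten has negative progress, and |progress| brings it back."""
    p = np.abs(np.asarray(progress)) + eps
    return p / p.sum()

# Four tasks: mastered, actively learning, being forgotten, stuck.
progress = [0.0, 0.3, -0.2, 0.01]
probs = task_probs(progress)
```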

Richard’s opin­ion: While I don’t think this pa­per is par­tic­u­larly novel, it use­fully com­bines sev­eral ideas and provides eas­ily-in­ter­pretable res­ults.

Deep learning

Discriminator Rejection Sampling (Samaneh Azadi et al): Under simplifying assumptions, GAN training should converge to the generator modelling the true data distribution, while the discriminator always outputs 0.5. In practice, at the end of training the discriminator can still distinguish between images from the generator and images from the dataset. This suggests that we can improve the generated images by only choosing the ones that the discriminator thinks are from the dataset. However, if we use a threshold (rejecting all images where the discriminator is at least X% sure they come from the generator), then we no longer model the true underlying distribution, since some low-probability images could never be generated. They instead propose a rejection sampling algorithm that still recovers the data distribution under strict assumptions, then relax those assumptions to get a practical algorithm, and show that it improves performance.
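A hedged sketch of the basic scheme (hypothetical names; for an optimal discriminator the logit gives the density ratio p_data/p_g, and the paper adds further corrections to keep acceptance rates practical):

```python
import numpy as np

def acceptance_prob(d_logits, max_logit):
    """Rejection sampling from the discriminator logit: under an
    optimal discriminator, exp(logit) = p_data(x) / p_g(x), so accept
    with probability exp(logit - max_logit), where max_logit bounds
    the ratio over observed samples."""
    return np.exp(d_logits - max_logit)

def rejection_sample(samples, d_logits, rng):
    """Keep each generated sample with its acceptance probability,
    rather than hard-thresholding, so low-probability regions of the
    data distribution can still be generated."""
    keep = rng.random(len(samples)) < acceptance_prob(d_logits, d_logits.max())
    return samples[keep]

rng = np.random.default_rng(0)
samples = rng.normal(size=(1000, 2))   # stand-in generator outputs
d_logits = rng.normal(size=1000)       # stand-in discriminator logits
kept = rejection_sample(samples, d_logits, rng)
```

The contrast with thresholding is the point: every sample has nonzero acceptance probability, so the accepted distribution is a reweighting of the generator's, not a truncation.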

Meta learning

Meta-Learning: A Survey (Joaquin Vanschoren) (summarized by Richard): This taxonomy of meta-learning classifies approaches by the main type of meta-data they learn from:

1. Evaluations of other models on related tasks

2. Characterisations of the tasks at hand (and a similarity metric between them)

3. The structures and parameters of related models

Vanschoren explores a number of different approaches in each category.

Critiques (AI)

The 30-Year Cycle In The AI Debate (Jean-Marie Chauvet)


Introducing Stanford's Human-Centered AI Initiative (Fei-Fei Li et al): Stanford will house the Human-centered AI Initiative (HAI), which will take a multidisciplinary approach to understand how to develop and deploy AI so that it is robustly beneficial to humanity.

Rohin's opinion: It's always hard to tell from these announcements what exactly the initiative will do, but it seems to be focused on making sure that AI does not make humans obsolete. Instead, AI should allow us to focus more on the creative, emotional work that we are better at. Given this, it's probably not going to focus on AI alignment, unlike the similarly named Center for Human-Compatible AI (CHAI) at Berkeley. My main question for the author would be what she would do if we could develop AI systems that could replace all human labor (including creative and emotional work). Should we not develop such AI systems? Is it never going to happen?

Read more: How to Make A.I. That's Good for People