[AN #100]: What might go wrong if you learn a reward function while acting


Newsletter #100 (!!)

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).


Pitfalls of learning a reward function online (Stuart Armstrong et al) (summarized by Rohin): It can be dangerous to learn the metric that you are trying to optimize: if you don't set it up correctly, you may end up incentivizing the agent to "update in a particular direction" in the metric learning for the sake of future optimization (a point previously made in Towards Interactive Inverse Reinforcement Learning). This paper analyzes the problems that can arise when an agent simultaneously learns a reward function and optimizes that reward function.

The agent may have an incentive to "rig" the reward learning process, such that it finds a reward that is easy to optimize. For example, consider a student Sandra who must figure out the deadline and evaluation criteria for a project from a teacher Trisha. Sandra expects that if she asks Trisha when the deadline is, she will say that the deadline is later this week. So, Sandra might cleverly ask, "Is the project due next week, or the week after?", to which Trisha might respond "next week". In this way, Sandra can rig the deadline-learning process in order to obtain a more favorable deadline.

Worse, in such scenarios the need to rig the learning process can destroy value for every reward function you are considering. For example, let's suppose that if Trisha couldn't be manipulated, Sandra's optimal policy would be to start the project today, regardless of when the actual deadline is. However, given that Trisha can be manipulated, Sandra will spend today manipulating Trisha into setting a later deadline, an action that seems clearly suboptimal from the perspective of any fixed deadline. The paper describes this as sacrificing reward with certainty.
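The "sacrificing reward with certainty" structure can be checked with a few lines of arithmetic. The payoffs and probabilities below are made up purely for illustration; what matters is the shape of the situation: manipulation is strictly worse under every fixed deadline, yet maximizes expected reward under the riggable learned deadline.

```python
# reward[policy][deadline]: made-up project quality if that deadline holds.
reward = {
    "start_today":      {"early": 6, "late": 10},
    "manipulate_first": {"early": 0, "late": 9},  # lost a day of real work
}

# Under *every* fixed deadline, starting today is strictly better:
for deadline in ("early", "late"):
    assert reward["start_today"][deadline] > reward["manipulate_first"][deadline]

# But manipulation makes the late deadline near-certain, so the objective
# defined by the *learned* deadline still favors manipulating:
p_late = {"start_today": 0.5, "manipulate_first": 0.95}
expected = {
    pol: p_late[pol] * r["late"] + (1 - p_late[pol]) * r["early"]
    for pol, r in reward.items()
}
best = max(expected, key=expected.get)  # "manipulate_first"
```

So a policy that is dominated under every candidate reward function can still be optimal for the agent that gets to influence which reward function is learned.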

To avoid such situations, we need unriggable learning processes, that is, ones where at all times, the expected final learned reward (deadline) is independent of the agent's (Sandra's) policy. This unriggability property is nearly equivalent to the property of uninfluenceability, in which we must be able to posit some background variables in the environment such that the learning process can be said to be "learning" these variables. Technically, an unriggable process need not be uninfluenceable, though it usually is (see the paper for details).

However, these properties only constrain the expectation over environments of the final reward distribution: they don't prevent the agent from somehow shuffling around reward functions to be matched with suitable environments. For example, without knowing which projects are easy or hard, Sandra could manipulate Trisha into giving early deadlines for easy projects, and late deadlines for hard projects, in a manner that preserved the distribution over early and late deadlines; this would satisfy the unriggability property (and probably also the uninfluenceability property, depending on the exact formalization).

The authors demonstrate these problems in a simple gridworld example. They also point out that there's a simple way to make any learning process uninfluenceable: choose a specific policy π that gathers information about the reward, and then define the new learning process to be "whatever the original learning process would have said if you executed π".

Read more: Blog post: Learning and manipulating learning

Rohin's opinion: I would explain this paper's point somewhat differently than the paper does. Consider an AI system in which we build in a prior over rewards and an update rule, and then have it act in the world. At the end of the trajectory, it is rewarded according to the expected reward of the trajectory under the inferred posterior over rewards. Then, the AI system is incentivized to choose actions under which the resulting posterior is easy to maximize.

This doesn't require the reward function to be ambiguous; it just requires that the update rule isn't perfect. For example, imagine that Alice has a real preference for apples over bananas, and you use the update rule "if Alice eats an apple, infer that she likes apples; if Alice eats a banana, infer that she likes bananas". The robot finds it easier to grasp the rigid apple, and so can get higher expected reward in the worlds where Alice likes apples. If you train a robot in the manner above, then the robot will learn to throw away the bananas, so that Alice's only choice is an apple (that we assume she then eats), allowing the robot to "infer" that Alice likes apples, which it can then easily maximize. This sort of problem could happen in most current reward learning setups, if we had powerful enough optimizers.
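A minimal sketch of this failure mode. The grasp-success numbers are made up, standing in for "apples are easier to optimize for", and the robot is scored on the trajectory's value under whatever reward the naive update rule infers:

```python
# Made-up numbers: apples are rigid and easy to grasp, bananas are not.
GRASP_SUCCESS = {"apple": 0.9, "banana": 0.5}

def rollout(throw_away_bananas: bool, true_pref: str) -> float:
    # If the bananas are gone, Alice can only eat an apple;
    # otherwise she eats her actually-preferred fruit.
    eaten = "apple" if throw_away_bananas else true_pref
    # Naive update rule: "Alice likes whatever she ate."
    inferred = eaten
    # Trajectory reward under the inferred posterior: how well the robot
    # can serve the fruit it "learned" Alice likes.
    return GRASP_SUCCESS[inferred]

# Averaged over worlds (Alice might truly prefer either fruit),
# rigging the inference beats letting Alice reveal her preference:
prefs = ["apple", "banana"]
rig    = sum(rollout(True, p) for p in prefs) / 2   # inference forced to "apple"
honest = sum(rollout(False, p) for p in prefs) / 2  # inference tracks the truth
```

Training against this objective with a powerful enough optimizer selects the banana-throwing policy, even though the update rule looks superficially reasonable.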

It seems to me that the problem is that you are training the actor, but not training the update rule, and so the actor learns to "trick" the update rule. Instead, it seems like we should train both. This is kind of what happens with assistance games / CIRL (AN #69), in which you train a policy to maximize expected reward under the prior, and so the policy is incentivized to take the best information gathering actions (which, if you squint, is like "training to update well"), and to maximize what it thinks is the true reward. Of course, if your prior / update rule within the game are misspecified, then bad things can happen. See also Stuart's reactions here and here, as well as my comments on those posts.



Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior? (Peter Hase et al) (summarized by Robert): In this paper the authors perform user tests on 5 different model-agnostic interpretability methods: LIME, Anchor, Decision Boundary, Prototype Model, and a Composite model (combining LIME, Anchor, and Decision Boundary). The use cases they test are a tabular dataset predicting income, and a movie-review dataset predicting the sentiment of a review from a single sentence.

Their experimental setup consists of 2 tests: forward prediction and counterfactual prediction. In forward prediction, the user is shown 16 examples of inputs with corresponding outputs and explanations, and then must predict the model's output on new inputs (without the explanation, which often gives away the answer). In counterfactual prediction, after seeing 16 examples, the user is given an input-output-explanation triple, and then must predict how the output changes for a specific perturbation of the input.

Throughout the results they use a significance threshold of p < 0.05 (they don't use Bonferroni corrections). Their study has responses from 32 different students who'd taken at least 1 computer science course, with some screened out for outliers or low accuracy during training. There are approximately 200 individual predictions for each method/dataset-type combination, and each method/prediction-type combination.

Overall, their results show that only LIME (Local Interpretable Model-agnostic Explanations) helps improve performance with statistical significance on the tabular dataset across both prediction settings, and only the Prototype model in counterfactual prediction across both datasets. No other result was statistically significant. The improvement in accuracy for the statistically significant results is around 10% (from 70% to 80% on the tabular dataset with LIME, and 63% to 73% for Prototype in counterfactual prediction).

They also showed that users' ratings of the explanation methods didn't correlate in a statistically significant way with the improvement the methods gave to their predictions.

Robert's opinion: I'm happy a paper like this exists, because I think this kind of work is crucial in evaluating whether the interpretability methods we're building are actually useful. I'm not surprised by the results: since this kind of evaluation hasn't been done rigorously before, researchers have never had a good way of telling whether their methods produced good explanations or not.

The study is weakened by the low sample size, which makes many of the p-values not significant. My intuition says a few more of the methods would produce statistically significant positive results in one of the domains/prediction settings if the sample size were bigger, but it seems like some settings (forward prediction, and textual data) are very hard to improve, with none of the methods getting a better improvement in performance than 5.7% (which had a p-value of 0.197).

A really interesting point is the lack of strong correlation between user preference and performance improvement. This could be explained by the fact that most of the methods are ineffective at performance improvement, but it seems plausible (to me) that it could hold even if some methods were effective: if the model behaviour being explained can't be explained cleanly, then methods which do explain the behaviour might produce messy and confusing (but true) explanations, and hence get lower ratings from users than methods which give clean and clear (but false) explanations. I think this stems from the lack of a definition of what exactly the goal is for these interpretation methods. Without a goal in mind, it's impossible to measure whether a method achieves that goal. I think working towards some form of quantifiable measurement is useful, particularly for comparing methods, since, if this study's evidence is anything to go by, asking humans to evaluate the model's output might not be the most useful evaluation.

Towards Interpretable Reinforcement Learning Using Attention Augmented Agents (Alexander Mott et al) (summarized by Robert): In this paper the authors train a reinforcement learning agent with a soft attention module built into it. The attention module forms a bottleneck between the visual input and the network choosing the next action, which forces the model to learn to attend to only the important parts of the scene. This means they can visualise which parts of the input the model thinks are important, as those are the parts of the input that the model is attending to. The queries to the attention module are determined by a top-level recurrent network, without input from the current image, so they act as a form of "top down" attention, where the top controller can be imagined to be querying the processed image for various locations and objects.

Having trained this agent (which still gets performance competitive with SOTA RL models on a fair few ATARI games), they qualitatively evaluate the attention visualisation on a variety of games. They find several common strategies in the attention schemes, such as the agents attending to specific points until an object crosses the point ("Tripwires"). The attention is computed over both regular pixels and a Fourier-based positional encoding. Thanks to this and other aspects of their architecture, the authors can check whether the queries are focused on pixel values (i.e. looking for a specific pattern of pixels anywhere) or on location features (i.e. asking what pixels are present at a specific location). For example, they find that the agent often queries the location where the score is displayed, presumably because it is useful for calculating the value function. They also compare their method with self-attention based models, and with other saliency methods.

The best way to get a feel for the visualisations is to go to the paper's website and watch the example videos.

Read more: The paper's website

Robert's opinion: This paper isn't revolutionary in its approach, but it's interesting to see work on interpreting RL agents, and the fact that the interpretability is built in is interesting: it gives us a harder guarantee that the visualisation is actually showing us the parts of the input that the model thinks of as important, as they actually are important in its processing. It's promising to see that the in-built interpretability also didn't seem to penalise performance much. It would be interesting to see this method applied to other, stronger kinds of models, to see whether it still produces useful visualisations and how it affects their performance.


AI Governance Career Paths for Europeans (Anonymous) (summarized by Rohin): Exactly what it sounds like.


A Guide to Writing the NeurIPS Impact Statement (Carolyn Ashurst et al) (summarized by Nicholas): NeurIPS 2020 requires paper submissions to include a statement on the broader impact of their work. This post provides a guide for how to write an effective impact statement. They recommend focusing on the most significant, neglected, and tractable impacts, both positive and negative, while also conveying the uncertainties involved. They also suggest integrating this into the research process by reading the tech governance literature and building institutional structures, and including this information in introductions.

Their guide then recommends considering 3 questions:

How could your research affect ML applications?

What are the societal implications of these applications?

What research or other initiatives could improve social outcomes?

There is more information in the guide on how to go about answering those questions, along with some examples.

Nicholas's opinion: I am definitely in favor of considering the impacts of ML research before conducting or publishing it. I think the field is currently either at or near a threshold where papers will start having significant real world effects. While I don't think this requirement will be sufficient for ensuring positive outcomes, I am glad NeurIPS is trying it out.

I think the article makes very strong points and will improve the quality of the impact statements that get submitted. I particularly liked the point about communicating uncertainty, which is a norm that I think the ML community would benefit from greatly. One thing I would add here is that giving explicit probabilities is often more helpful than vague words like "might" or "could".



"Other-Play" for Zero-Shot Coordination (Hengyuan Hu et al) (summarized by Rohin): How can we build AI systems that can coordinate with humans? While past work (AN #70) has assumed access to some amount of human data, this paper aims to coordinate without any human data at all, which they call zero-shot coordination. In order to develop an algorithm, they assume that their partner is also "trained" for zero-shot coordination.

Their key idea is that in zero-shot coordination, since you can't break symmetries by agreeing upon a protocol in advance (i.e. you can't agree on things like "we'll drive on the left, not the right"), you need a policy that is robust to relabelings that preserve these symmetries. This is easy to train for: you just train in self-play, but randomly relabel the states, actions, and observations separately for each side in a way that preserves the MDP structure (i.e. uses one of the symmetries). Thus, each side must play a policy that works well without knowing how the other agent's observations and actions have been relabeled. In practice, for an N-player game you only need to randomize N-1 of the relabelings, and so in the two-player games they consider they only randomly relabel one side of the self-play.
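A sketch of what the relabeled self-play loop might look like for a Hanabi-style game whose rules are invariant under color permutations. Here `env` and `policy` are hypothetical stand-ins (observations are assumed to be arrays of color indices); only the symmetry sampling and the relabel/un-relabel bookkeeping are concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
N_COLORS = 5  # e.g. Hanabi card colors; permuting them preserves the game


def sample_symmetry():
    """Draw a random color relabeling (one element of the symmetry group)
    together with its inverse."""
    perm = rng.permutation(N_COLORS)
    inv = np.argsort(perm)  # inv[perm[c]] == c for every color c
    return perm, inv


def other_play_episode(policy, env):
    """One self-play episode in which player 2's entire view of the world
    passes through a random relabeling. In a two-player game only
    N - 1 = 1 side needs to be relabeled."""
    perm, inv = sample_symmetry()
    obs = env.reset()
    done, total = False, 0.0
    while not done:
        a1 = policy(obs[0])              # player 1 sees raw colors
        a2_relab = policy(perm[obs[1]])  # player 2 sees permuted colors
        a2 = inv[a2_relab]               # map its action back before stepping
        obs, reward, done = env.step((a1, a2))
        total += reward
    return total
```

Because the shared `policy` is scored while one side's labels are scrambled each episode, it cannot profit from arbitrary symmetry-breaking conventions, which is exactly the property zero-shot coordination needs.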

They evaluate this in Hanabi (where the game is invariant to relabeling of the colors), and show that the resulting agents are better at playing with other agents trained with different seeds or slightly different architectures, and also that they play better with humans, achieving an average score of 15.75 with non-expert human players, compared to 9.15 for agents trained via regular self-play.

Rohin's opinion: For comparison, I think I get around 17-22 when playing with new players, out of a max of 25, so 15.75 is quite a healthy score given that it doesn't use any human data. That being said, it seems hard to use this method in other settings: even in the relatively simple Overcooked environment (AN #70), there aren't any obvious symmetries to use for such training. Perhaps future work will allow us to find approximate symmetries in games somehow, that we can then train to be robust to?

Towards Learning Multi-agent Negotiations via Self-Play (Yichuan Charlie Tang) (summarized by Rohin): While the previous paper introduces other-play to become robust to unknown partners, this paper takes the other approach of simply training an agent that is robust to a wide, diverse population of possible agents. In particular, it studies a self-driving car "zipper merge" environment, and trains an agent to be robust to a variety of rule-based agents, as well as past versions of itself, and finds that this leads to a much more successful merging policy. However, this is evaluated against the population it is trained with, and not against any previously unseen agents.

Building AI that can master complex cooperative games with hidden information (Adam Lerer et al) (summarized by Flo): This paper improves on the state of the art for AI agents playing Hanabi (AN #45), a cooperative multiplayer game that is challenging because of distributed hidden information and restricted communication.

The approach works by improving a baseline policy using search. In the simplest case, only one agent performs search while all other agents follow a fixed policy, such that the problem is reduced to search in a POMDP. This alone leads to relevant improvements, even when the search is very shallow. The fixed policies help because they allow the searching agent to correctly update its beliefs about hidden information when it sees other agents behaving (as it knows how other agents would behave given different states of the hidden information). This idea can be generalized to the case where all agents perform search by letting the agents simulate each other's search process. This can get expensive quickly, as agent A's beliefs in the second round also depend on agent B's search process in counterfactual scenarios in the first round, such that agent B's search in round two also has to simulate these counterfactuals. A computation budget is introduced to make this computationally feasible, and all agents know that the other agents will only use search in a turn if its cost is below the budget.

As search can be performed on top of any policy and allows leveraging compute during inference, not just training, it nicely complements more direct approaches using deep RL, a theme that has also been observed in Go and Poker.

Read more: Paper: Improving Policies via Search in Cooperative Partially Observable Games

Flo's opinion: This solution seems stunningly obvious in retrospect. While the authors informally report that their approach improves robustness to replacing other agents with humans, the example they give seems to indicate that this is because search prevents obvious mistakes in novel situations induced by human behaviour. Thus, I still expect (implicit) human models (AN #52) to be a vital component of human-machine cooperation.


Growing Neural Cellular Automata (Alexander Mordvintsev et al) (summarized by Zach): The process of an organism's shape development (morphogenesis) is an active area of research. One central problem is determining how cells decide how to grow and when to stop. One popular model for investigating this is cellular automata (CA). These model cells as living on a grid and interacting with each other via rules generated by looking at their nearest neighbors. The authors contribute to this research direction by introducing rule-sets that depend continuously on their local surroundings. The central insight connecting CA and deep learning is that because the rule-sets are constant, the update rules work similarly to a convolutional filter. This allows the authors to take advantage of methods available for training neural networks to simulate CA. Using this insight, the authors train CA that can form into images that are resistant to perturbations and deletions. In other words, the CA are capable of regeneration.
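A minimal sketch of the CA-as-convolution insight: applying the same 3x3 kernel at every cell is exactly a convolution, so one CA step can be written as fixed "perception" filters followed by a small per-cell update. The identity and Sobel filters match the paper's perception stage; the single random linear layer below is only a stand-in for the trained update network (in the paper the weights are learned by backprop through the rollout):

```python
import numpy as np

rng = np.random.default_rng(0)


def conv2d(state, kernel):
    """3x3 convolution via shifts: each cell sees only its neighbors,
    and the same kernel is applied everywhere (the CA rule-set)."""
    out = np.zeros_like(state)
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            out += kernel[di + 1, dj + 1] * np.roll(state, (di, dj), axis=(0, 1))
    return out


# Fixed perception filters: the cell's own value plus a spatial gradient.
identity = np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]], float)
sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float) / 8

# Stand-in for the learned per-cell network: one linear map from the
# 2 perceived features to a state delta (random weights here).
w = rng.normal(size=2) * 0.1


def ca_step(state):
    features = np.stack([conv2d(state, identity), conv2d(state, sobel_x)])
    return state + np.tensordot(w, features, axes=1)


state = rng.random((16, 16))
for _ in range(5):
    state = ca_step(state)
```

Because each step is differentiable and strictly local, the whole rollout can be trained with standard deep-learning tooling, which is the property the authors exploit.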

Zach's opinion: The main relevance of an approach like this is that it provides a proof of concept that complex goals, such as shape formation, can be programmed in an embarrassingly parallel fashion amenable to deep learning methodology. This naturally has implications in multi-agent settings where communication is expensive. I'd recommend checking out the main web app, which allows you to watch and interact with the CA while they're growing. They also have a code repository that is easily adaptable to training on your own patterns. For example, I grew a regenerating Patrick Star here.


Gradient Surgery for Multi-Task Learning (Tianhe Yu et al) (summarized by Nicholas): In multi-task learning, an algorithm is given data from multiple tasks and tries to learn them all simultaneously, ideally sharing information across them. This paper identifies a tragic triad of conditions that can prevent gradient descent from finding a good minimum when all three are present:

Conflicting gradients occur when the gradient from one task points in a different direction from another.

Dominating gradients occur when the gradient from one task is much larger in magnitude than another.

High curvature occurs when the multi-task curvature is high in the direction of the gradient.

In this situation, the linear approximation of the gradient in the high-curvature area leads to an overestimation of the increase in performance on the dominant gradient's task and an underestimation of the performance degradation on the conflicting gradient's task. I find picturing the parabola y = x^2, and seeing that a gradient descent step overestimates progress while a gradient ascent step underestimates it, to be helpful in understanding this.
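This parabola intuition is easy to verify numerically:

```python
# For y = x^2 at x = 1 with a step of 1, the first-order (linear) model
# overestimates the progress of a descent step and underestimates the
# gain of an ascent step, because the curvature is positive.
f = lambda x: x ** 2
x, grad, step = 1.0, 2.0, 1.0       # f'(1) = 2

descent_pred = f(x) - grad * step   # linear model predicts 1 - 2 = -1
descent_true = f(x - step)          # actually f(0) = 0
assert descent_pred < descent_true  # predicted more progress than achieved

ascent_pred = f(x) + grad * step    # linear model predicts 1 + 2 = 3
ascent_true = f(x + step)           # actually f(2) = 4
assert ascent_pred < ascent_true    # predicted less increase than occurs
```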

To solve this, they propose PCGrad, which projects each gradient onto the normal plane of the others in a pairwise fashion. Their theoretical analysis establishes convergence properties of PCGrad, and they empirically show it can be combined with other multi-task algorithms to improve performance and that it makes optimization easier for multi-task supervised learning and RL. They also show plots confirming that the necessary conditions of their theorems appear in these contexts.
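A minimal NumPy sketch of the projection step (not the authors' code; it only shows the pairwise "project out the conflicting component" operation on flat gradient vectors):

```python
import numpy as np


def pcgrad(grads, seed=0):
    """PCGrad sketch: for each task gradient, remove the component that
    conflicts (negative dot product) with each other task's gradient,
    visiting the other tasks in random order, then sum the results."""
    rng = np.random.default_rng(seed)
    grads = [np.asarray(g, float) for g in grads]
    out = []
    for i, g in enumerate(grads):
        g = g.copy()
        others = [j for j in range(len(grads)) if j != i]
        for j in rng.permutation(others):
            gj = grads[j]
            dot = g @ gj
            if dot < 0:  # conflicting: project g onto the normal plane of gj
                g = g - dot / (gj @ gj) * gj
        out.append(g)
    return np.sum(out, axis=0)


# Two conflicting task gradients (their dot product is negative):
g1, g2 = np.array([1.0, 0.0]), np.array([-1.0, 1.0])
combined = pcgrad([g1, g2])
```

With only two tasks the random ordering is irrelevant; each gradient is projected against the other once, and the surgically combined update no longer cancels along the conflicting direction the way the plain sum g1 + g2 does.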

Nicholas's opinion: I like how this paper analyzes the loss landscape of a particular problem, multi-task learning, and uses that knowledge to derive a new algorithm. One thing I always find tricky in ML papers is that it is hard to establish that the theory of why an algorithm works (typically shown on toy models) is also the reason it improves performance (typically shown using complex neural networks). I appreciate that this paper checks for the conditions of its theorems in the multi-task RL models that they train. That said, I think that in order to confirm that the tragic triad they describe is the mechanism by which PCGrad improves performance, they would need some way to toggle each element of the triad while keeping everything else fixed.


I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.


An audio podcast version of the Alignment Newsletter is available. This podcast is an audio version of the newsletter, recorded by Robert Miles.