Alignment Newsletter #51

Link post

Find all Alignment Newsletter resources here. In particular, you can sign up, or look through this spreadsheet of all summaries that have ever been in the newsletter.

You may have noticed that I’ve been slowly falling behind on the newsletter, and am now a week behind. I would just skip a week and continue—but there are actually a lot of papers and posts that I want to read and summarize, and just haven’t had the time. So instead, this week you’re going to get two newsletters. This one focuses on all of the ML-based work that I have mostly been ignoring for the past few issues.

Highlights

Towards Characterizing Divergence in Deep Q-Learning (Joshua Achiam et al): Q-Learning algorithms use the Bellman equation to learn the Q*(s, a) function, which is the long-term value of taking action a in state s. Tabular Q-Learning collects experience and updates the Q-value for each (s, a) pair independently. As long as each (s, a) pair is visited infinitely often, and the learning rate is decayed properly, the algorithm is guaranteed to converge to Q*.
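
For concreteness, here is a minimal sketch of the tabular update (the state and action space sizes and the hyperparameters below are placeholders, not from the paper):

```python
import numpy as np

n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99  # learning rate and discount factor (illustrative values)

def q_learning_update(s, a, r, s_next, done):
    # Bellman target: immediate reward plus discounted value of the best next action.
    target = r + (0.0 if done else gamma * Q[s_next].max())
    # Only the (s, a) entry moves toward the target; every other entry is untouched,
    # which is why tabular Q-learning involves no generalization between (s, a) pairs.
    Q[s, a] += alpha * (target - Q[s, a])
```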

Once we get to complex environments where you can’t enumerate all of the states, we can’t explore all of the (s, a) pairs. The obvious approach is to approximate Q*(s, a). Deep Q-Learning (DQL) algorithms use neural nets for this approximation, and use some flavor of gradient descent to update the parameters of the net such that it is closer to satisfying the Bellman equation. Unfortunately, this approximation can prevent the algorithm from ever converging to Q*.
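
A generic DQL update (not the paper’s specific algorithm; the architecture and hyperparameters are illustrative) then looks like a regression toward the bootstrapped Bellman target:

```python
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 8, 4, 0.99
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dql_update(obs, actions, rewards, next_obs, dones):
    # Bootstrapped Bellman target, treated as a fixed regression label.
    with torch.no_grad():
        target = rewards + gamma * (1 - dones) * q_net(next_obs).max(dim=1).values
    q_pred = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Unlike the tabular update, this gradient step changes shared parameters,
    # so it changes Q at every (s, a) pair, not just the ones in the batch.
```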

This paper studies the first-order Taylor expansion of the DQL update, and identifies three factors that affect it: the distribution of (s, a) pairs from which you learn, the Bellman update operator, and the neural tangent kernel, a property of the neural net that specifies how information from one (s, a) pair generalizes to other (s, a) pairs. The theoretical analysis shows that as long as there is limited generalization between (s, a) pairs, and each (s, a) pair is visited infinitely often, the algorithm will converge. Inspired by this, they design PreQN, which explicitly seeks to minimize generalization across (s, a) pairs within the same batch. They find that PreQN leads to competitive and stable performance, despite not using any of the tricks that DQL algorithms typically require, such as target networks.
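
To give a rough sense of the result (this is my rendering of the idea, and the notation may not match the paper’s exactly): a gradient step computed on data distribution μ changes the Q-value at an arbitrary pair (s̄, ā) approximately as

$$Q_{\theta'}(\bar{s}, \bar{a}) \approx Q_\theta(\bar{s}, \bar{a}) + \alpha \sum_{(s,a)} \mu(s,a)\, k_\theta\big((\bar{s},\bar{a}),(s,a)\big)\, \big(\mathcal{T} Q_\theta(s,a) - Q_\theta(s,a)\big)$$

where $\mathcal{T}$ is the Bellman operator and $k_\theta\big((\bar{s},\bar{a}),(s,a)\big) = \nabla_\theta Q_\theta(\bar{s},\bar{a}) \cdot \nabla_\theta Q_\theta(s,a)$ is the neural tangent kernel, which measures how strongly an update driven by (s, a) spills over onto (s̄, ā).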

Rohin’s opinion: I really liked this paper: it’s a rare instance where I actually wanted to read the theory in the paper because it felt important for getting the high level insight. The theory is particularly straightforward and easy to understand (which usually seems to be true when it leads to high level insight). The design of the algorithm seems more principled than others, and the experiments suggest that this was actually fruitful. The algorithm is probably more computationally expensive per step compared to other algorithms, but that could likely be improved in the future.

One thing that felt strange is that the proposed solution is basically to prevent generalization between (s, a) pairs, but the whole point of DQL algorithms is to generalize between (s, a) pairs, since you can’t get experience from all of them. Of course, since they only prevent generalization within a batch, they still generalize between (s, a) pairs that are not in the same batch, but presumably that is only because restricting generalization within the batch was all they could do. Empirically the algorithm does seem to work, but it’s still not clear to me why it works.

Technical AI alignment

Learning human intent

Deep Reinforcement Learning from Policy-Dependent Human Feedback (Dilip Arumugam et al): One obvious approach to human-in-the-loop reinforcement learning is to have humans provide an external reward signal that the policy optimizes. Previous work noted that humans tend to correct existing behavior, rather than providing an “objective” measurement of how good the behavior is (which is what a reward function is). They proposed Convergent Actor-Critic by Humans (COACH), where instead of using human feedback as a reward signal, they use it as the advantage function. This means that human feedback is modeled as specifying how good an action is relative to the “average” action that the agent would have chosen from that state. (It’s an average because the policy is stochastic.) Thus, as the policy gets better, it will no longer get positive feedback on behaviors that it has successfully learned to do, which matches how humans give reinforcement signals.
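
To make the mechanism concrete, here is a minimal sketch of a COACH-style update, assuming a small discrete-action policy network (the architecture, learning rate, and trace decay are my placeholders): the human’s scalar feedback simply takes the place of the advantage in a policy-gradient step.

```python
import torch
import torch.nn as nn

obs_dim, n_actions, lr, trace_decay = 8, 4, 1e-3, 0.9
policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, n_actions))
traces = [torch.zeros_like(p) for p in policy.parameters()]

def coach_update(obs, action, human_feedback):
    log_prob = torch.log_softmax(policy(obs), dim=-1)[action]
    grads = torch.autograd.grad(log_prob, list(policy.parameters()))
    with torch.no_grad():
        for p, e, g in zip(policy.parameters(), traces, grads):
            e.mul_(trace_decay).add_(g)      # eligibility trace over log-prob gradients
            p.add_(lr * human_feedback * e)  # feedback plays the role of the advantage
```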

This work takes COACH and extends it to the deep RL setting, evaluating it on Minecraft. While the original COACH had an eligibility trace that helps “smooth out” human feedback over time, deep COACH requires an eligibility replay buffer. For sample efficiency, they first train an autoencoder to learn a good representation of the space (presumably using experience collected with a random policy), and feed these representations into the control policy. They reward entropy so that the policy doesn’t commit to a particular behavior, making it responsive to feedback, but select actions by always picking the action with maximal probability (rather than sampling from the distribution) in order to have interpretable, consistent behavior for the human trainers to provide feedback on. They evaluate on simple navigation tasks in the complex 3D environment of Minecraft, including a task where the agent must patrol the perimeter of a room, which cannot be captured by a state-based reward function.
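
Two of those design choices are easy to illustrate (this is not the full deep COACH algorithm, and the encoder, sizes, and coefficient below are my placeholders): the policy is trained with an entropy bonus so it stays responsive to feedback, but actions are selected greedily so the trainer sees consistent behavior.

```python
import torch
import torch.nn as nn

# Features are assumed to come from the pretrained autoencoder's encoder half.
feature_dim, n_actions, entropy_coef = 32, 4, 0.01
policy_head = nn.Sequential(nn.Linear(feature_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

def select_action(features):
    # Greedy selection: pick the most probable action rather than sampling.
    return policy_head(features).argmax(dim=-1)

def entropy_bonus(features):
    # Added to the training objective so the distribution doesn't collapse onto one action.
    log_probs = torch.log_softmax(policy_head(features), dim=-1)
    return entropy_coef * -(log_probs.exp() * log_probs).sum(dim=-1).mean()
```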

Rohin’s opinion: I really like the focus on figuring out how humans actually provide feedback in practice; it makes a lot of sense that we provide reinforcement signals that reflect the advantage function rather than the reward function. That said, I wish the evaluation had more complex tasks, and had involved human trainers who were not authors of the paper—it might have taken an hour or two of human time instead of 10-15 minutes, but would have been a lot more compelling.

Before continuing, I recommend reading about Simulated Policy Learning in Video Models below. As in that case, I think that you get sample efficiency here by getting a lot of “supervision information” from the pixels used to train the VAE, though in this case it’s by learning useful features rather than using the world model to simulate trajectories. (Importantly, in this setting we care about sample efficiency with respect to human feedback as opposed to environment interaction.) I think the techniques used there could help with scaling to more complex tasks. In particular, it would be interesting to see a variant of deep COACH that alternated between training the VAE with the learned control policy, and training the learned control policy with the new VAE features. One issue would be that as you retrain the VAE, you would invalidate your previous control policy, but you could probably get around that (e.g. by also training the control policy to imitate itself while the VAE is being trained).

From Language to Goals: Inverse Reinforcement Learning for Vision-Based Instruction Following (Justin Fu et al): Rewards and language commands are more generalizable than policies: “pick up the vase” would make sense in any house, but the actions that navigate to and pick up a vase in one house would not work in another house. Based on this observation, this paper proposes that we have a dataset where for several (language command, environment) pairs, we are given expert demonstrations of how to follow the command in that environment. For each data point, we can use IRL to infer a reward function, and use that to train a neural net that can map from the language command to the reward function. Then, at test time, given a language command, we can convert it to a reward function, after which we can use standard deep RL techniques to get a policy that executes the command.
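
A minimal sketch of the test-time object (my own simplification; the encoders, dimensions, and the IRL-derived training signal are all placeholders): a network that maps observation features plus an embedded command to a scalar reward, which a standard deep RL algorithm can then optimize in a new environment.

```python
import torch
import torch.nn as nn

obs_feat_dim, cmd_feat_dim = 64, 32

reward_net = nn.Sequential(
    nn.Linear(obs_feat_dim + cmd_feat_dim, 128), nn.ReLU(), nn.Linear(128, 1)
)

def predicted_reward(obs_features, command_embedding):
    # Condition the reward on both what the agent sees and what it was told to do.
    return reward_net(torch.cat([obs_features, command_embedding], dim=-1)).squeeze(-1)

# At test time, a new command is embedded, predicted_reward defines the reward
# function in the new environment, and any standard deep RL algorithm optimizes it.
```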

The authors evaluate on a 3D house domain with pixel observations, and two types of language commands: navigation and pick-and-place. During training, since deep IRL algorithms are computationally expensive, they instead convert each task into a small, tabular MDP with known dynamics, for which they can solve the IRL problem exactly. This yields a gradient that can then be applied in the observation space to train a neural net that predicts the reward given image observations and a language command. Note that this only needs to be done at training time: at test time, the reward function can be used in a new environment with unknown dynamics and image observations. They show that the learned rewards generalize to novel combinations of objects within a house, as well as to entirely new houses (though to a lesser extent).

Rohin’s opinion: I think the success at generalization comes primarily because of the MaxEnt IRL during training: it provides a lot of structure and inductive bias that means that the rewards on which the reward predictor is trained are “close” to the intended reward function. For example, in the navigation tasks, the demonstrations for a command like “go to the vase” will involve trajectories through the states of many houses that end up at the vase. For each demonstration, MaxEnt IRL “assigns” positive reward to the states in the demonstration, and negative reward to everything else. However, once you average across demonstrations in different houses, the state with the vase gets a huge amount of positive reward (since it is in all trajectories) while all the other states are relatively neutral (since they will only be in a few trajectories, where the agent needed to pass that point in order to get to the vase). So when this is “transferred” to the neural net via gradients, the neural net is basically “told” that high reward only happens in states that contain vases, which is a strong constraint on the learned reward.

To be clear, this is not meant as a critique of the paper: indeed, I think when you want out-of-distribution generalization, you have to do it by imposing structure/inductive bias, and this is a new way to do it that I hadn’t seen before.

Using Natural Language for Reward Shaping in Reinforcement Learning (Prasoon Goyal et al): This paper constructs a dataset for grounding natural language in Atari games, and uses it to improve performance on Atari. They have humans annotate short clips with natural language: for example, “jump over the skull while going to the left” in Montezuma’s Revenge. They use this to build a model that predicts whether a given trajectory matches a natural language instruction. Then, while training an agent to play Atari, they have humans give the AI system an instruction in natural language. They use their natural language model to predict the probability that the trajectory matches the instruction, and add that as an extra shaping term in the reward. This leads to faster learning.
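
A sketch of the shaping scheme (the matching model and the coefficient are placeholders, not the paper’s):

```python
shaping_coef = 0.1  # illustrative weight on the language-based bonus

def shaped_reward(env_reward, trajectory, instruction, matches_instruction):
    # matches_instruction is assumed to return the model's probability that the
    # recent trajectory follows the natural language instruction.
    return env_reward + shaping_coef * matches_instruction(trajectory, instruction)
```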

ProLoNets: Neural-encoding Human Experts’ Domain Knowledge to Warm Start Reinforcement Learning (Andrew Silva et al)

Interpretability

Visualizing memorization in RNNs (Andreas Madsen): This is a short Distill article that showcases a visualization tool that demonstrates how contextual information is used by various RNN units (LSTMs, GRUs, and nested LSTMs). The method is very simple: for each character in the context, they highlight the character in proportion to the gradient of the logits with respect to that character. Looking at this visualization allows us to see that GRUs are better at using long-term context, while LSTMs perform better for short-term contexts.

Rohin’s opinion: I’d recommend you actually look at and play around with the visualization; it’s very nice. The summary is short because the value of the work is in the visualization, not in the technical details.

Other progress in AI

Exploration

Learning Exploration Policies for Navigation (Tao Chen et al)

Deep Reinforcement Learning with Feedback-based Exploration (Jan Scholten et al)

Reinforcement learning

Towards Characterizing Divergence in Deep Q-Learning (Joshua Achiam et al): Summarized in the highlights!

Eighteen Months of RL Research at Google Brain in Montreal (Marc Bellemare): One approach to reinforcement learning is to predict the entire distribution of returns from taking an action, instead of predicting just the expected return. Empirically, this works better, even though in both cases we choose the action with highest expected return. This blog post provides an overview of work at Google Brain Montreal that attempts to understand this phenomenon. I’m only summarizing the part that most interested me.

First, they found that in theory, distributional RL performs on par with or worse than standard RL when using either a tabular representation or linear features. They then tested this empirically on Cartpole, and found similar results: distributional RL performed worse when using tabular or linear representations, but better when using a deep neural net. This suggests that distributional RL “learns better representations”. So, they visualize representations for RL on the four-room environment, and find that distributional RL captures more structured representations. Similarly, this paper showed that predicting value functions for multiple discount rates is an effective way to produce auxiliary tasks for Atari.

Rohin’s opinion: This is a really interesting mystery with deep RL, and after reading this post I have a story for it. Note that I am far from an expert in this field, and it’s quite plausible that if I read the papers cited in this post I would find that this story is false, but here’s the story anyway. As we saw with PreQN earlier in this issue, one of the most important aspects of deep RL is how information about one (s, a) pair is used to generalize to other (s, a) pairs. I’d guess that the benefit from distributional RL is primarily that you get “good representations” that let you do this generalization well. With a tabular representation you don’t do any generalization, and with a linear feature space the representation is hand-designed by humans to do this generalization well, so distributional RL doesn’t help in those cases.

But why does distributional RL learn good representations? I claim that it provides stronger supervision given the same amount of experience. With normal expected RL, the features in the final layer of the neural net need only be useful for predicting the expected return, but with distributional RL they must be useful for predicting all of the quantiles of the return distribution. There may be “shortcuts” or “heuristics” that allow you to predict the expected return well because of spurious correlations in your environment, but it’s less likely that those heuristics work well for all of the quantiles of the return distribution. As a result, having to predict more things enforces a stronger constraint on what representations your neural net must have, and thus you are more likely to find good representations. This perspective also explains why predicting value functions for multiple discount rates helps with Atari, and why adding auxiliary tasks is often helpful (as long as the auxiliary task is relevant to the main task).
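
Here is a sketch of that “stronger supervision” point (sizes and the number of quantiles are illustrative, and this is not any specific paper’s architecture): the same trunk features have to support predicting every quantile of the return distribution, even though actions are still chosen by the mean.

```python
import torch
import torch.nn as nn

obs_dim, n_actions, n_quantiles = 8, 4, 51

trunk = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU())
mean_head = nn.Linear(128, n_actions)                     # expected-value RL: one output per action
quantile_head = nn.Linear(128, n_actions * n_quantiles)   # distributional RL: n_quantiles per action

def q_values(obs):
    features = trunk(obs)
    quantiles = quantile_head(features).view(-1, n_actions, n_quantiles)
    # Actions are still chosen by expected value (the mean over quantiles),
    # but the trunk features had to support predicting the whole distribution.
    return quantiles.mean(dim=-1)
```

The experiment proposed in the next paragraph would instead give each quantile its own trunk, so that no single set of features is forced to explain the whole distribution.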

The important aspect here is that all of the quantiles are forcing the same neural net to learn good representations. If you instead have different neural nets predicting each quantile, each neural net has roughly the same amount of supervision as in expected RL, so I’d expect that to work about as well as expected RL, maybe a little worse since quantiles are probably harder to predict than means. If anyone actually runs this experiment, please do let me know the result!

Diagnosing Bottlenecks in Deep Q-learning Algorithms (Justin Fu, Aviral Kumar et al): While the PreQN paper used a theoretical approach to tackle Deep Q-Learning algorithms, this one takes an empirical approach. Their results:

- Small neural nets cannot represent Q*, and so have undesired bias that results in worse performance. However, they also have convergence issues, where the Q-function they actually converge to is significantly worse than the best Q-function that they could express. Larger architectures mitigate both of these problems.

- With more samples, we get a lower validation loss, showing that we are overfitting when data is limited. Despite this, larger architectures are better, because the performance loss from overfitting is not as bad as the performance loss from having a bad bias. A good early stopping criterion could help with this.

- To study how non-stationarity affects DQL algorithms, they study a variant where the Q-function is a moving average of the past Q-functions (instead of the full update), which means that the target values don’t change as quickly (i.e. it is closer to a stationary target). They find that non-stationarity doesn’t matter much for large architectures.

- To study distribution shift, they look at the difference between the expected Bellman error before and after an update to the parameters. They find that distribution shift doesn’t correlate much with performance and so is likely not important.

- Algorithms differ strongly in the distribution over (s, a) pairs that the DQL update is computed over. They study this in the absence of sampling (i.e. when they simply weight all possible (s, a) pairs, rather than just the ones sampled from a policy) and find that distributions that are “close to uniform” perform best. They hypothesize that this is the reason that experience replay helps—initially an on-policy algorithm would take samples from a single policy, while experience replay adds samples from previous versions of the policy, which should increase the coverage of (s, a) pairs.

To sum up, the important factors are using an expressive neural net architecture, and designing a good sampling distribution. Inspired by this, they design Adversarial Feature Matching (AFM), which like Prioritized Experience Replay (PER) puts more weight on samples that have high Bellman error. However, unlike PER, AFM does not try to reduce distribution shift via importance sampling, since their experiments found that this was not important.
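
As a simplified illustration of that shared idea (this is not the actual AFM algorithm, which frames the weighting as an adversarial feature-matching problem): weight transitions by their Bellman error and, unlike PER, apply no importance-sampling correction.

```python
import numpy as np

def sample_indices(bellman_errors, batch_size, temperature=1.0):
    # Transitions with larger Bellman error are sampled more often.
    errors = np.abs(np.asarray(bellman_errors)) + 1e-6
    probs = errors ** temperature
    probs /= probs.sum()
    # No importance-sampling weights are returned, unlike PER.
    return np.random.choice(len(errors), size=batch_size, p=probs)
```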

Rohin’s opinion: This is a great experimental paper; there’s a lot of data that can help understand DQL algorithms. I wouldn’t take the results too literally, since insights on simple environments may not generalize to more complex environments. For example, they found overfitting to be an issue in their environments—it’s plausible to me that with more complex environments (think Dota/StarCraft, not Mujoco) this reverses and you end up underfitting the data you have. Nonetheless, I think data like this is particularly valuable for coming up with an intuitive theory of how deep RL works, if not a formal one.

Simulated Policy Learning in Video Models (Lukasz Kaiser, Mohammad Babaeizadeh, Piotr Miłos, Błazej Osinski et al): This blog post and the associated paper tackle model-based RL for Atari. The recent world models (AN #23) paper proposed first learning a model of the world by interacting with the environment using a random policy, and then using the model to simulate the environment and training a control policy using those simulations. (This wasn’t its main point, but it was one of the things it talked about.) The authors take this idea and put it in an iterative loop: they first train the world model using experience from a random policy, then train a policy using the world model, retrain the world model with experience collected using the newly trained policy, retrain the policy, and so on. This allows them to correct any mistakes in the world model and let it adapt to novel situations that the control policy discovers. As a result, they can train agents that play Atari with only 100K interactions with the environment (corresponding to about two hours of real-time gameplay), though the final performance is lower than the state-of-the-art achieved with model-free RL. See Import AI for more details.
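
The overall loop looks something like the following sketch (the component functions are placeholders passed in as arguments, standing in for the paper’s video world model and model-free learner; this is not the authors’ code):

```python
def simulated_policy_learning(env, policy, world_model, collect_experience,
                              train_world_model, train_policy_in_model,
                              n_iterations, real_steps_per_iter):
    for _ in range(n_iterations):
        # 1. Collect a small amount of real experience with the current policy.
        real_data = collect_experience(env, policy, real_steps_per_iter)
        # 2. (Re)train the world model so it covers the situations the policy now reaches.
        world_model = train_world_model(world_model, real_data)
        # 3. Train the policy entirely inside the learned model, at zero real-sample cost.
        policy = train_policy_in_model(policy, world_model)
    return policy
```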

Rohin’s opinion: This work follows the standard pattern where model-based RL is more sample efficient but reaches worse final performance compared to model-free RL. Let’s try to explain this using the same story as in the rest of this newsletter.

The sample efficiency comes from the fact that they learn a world model that can predict the future, and then use that model to solve the control problem (which has zero sample cost, since you are no longer interacting with the environment). It turns out that predicting the future is “easier” than selecting the optimal action, and so the world model can be trained in fewer samples than it would take to solve the control problem directly. Why is the world model “easier” to learn? One possibility is that solving the control problem requires you to model the world anyway, and so must be a harder problem. If you don’t know what your actions are going to do, you can’t choose the best one. I don’t find this very compelling, since there are lots of aspects of world modeling that are irrelevant to the control problem—you don’t need to know exactly how the background art will change in order to choose what action to take, but world modeling requires you to do this. I think the real reason is that world modeling benefits from much more supervision—rather than getting a sparse reward signal over a trajectory, you get a full grid of pixels every timestep that you were supposed to predict. This gives you many orders of magnitude more “supervision information” per sample, and so it makes it easier to learn. (This is basically the same argument as in Yann LeCun’s cake analogy.)

Why does it lead to worse performance overall? The policy is now being trained using rollouts that are subtly wrong, and so instead of specializing to the true Atari dynamics it will be specialized to the world model dynamics, which is going to be somewhat different and should lead to a slight dip in performance. (Imagine a basketball player having to shoot a ball that was a bit heavier than usual—she’ll probably still be good, but not as good as with a regular basketball.) In addition, since the world model is supervised by pixels, any small objects are not very important to the world model (i.e. getting them wrong does not incur much loss), even if they are very important for control. In fact, they find that bullets tend to disappear in Atlantis and Battle Zone, which is not good if you want to learn to play those games.

I’m not sure if they shared weights between the world model and the control policy. If they did, then they would also have the problem that the features that are useful for predicting the future are not the same as the features that are useful for selecting actions, which would also cause a drop in performance. My guess is that they didn’t share weights for precisely this reason, but I’m not sure.

Read more: Model-Based Reinforcement Learning for Atari

Unifying Physics and Deep Learning with TossingBot (Andy Zeng): TossingBot is a system that learns how to pick up and toss objects into bins using deep RL. The most interesting thing about it is that instead of using neural nets to directly predict actions, they are instead used to predict adjustments to actions that are computed by a physics-based controller. Since the physics-based controller generalizes well to new situations, TossingBot is also able to generalize to new tossing locations.
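
The general pattern is easy to sketch (this is not TossingBot’s actual code; the controller, dimensions, and network are placeholders): a physics-based controller proposes an action, and a learned network adds a correction on top of it.

```python
import torch
import torch.nn as nn

obs_dim, action_dim = 16, 3
residual_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, action_dim))

def act(obs, physics_controller):
    baseline = physics_controller(obs)   # generalizes well to new situations
    correction = residual_net(obs)       # learned adjustment on top of the baseline
    return baseline + correction
```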

Rohin’s opinion: This is a cool example of using structured knowledge in order to get generalization while also using deep learning in order to get performance. I also recently came across Residual Reinforcement Learning for Robot Control, which seems to have the same idea of combining deep RL with conventional control mechanisms. I haven’t read either of the papers in depth, so I can’t compare them, but a very brief skim suggests that their techniques are significantly different.

Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables (Kate Rakelly, Aurick Zhou et al)

Deep learning

Measuring the Limits of Data Parallel Training for Neural Networks (Chris Shallue and George Dahl): Consider the relationship between the size of a single batch and the number of batches needed to reach a specific performance bound when using deep learning. If all that mattered for performance was the total number of examples that you take gradient steps on (i.e. the product of these two numbers), then you would expect a perfect inverse relationship between these two quantities, which would look like a line with negative slope on a log-log plot. In this case, we could scale batch sizes up arbitrarily far, and distribute them across as many machines as necessary, in order to reduce wall clock training time. A 2x increase in batch size with twice as many machines would lead to a 2x decrease in training time. However, as you make batch sizes really large, you face the problem of stale gradients: if you had updated on the first half of the batch and then computed gradients on the second half of the batch, the gradients for the second half would be “better”, because they were computed with respect to a better set of parameters. When this effect becomes significant, you no longer get the nice linear scaling from parallelization.

This post studies the relationship empirically across a number of datasets, architectures, and optimization algorithms. They find that universally, there is initially an era of perfect linear scaling as you increase batch size, followed by a region of diminishing marginal returns that ultimately leads to an asymptote where increasing batch size doesn’t help at all with reducing wall-clock training time. However, the transition points between these regimes vary wildly, suggesting that there may be low hanging fruit in the design of algorithms or architectures that explicitly aim to achieve very good scaling.
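
A toy model of those regimes (my illustration, not the paper’s fitted curves): suppose the number of steps needed behaves like steps(B) = S_min · (1 + B_crit / B). For batch sizes much smaller than B_crit this is approximately inverse scaling (doubling the batch halves the steps), and for much larger batches it flattens out at S_min, so extra parallelism stops helping.

```python
def steps_to_target(batch_size, min_steps=1000, critical_batch=4096):
    # Small batches: roughly min_steps * critical_batch / batch_size (perfect scaling).
    # Large batches: approaches min_steps, the asymptote where bigger batches don't help.
    return min_steps * (1 + critical_batch / batch_size)

for b in [64, 256, 1024, 4096, 16384]:
    print(b, round(steps_to_target(b)))
```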

Rohin’s opinion: OpenAI found (AN #37) that the best predictor of the maximum useful batch size was how noisy the gradient is. Presumably when you have noisy gradients, a larger batch size helps “average out” the noise across examples. Rereading their post, I notice that they mentioned the study I’ve summarized here and said that their results can help explain why there’s so much variance in the transition points across datasets. However, I don’t think it can explain the variance in transition points across architectures. Noisy gradients are typically a significant problem, and so it would be weird if the variance in transition points across architectures were explained by the noisiness of the gradient: that would imply that two architectures reach the same final performance even though one had the problem of noisy gradients while the other didn’t. So there seems to be something left to explain here.

That said, I haven’t looked in depth at the data, so the explanation could be very simple. For example, maybe the transition points don’t vary much across architecture and vary much more across datasets, and the variance across architecture is small enough that its effect on performance is dwarfed by all the other things that can affect the performance of deep learning systems. Or perhaps while the noisiness of the gradient is a good predictor of the maximum batch size, it still only explains, say, 40% of the effect, and so variance across architectures is totally compatible with factors other than the gradient noise affecting the maximum batch size.