# [AN #92]: Learning good representations with contrastive predictive coding

Newslet­ter #92

Align­ment Newslet­ter is a weekly pub­li­ca­tion with re­cent con­tent rele­vant to AI al­ign­ment around the world. Find all Align­ment Newslet­ter re­sources here. In par­tic­u­lar, you can look through this spread­sheet of all sum­maries that have ever been in the newslet­ter.

Au­dio ver­sion here (may not be up yet).

# HIGHLIGHTS

Representation Learning with Contrastive Predictive Coding (Aaron van den Oord et al) (summarized by Rohin): This paper from 2018 proposed Contrastive Predictive Coding (CPC): a method of unsupervised learning that has been quite successful. At its core it is quite simple: it combines the ideas of predictive coding and contrastive losses, both of which have been extensively studied in the past.

The sim­plest form of un­su­per­vised learn­ing would be data com­pres­sion via gen­er­a­tive mod­els (as in e.g. VAEs), in which, to model the data p(x), you at­tempt to en­code x into a la­tent (hid­den) state z in such a way that you can then re­cover the origi­nal data point x from z. In­tu­itively, we want z to have high mu­tual in­for­ma­tion with x.

For se­quen­tial data in a par­tially ob­served set­ting, you need to deal with the full se­quence. Con­sider nat­u­ral lan­guage: in this set­ting, each x would be a sin­gle word. Con­sider the sen­tence “I sat on the chair”. If the z cor­re­spond­ing to the word “the” only has to re­con­struct the word “the”, it’s not go­ing to “re­mem­ber” that the past con­text in­volved sit­ting, and so that z would be ter­rible at pre­dict­ing that the next word will be chair. To fix this, we can use pre­dic­tive cod­ing, where we in­stead re­quire that we can pre­dict fu­ture words us­ing z. This now in­cen­tivizes z_t to have high mu­tual in­for­ma­tion with x_{t+k}.

There is still a problem: reconstructing the entire input x would require a lot of irrelevant information, such as the background color of the environment in RL, even if that never changes. How can we get rid of these irrelevant features? Contrastive losses allow us to do this: intuitively, since the irrelevant features are the ones that are common across all the xs (and so are fully captured by p(x)), if we train the neural net to distinguish between various xs, we can incentivize only the relevant features. In particular, given a latent state z_t, we take the true x_{t+k}, and throw in a bunch of other xs sampled from p(x) (known as negative samples), and train the network to correctly classify x_{t+k}. The authors show that the optimum of this loss function is indeed for the neural net to compute a quantity proportional to p(x_{t+k} | z_t) / p(x_{t+k}), which implies that it is maximizing a lower bound on the mutual information between x_{t+k} and z_t.
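For reference, the contrastive loss and the bound it implies can be written compactly. This is a sketch in the summary's notation (z_t for the context, where the paper writes c_t); here X denotes the set of N candidates containing the one true x_{t+k} and N - 1 negative samples, and f_k is the learned scoring function.

```latex
\mathcal{L}_N = -\,\mathbb{E}\left[ \log \frac{f_k(x_{t+k}, z_t)}{\sum_{x_j \in X} f_k(x_j, z_t)} \right],
\qquad
f_k^{*}(x_{t+k}, z_t) \propto \frac{p(x_{t+k} \mid z_t)}{p(x_{t+k})},
\qquad
I(x_{t+k}; z_t) \;\ge\; \log N - \mathcal{L}_N
```

Minimizing the loss therefore pushes up a lower bound on the mutual information, and the bound can never exceed log N, which is one reason the number of negative samples matters.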

This gives us a pretty sim­ple over­all al­gorithm. Take a se­quence x_1 … x_T, com­pute z_t us­ing a re­cur­rent model on x_1 … x_t, put x_{t+k} and some nega­tive sam­ples into a set, and train a clas­sifier to cor­rectly pre­dict which of the sam­ples is the true x_{t+k}. In prac­tice, we do batches of these at the same time, and for ev­ery data point in the batch we use all of the other data points as our nega­tive ex­am­ples. The fea­tures you learn are then the ones that help dis­t­in­guish be­tween x_{t+k} and the nega­tive sam­ples, and you’ll ig­nore any fea­tures that are com­mon across all the sam­ples. This means that the re­sults de­pend quite a lot on how you choose your sam­ples (this effec­tively de­ter­mines what p(x) you are us­ing).
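The batch version of this classifier can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the inputs are assumed to be already-encoded vectors, and a plain dot product stands in for the learned scoring function.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def info_nce_loss(contexts, futures):
    """Mean cross-entropy of picking the true future for each context.

    contexts[i] plays the role of z_t for sequence i, and futures[i] is the
    encoding of the true x_{t+k} for that sequence. Every other futures[j]
    in the batch acts as a negative sample.
    """
    n = len(contexts)
    total = 0.0
    for i in range(n):
        # Assumption: a dot product stands in for the learned scoring function.
        scores = [dot(contexts[i], futures[j]) for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        # Negative log-probability assigned to the true future (index i).
        total += log_norm - scores[i]
    return total / n


def mi_lower_bound(loss, n):
    """The log N - loss bound on mutual information, in nats."""
    return math.log(n) - loss


# Contexts that point at their own future give a loss near zero.
aligned = [[5.0, 0.0], [0.0, 5.0]]
informative_loss = info_nce_loss(aligned, aligned)

# Contexts that carry no information score every candidate equally.
blank = [[0.0, 0.0], [0.0, 0.0]]
uninformative_loss = info_nce_loss(blank, aligned)
```

With uninformative contexts the loss equals log N, so the bound is zero; with well-aligned contexts the loss approaches zero and the bound approaches log N.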

The au­thors eval­u­ate their al­gorithm on sev­eral do­mains and show that it achieves or sur­passes state of the art on them.

Ro­hin’s opinion: I like this pa­per: the in­tu­ition makes sense, the math is straight­for­ward, and the em­piri­cal re­sults are strong, and have con­tinued to be strong when look­ing at later work that builds on it.

On Vari­a­tional Bounds of Mu­tual In­for­ma­tion (Ben Poole et al) (sum­ma­rized by Ro­hin): This pa­per is a pretty dense and tech­ni­cal ex­pla­na­tion of var­i­ous ways in which we can es­ti­mate and/​or op­ti­mize the mu­tual in­for­ma­tion be­tween two vari­ables. I speci­fi­cally want to high­light that it pro­vides a proof that the Con­trastive Pre­dic­tive Cod­ing ob­jec­tive (sum­ma­rized above) is a lower bound on the mu­tual in­for­ma­tion be­tween the in­put and the rep­re­sen­ta­tion, and com­pares it to other lower bounds on mu­tual in­for­ma­tion.

# TECHNICAL AI ALIGNMENT

## TECHNICAL AGENDAS AND PRIORITIZATION

An An­a­lytic Per­spec­tive on AI Align­ment (Daniel Filan) (sum­ma­rized by Asya): In this post, Daniel Filan pre­sents an an­a­lytic per­spec­tive on how to do use­ful AI al­ign­ment re­search. His take is that in a world with pow­er­ful AGI sys­tems similar to neu­ral net­works, it may be suffi­cient to be able to de­tect whether a sys­tem would cause bad out­comes be­fore you de­ploy it on real-world sys­tems with un­known dis­tri­bu­tions. To this end, he ad­vo­cates for work on trans­parency that gives mechanis­tic un­der­stand­ings (AN #15) of the sys­tems in ques­tion, com­bined with foun­da­tional re­search that al­lows us to rea­son about the safety of the pro­duced un­der­stand­ings.

Ro­hin’s opinion: My broad take is that I agree that an­a­lyz­ing neu­ral nets is use­ful and more work should go into it, but I broadly dis­agree that this leads to re­duced x-risk by in­creas­ing the like­li­hood that de­vel­op­ers can look at their trained model, de­ter­mine whether it is dan­ger­ous by un­der­stand­ing it mechanis­ti­cally, and de­cide whether to de­ploy it, in a “zero-shot” way. The key difficulty here is the mechanis­tic trans­parency, which seems like far too strong a prop­erty for us to aim for: I would ex­pect the cost of mak­ing a neu­ral net­work mechanis­ti­cally trans­par­ent to far ex­ceed the cost of train­ing that neu­ral net­work in the first place, and so it would be hard to get de­vel­op­ers to mechanis­ti­cally un­der­stand trained mod­els to de­tect dan­ger.

Right now for e.g. image clas­sifiers, some peo­ple on OpenAI’s Clar­ity team have spent mul­ti­ple years un­der­stand­ing a sin­gle image clas­sifier, which is or­ders of mag­ni­tude more ex­pen­sive than train­ing the clas­sifier. My guess is that this will be­come su­per­lin­early harder as mod­els get big­ger (and es­pe­cially as mod­els be­come su­per­hu­man), and so it seems quite un­likely that we could have mechanis­tic trans­parency for very com­plex AGI sys­tems built out of neu­ral nets. More de­tails in this com­ment. Note that Daniel agrees that it is an open ques­tion whether this sort of mechanis­tic trans­parency is pos­si­ble, and thinks that we don’t have much ev­i­dence yet that it isn’t.

## ROBUSTNESS

The Con­di­tional En­tropy Bot­tle­neck (Ian Fischer) (sum­ma­rized by Ro­hin): While I’ve cat­e­go­rized this pa­per un­der ro­bust­ness be­cause it can ap­ply to most forms of train­ing, I’ll talk about it speci­fi­cally in the con­text of un­su­per­vised learn­ing (and in par­tic­u­lar its re­la­tion to Con­trastive Pre­dic­tive Cod­ing (CPC), sum­ma­rized in the high­lights).

One po­ten­tial prob­lem with deep learn­ing is that there might be too much in­for­ma­tion in the in­put, caus­ing the model to learn spu­ri­ous cor­re­la­tions that do not ac­tu­ally gen­er­al­ize well (see Causal Con­fu­sion in Imi­ta­tion Learn­ing (AN #79) as an ex­am­ple). The idea with the Con­di­tional En­tropy Bot­tle­neck (CEB) is to pe­nal­ize the model for learn­ing ir­rele­vant in­for­ma­tion, us­ing a form of in­for­ma­tion bot­tle­neck.

We consider a setting where we want to learn a representation Z of some input data X in order to predict some downstream data Y. In CPC, X would be the inputs from time 1 to t, Z would be the latent representation z_t, and Y would be the future data x_{t+k}. Then, we want Z to capture the minimum information needed to predict Y as well as possible. The necessary information is I(Y; Z), that is, the mutual information between Z and Y: we want to maximize this to maximize our accuracy at predicting Y. Since Y depends on X, and Z is computed from X, any information about Y must come through mutual information between X and Z. Maximizing just this I(Y; Z) term gives us Contrastive Predictive Coding.

How­ever, we don’t want to cap­ture any ex­tra ir­rele­vant in­for­ma­tion (the min­i­mal­ity crite­rion), which means that Z shouldn’t cap­ture any more in­for­ma­tion about X be­yond what it cap­tured to max­i­mize I(Y; Z). In in­for­ma­tion-the­o­retic terms, we want to min­i­mize I(X; Z | Y). Thus, we have the CEB ob­jec­tive: min­i­miz­ing I(X; Z | Y) - γ I(Y; Z), where γ is a hy­per­pa­ram­e­ter con­trol­ling the trade­off be­tween the two terms. The au­thors then use some fairly straight­for­ward math to re­duce the ob­jec­tive to sim­pler terms which can be bounded us­ing vari­a­tional ap­prox­i­ma­tions, lead­ing to an al­gorithm that can work in prac­tice.
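To make the two terms concrete, here is a toy calculation on a small discrete distribution. Everything in it is an assumption made for illustration (the distribution, the two deterministic "encoders", and the function names); it evaluates the information quantities exactly and is not the variational algorithm from the paper.

```python
import math
from collections import defaultdict

X, Y, Z = 0, 1, 2  # positions of each variable inside an (x, y, z) outcome


def marginal(joint, keep):
    """Sum a joint distribution {(x, y, z): p} down to the positions in keep."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in keep)] += p
    return out


def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def H(joint, *keep):
    return entropy(marginal(joint, keep))


def mutual_information_yz(joint):
    # I(Y; Z) = H(Y) + H(Z) - H(Y, Z)
    return H(joint, Y) + H(joint, Z) - H(joint, Y, Z)


def conditional_mi_xz_given_y(joint):
    # I(X; Z | Y) = H(X, Y) + H(Y, Z) - H(X, Y, Z) - H(Y)
    return H(joint, X, Y) + H(joint, Y, Z) - H(joint, X, Y, Z) - H(joint, Y)


def ceb_objective(joint, gamma):
    return conditional_mi_xz_given_y(joint) - gamma * mutual_information_yz(joint)


def build_joint(encoder):
    # Toy data: X is uniform on {0, 1, 2, 3} and the label Y is its high bit.
    return {(x, x // 2, encoder(x)): 0.25 for x in range(4)}


copy_everything = build_joint(lambda x: x)      # Z keeps all of X
keep_label_bit = build_joint(lambda x: x // 2)  # Z keeps only what predicts Y
```

Both encoders reach I(Y; Z) = 1 bit, but the copying encoder also carries 1 bit of I(X; Z | Y), so at γ = 1 its objective is 0 while the minimal encoder scores -1. The objective prefers the representation that drops the bit of X that is irrelevant to Y.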

The au­thors perform ex­per­i­ments on Fash­ion MNIST and CIFAR10 (where Y cor­re­sponds to the la­bels for the images, so we’re in the su­per­vised learn­ing set­ting). Since the main benefit of CEB is to re­move un­nec­es­sary in­for­ma­tion from the model, they eval­u­ate ad­ver­sar­ial ro­bust­ness and out-of-dis­tri­bu­tion de­tec­tion in ad­di­tion to stan­dard perfor­mance checks. They find that mod­els trained with CEB perform bet­ter than ones trained with a vari­a­tional in­for­ma­tion bot­tle­neck, or ones trained with vanilla SGD.

Ro­hin’s opinion: While I’m not sure to what ex­tent mod­els learn truly ir­rele­vant in­for­ma­tion (see Ad­ver­sar­ial Ex­am­ples Are Not Bugs, They Are Fea­tures (AN #62)), it seems good to add an in­cen­tive against learn­ing in­for­ma­tion that won’t be use­ful for a down­stream task, and the em­piri­cal re­sults (es­pe­cially of the next pa­per) sug­gest that it is pro­vid­ing some benefit.

CEB Im­proves Model Ro­bust­ness (Ian Fischer et al) (sum­ma­rized by Ro­hin): This em­piri­cal pa­per finds that ImageNet clas­sifiers trained with the CEB ob­jec­tive (sum­ma­rized above) are already some­what ad­ver­sar­i­ally ro­bust, with­out hav­ing any de­crease in ac­cu­racy, and with­out any ad­ver­sar­ial train­ing. Notably, since CEB does not rely on know­ing the at­tack method ahead of time, its ad­ver­sar­ial ro­bust­ness gen­er­al­izes to mul­ti­ple kinds of at­tacks, whereas mod­els that were ad­ver­sar­i­ally trained tend to be frag­ile in the face of pre­vi­ously un­seen at­tacks.

# OTHER PROGRESS IN AI

## REINFORCEMENT LEARNING

Illu­mi­nat­ing Gen­er­al­iza­tion in Deep Re­in­force­ment Learn­ing through Pro­ce­du­ral Level Gen­er­a­tion (Niels Juste­sen et al) (sum­ma­rized by Zach): Deep re­in­force­ment learn­ing has been able to use high-di­men­sional in­put, such as images, to learn op­ti­mal poli­cies. How­ever, when neu­ral net­works are trained in a fixed en­vi­ron­ment, such as on a sin­gle level in a video game, they will usu­ally over-fit and fail to gen­er­al­ize to new lev­els. This pa­per uses pro­ce­du­rally gen­er­ated lev­els dur­ing train­ing in an at­tempt to in­crease the gen­er­al­ity of deep RL. They make use of the Gen­eral Video Game AI frame­work (GVG-AI) which al­lows rapid de­sign of video games through the speci­fi­ca­tion of re­wards, ob­jects, etc. More­over, they in­tro­duce Pro­gres­sive PCG (PPCG) to smoothly con­trol the difficulty of gen­er­ated lev­els to build a cur­ricu­lum for the agent. The au­thors show that for some games pro­ce­du­ral level gen­er­a­tion en­ables gen­er­al­iza­tion to new lev­els within the same dis­tri­bu­tion.
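The summary does not spell out how PPCG adjusts difficulty, so the following is only one plausible shape for such a curriculum: nudge a difficulty value in [0, 1] up after a win and down after a loss. The function name, the step size, and the clipping are all assumptions for illustration.

```python
def update_difficulty(difficulty, won, step=0.01):
    """Move difficulty one small step toward harder (win) or easier (loss) levels.

    Hypothetical update rule: the step size and the [0, 1] range are assumed.
    """
    delta = step if won else -step
    # Clip so the level generator always receives a value in [0, 1].
    return min(1.0, max(0.0, difficulty + delta))


# Simulate an agent that wins three episodes and then loses one.
difficulty = 0.0
for won in [True, True, True, False]:
    difficulty = update_difficulty(difficulty, won, step=0.1)
```

A small step keeps consecutive levels similar in difficulty, which is what makes the resulting curriculum smooth.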

Zach’s opinion: The GVG-AI frame­work seems like a use­ful tool to ex­plore learn­ing videogames. Set­ting up cur­ricu­lum learn­ing by us­ing PPCG is also a clever idea. How­ever, the re­sults are a bit mixed. On two of the games they tested, train­ing on a sin­gle difficult level works bet­ter than train­ing on a va­ri­ety of lev­els for gen­er­al­iza­tion. Hav­ing said this, the method can learn the game Frogs (57% win rate) while DQN/​A2C make zero progress even af­ter 40 mil­lion steps. It seems as though cer­tain con­di­tions make PPCG a good method to use. It’d be in­ter­est­ing to in­ves­ti­gate what those con­di­tions are in a fu­ture pub­li­ca­tion.

## DEEP LEARNING

SLIDE : In Defense of Smart Al­gorithms over Hard­ware Ac­cel­er­a­tion for Large-Scale Deep Learn­ing Sys­tems (Beidi Chen et al) (sum­ma­rized by Asya): This pa­per pre­sents an al­gorith­mic tech­nique called SLIDE (Sub-LIn­ear Deep learn­ing Eng­ine) which takes ad­van­tage of spar­sity in in­puts and ac­ti­va­tions to speed up the train­ing of large neu­ral net­works.

Suppose that activations at layer k are a_k. Then, the ith element of a_{k+1} is given by the dot product of a_k and w_i for some weight vector w_i. Call w_i the ith neuron of layer k + 1. The largest activations in a_{k+1} are the ones for which w_i has high magnitude and points in the same direction as a_k. The core proposal of SLIDE is to only compute the largest elements of a_{k+1}, which they call the “activated neurons”, and approximate all of the others as zero, allowing us to avoid a lot of computation.

In or­der to do this, we main­tain a data struc­ture called a lo­cal­ity-sen­si­tive hash table, which when given an ac­ti­va­tion a_k can tell us which neu­rons (w_is) are most similar. We can then com­pute the out­puts for just those neu­rons to get a_{k+1}. In this way, we can effec­tively ‘spar­sify’ the net­work, calcu­lat­ing the ac­ti­va­tions and up­dat­ing the weights of only a small sub­set of the neu­rons. This is what gives us our com­pu­ta­tional gains.
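The lookup can be sketched with a simple locality-sensitive hash. This is a toy version under stated assumptions: it uses sign-of-random-projection hashing (SimHash), which is sensitive to direction but not magnitude, and the class and function names are hypothetical rather than taken from the SLIDE codebase.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


class SimHashTables:
    """Several hash tables keyed by the sign pattern of random projections."""

    def __init__(self, dim, num_tables, bits_per_table, seed=0):
        rng = random.Random(seed)
        # One independent set of random hyperplanes per table.
        self.planes = [
            [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits_per_table)]
            for _ in range(num_tables)
        ]
        self.tables = [dict() for _ in range(num_tables)]

    def _code(self, table_index, vector):
        return tuple(dot(plane, vector) >= 0 for plane in self.planes[table_index])

    def insert(self, neuron_id, weights):
        for t, table in enumerate(self.tables):
            table.setdefault(self._code(t, weights), set()).add(neuron_id)

    def query(self, activation):
        """Return ids of neurons sharing a bucket with the activation in any table."""
        hits = set()
        for t, table in enumerate(self.tables):
            hits |= table.get(self._code(t, activation), set())
        return hits


def sparse_layer(activation, neurons, index):
    """Compute outputs only for retrieved neurons; all others are treated as zero."""
    return {i: dot(neurons[i], activation) for i in index.query(activation)}


rng = random.Random(1)
dim = 8
neurons = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(200)]
a_k = [rng.gauss(0.0, 1.0) for _ in range(dim)]
neurons.append(list(a_k))  # a neuron perfectly aligned with the input

index = SimHashTables(dim, num_tables=4, bits_per_table=6)
for neuron_id, weights in enumerate(neurons):
    index.insert(neuron_id, weights)

a_next = sparse_layer(a_k, neurons, index)  # sparse a_{k+1}: {neuron id: value}
```

The aligned neuron is always retrieved because it hashes to the same bucket as the input in every table, while most unrelated neurons are never touched. More bits per table make buckets more selective, and more tables reduce the chance of missing a similar neuron.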

SLIDE ran­domly ini­tial­izes weights in the net­work and gen­er­ates the lo­cal­ity-sen­si­tive hash table that maps ac­ti­va­tions to ac­ti­vated neu­rons. To take a gra­di­ent step on an in­put, it calcu­lates the ac­ti­vated neu­rons in a for­ward pass, then back­prop­a­gates through the ac­ti­vated neu­rons, and then up­dates the lo­cal­ity-sen­si­tive hash table. The hash table up­date is com­pu­ta­tion­ally ex­pen­sive, and SLIDE uses sev­eral mechanisms to make it less costly, such as up­dat­ing hash ta­bles less fre­quently later in the train­ing pro­cess since gra­di­ents are likely to change less then. Due to the spar­sity, the gra­di­ents for differ­ent in­puts are of­ten chang­ing differ­ent neu­rons, and so SLIDE asyn­chronously par­allelizes gra­di­ent up­dates with­out wor­ry­ing about race con­di­tions, al­low­ing for much bet­ter scal­ing with ad­di­tional cores.

The pa­per eval­u­ates SLIDE on large multi-la­bel clas­sifi­ca­tion tasks, which must run on neu­ral net­works with ex­tremely wide fi­nal lay­ers. It finds that the CPUs run­ning SLIDE are 1.8 times faster in clock-time than the GPU on the Deli­cious 200k dataset, and 2.7 times faster than the GPU on the Ama­zon-670K dataset, with an ad­di­tional ~1.3x speed-up af­ter perform­ing cache op­ti­miza­tion on SLIDE. Scal­a­bil­ity tests sug­gest that the SLIDE CPUs beat GPU perfor­mance even when us­ing only 8 cores. The pa­per claims that SLIDE’s com­pu­ta­tional benefits come be­cause the num­ber of neu­rons sam­pled in the wide fi­nal layer is ex­tremely small—fewer than 0.5% of ac­tive neu­rons.

Asya’s opinion: The tasks they test on are extremely sparse: since there are hundreds of thousands of possible labels, even if you take the top ~thousand predictions in the final layer (which corresponds to most of the computation), that’s only 1% of the total number of predictions, saving you 99% of the arithmetic you would have had to do. The input features are also very sparse: in both datasets, less than 0.06% (yes, percent) of features are non-zero. It’s cool that under such conditions you can design an algorithm that is ~an order of magnitude better on cost, but it’s not going to be “the death of NVIDIA” or anything like that: without further optimizations, SLIDE will be worse than regular TensorFlow on GPU for something like ImageNet.

I’m also not sure I agree with the ‘the­sis’ of the pa­per that smart al­gorithms beat hard­ware ac­cel­er­a­tion—it seems to me like there are large gains from in­vest­ing in the com­bi­na­tion of the two. Even if GPUs aren’t op­ti­mized to run SLIDE, I can imag­ine spe­cial­ized hard­ware op­ti­mized for SLIDE cre­at­ing even big­ger perfor­mance gains.

Lin­ear Mode Con­nec­tivity and the Lot­tery Ticket Hy­poth­e­sis (Jonathan Fran­kle et al) (sum­ma­rized by Flo): In­sta­bil­ity anal­y­sis looks at how sen­si­tive neu­ral net­work train­ing is to noise in SGD. A net­work is called sta­ble if the test er­ror re­mains ap­prox­i­mately con­stant along the line con­nect­ing net­work weights ob­tained by train­ing on differ­ently or­dered data.
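A stability check along these lines can be sketched as follows. The names are hypothetical, a toy error surface stands in for test error, and summarizing the path by "maximum error minus the mean of the two endpoint errors" is an assumption about how to turn "approximately constant" into a single number.

```python
def interpolate(w_a, w_b, alpha):
    """Point on the straight line between two weight vectors."""
    return [(1 - alpha) * a + alpha * b for a, b in zip(w_a, w_b)]


def error_barrier(error_fn, w_a, w_b, steps=11):
    """Largest rise in error along the path, relative to the endpoint average."""
    alphas = [i / (steps - 1) for i in range(steps)]
    path_errors = [error_fn(interpolate(w_a, w_b, alpha)) for alpha in alphas]
    endpoint_mean = (path_errors[0] + path_errors[-1]) / 2
    return max(path_errors) - endpoint_mean


def toy_error(w):
    # Stand-in for test error: two minima at w[0] = -1 and w[0] = +1,
    # separated by a bump of height 1 at w[0] = 0.
    return (w[0] ** 2 - 1) ** 2 + w[1] ** 2


# Two "training runs" that ended in different minima: unstable.
unstable_barrier = error_barrier(toy_error, [-1.0, 0.0], [1.0, 0.0])

# Two runs that ended at the same point: trivially stable.
stable_barrier = error_barrier(toy_error, [1.0, 0.0], [1.0, 0.0])
```

In the toy example the first pair has a barrier of 1 and the second a barrier of 0; a network would count as stable when the barrier stays within the noise of repeated evaluations.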

The authors find that most popular networks in image classification are unstable at initialization for more challenging tasks but become stable long before convergence. They also find that winning tickets (AN #77) found by iterative magnitude pruning are usually stable, while unstable subnetworks don’t manage to match the original network’s performance after training. Like the original network, pruned subnetworks become more stable when they are initialized with weights from later stages of the training process. This is consistent with previous results showing that resetting subnetwork weights to states in early training leads to increased performance after retraining, compared to resetting to the initial state. While stability seems to correspond to better accuracy for subnetworks, very sparse subnetworks perform worse than the unpruned network, even if they are stable.

Flo’s opinion: The cor­re­spon­dence be­tween sub­net­work sta­bil­ity and perfor­mance af­ter re­train­ing might just be an arte­fact of both (some­what ob­vi­ously) im­prov­ing with more train­ing. What is in­ter­est­ing is that small amounts of train­ing seem to have dis­pro­por­tionate effects for both fac­tors, al­though one should keep in mind that the same is true for the loss, at least in ab­solute terms.

# NEWS

Ca­reers at the Joint AI Cen­ter (sum­ma­rized by Ro­hin) (H/​T Jon Ro­driguez): The Joint AI Cen­ter is search­ing for ML ex­perts for a va­ri­ety of roles.

Ro­hin’s opinion: You might be won­der­ing why I’ve in­cluded these jobs in the newslet­ter, given that I don’t do very many pro­mo­tions. I think that it is rea­son­ably likely that the US gov­ern­ment (and the mil­i­tary in par­tic­u­lar) will be a key player in the fu­ture of AI, and that there could be a lot to learn from their test­ing, eval­u­a­tion, val­i­da­tion & ver­ifi­ca­tion (TEV&V) frame­work (which of­ten seems more risk-averse to me than many al­ign­ment schemes are). As a re­sult, I would be ex­cited if read­ers of this newslet­ter in­ter­ested in how the mil­i­tary thinks about AI filled these po­si­tions: it seems great to have a flow of ideas be­tween the two com­mu­ni­ties (so that the gov­ern­ment learns about al­ign­ment con­cerns, and so that we learn about TEV&V).

#### FEEDBACK

I’m always happy to hear feed­back; you can send it to me, Ro­hin Shah, by re­ply­ing to this email.

#### PODCAST

An au­dio pod­cast ver­sion of the Align­ment Newslet­ter is available. This pod­cast is an au­dio ver­sion of the newslet­ter, recorded by Robert Miles.

• Is SIDLE not also a perfectly fine word? I don’t know how this went through peer re­view.

Any­how, good newslet­ter this week, thanks :)