Deep learning—deeper flaws?

Link post

In this post I summarise four lines of argument for why we should be skeptical about the potential of deep learning in its current form. I am fairly confident that the next breakthroughs in AI will come from some variety of neural network, but I think several of the objections below are quite a long way from being overcome.

Theoretical Impediments to Machine Learning With Seven Sparks from the Causal Revolution—Pearl, 2018

Pearl describes three levels at which you can make inferences: association, intervention, and counterfactual. The first is statistical, identifying correlations—this is the level at which deep learning operates. The intervention level is about changes to the present or future—it answers questions like “What will happen if I do y?” The counterfactual level answers questions like “What would have happened if y had occurred?” Each successive level is strictly more powerful than the previous one: you can’t figure out what the effects of an action will be at the association level alone, without a causal model, since we treat actions as interventions which override existing causes. Unfortunately, current machine learning systems are largely model-free.
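To make the gap between the first two levels concrete, here is a small simulation (my own toy example, not Pearl's): a hidden confounder Z drives both X and Y, so purely associational statistics show a strong correlation between X and Y, while intervening on X reveals that it has no effect on Y at all.

```python
# Toy illustration of association vs. intervention (not from Pearl's paper).
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Observational regime: Z causes both X and Y; X has no effect on Y.
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y = z + rng.normal(size=n)
print("observational corr(X, Y):", round(np.corrcoef(x, y)[0, 1], 3))   # clearly nonzero

# Interventional regime: do(X = x0) overrides X's usual cause Z.
x_do = rng.normal(size=n)        # X is set by the experimenter, independently of Z
y_do = z + rng.normal(size=n)    # Y's mechanism is unchanged
print("interventional corr(X, Y):", round(np.corrcoef(x_do, y_do)[0, 1], 3))   # near zero
```

A purely associational learner trained on the observational data would happily use X to predict Y, and would then be badly wrong about what happens when X is actually manipulated.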

Causal assumptions and conclusions can be encoded in the form of graphical models, where a directed arrow between two nodes represents a causal influence. Constraints on the structure of a graph can be determined by seeing which pairs of variables are independent when controlling for which other variables: sometimes controlling removes dependencies, but sometimes it introduces them. Pearl’s main claim is that this sort of model-driven causal analysis is an essential step towards building human-level reasoning capabilities. He identifies several important concepts—such as counterfactuals, confounding, causation, and incomplete or biased data—which his framework is able to reason about, but which current approaches to ML cannot deal with.
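The claim that controlling for a variable can introduce a dependency is easy to check numerically. Below is a quick sketch (again my own example, not from the paper) of a collider structure X -> C <- Y: X and Y are independent overall, but become negatively correlated once we restrict attention to a slice of C.

```python
# Conditioning on a collider induces a dependence between its independent causes.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(size=n)
y = rng.normal(size=n)
c = x + y + 0.1 * rng.normal(size=n)   # collider: caused by both X and Y

print("corr(X, Y) overall:      ", round(np.corrcoef(x, y)[0, 1], 3))                # ~0
mask = c > 1.0                          # a crude way of "controlling for" C
print("corr(X, Y) given C > 1.0:", round(np.corrcoef(x[mask], y[mask])[0, 1], 3))    # negative
```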

Deep Learning: A Critical Appraisal—Marcus, 2018

Marcus identifies ten limitations of current deep learning systems, and argues that the whole field may be about to hit a wall. According to him, deep learning:

  1. Is data hungry—it can’t learn abstractions through explicit verbal definition like humans can, but instead requires thousands of examples.

  2. Is shallow, with limited capacity for transfer. If a task is perturbed even in minor ways, deep learning breaks, demonstrating that it’s not really learning the underlying concepts. Adversarial examples showcase this effect.

  3. Has no natural way to deal with hierarchical structure. Even recursive neural networks require fixed sentence trees to be precomputed. See my summary of ‘Generalisation without systematicity’ below.

  4. Struggles with open-ended inference, especially based on real-world knowledge.

  5. Isn’t transparent, and remains essentially a “black box”.

  6. Is not well-integrated with prior knowledge. We can’t encode our understanding of physics into a neural network, for example.

  7. Cannot distinguish causation from correlation—see my summary of Pearl’s paper above.

  8. Presumes a largely stable world, like a game, instead of one like our own in which there are large-scale changes.

  9. Is vulnerable to adversarial examples, which can be constructed quite easily—see the sketch after this list.

  10. Isn’t robust as a long-term engineering solution, especially on novel data.
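To illustrate how easily adversarial examples can be constructed (points 2 and 9), here is a minimal sketch of the fast gradient sign method in PyTorch. It assumes an already-trained classifier `model` and a batched, correctly classified input `x` with label `y`; none of these come from Marcus’ paper, and the step size is arbitrary.

```python
# Minimal FGSM sketch: perturb x in the direction that increases the classifier's loss.
import torch
import torch.nn.functional as F

def fgsm_perturb(model, x, y, epsilon=0.1):
    """Return an adversarially perturbed copy of x (fast gradient sign method)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # One signed gradient step, clamped back to the valid pixel range.
    return (x_adv + epsilon * x_adv.grad.sign()).clamp(0.0, 1.0).detach()
```

Even with a tiny epsilon, perturbations like this are often enough to flip the predicted class while leaving the image visually unchanged to a human.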

Some of these problems seem like they can be overcome without novel insights, given enough engineering effort and compute, but others are more fundamental. One interpretation: deep learning can interpolate within the training space, but can’t extrapolate outside it, even in ways which seem natural to humans. One of Marcus’ examples: when a neural network is trained to learn the identity function on even numbers, it rounds down on odd numbers. In this trivial case we could fix the problem by adding odd training examples or by manually adjusting some weights, but in general, when there are many features, both approaches may be prohibitively difficult even for conceptually simple adjustments. To address this and other problems, Marcus offers three alternatives to deep learning as currently practiced:
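The even/odd failure is easy to reproduce in miniature. The sketch below (my own architecture and hyperparameters, not Marcus’) trains a small MLP to copy 8-bit binary strings, but only ever shows it even numbers; since the lowest bit is always zero in training, the network has no reason to learn to copy it, and exact-match accuracy on odd numbers collapses.

```python
# Toy reconstruction of the identity-function extrapolation failure.
import torch
import torch.nn as nn

def to_bits(n, width=8):
    # Least-significant bit first.
    return torch.tensor([(n >> i) & 1 for i in range(width)], dtype=torch.float32)

train_x = torch.stack([to_bits(n) for n in range(0, 256, 2)])   # even numbers only
test_x  = torch.stack([to_bits(n) for n in range(1, 256, 2)])   # odd numbers

model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 8))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

for _ in range(2000):
    opt.zero_grad()
    loss = loss_fn(model(train_x), train_x)   # the target is the input itself
    loss.backward()
    opt.step()

def exact_match(x):
    # Correct only if every bit is reproduced exactly.
    return ((model(x) > 0.5).float() == x).all(dim=1).float().mean().item()

print(f"train (even) accuracy: {exact_match(train_x):.2f}")
print(f"test (odd) accuracy:   {exact_match(test_x):.2f}")   # typically close to zero
```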

  1. Unsupervised learning, so that systems can constantly improve—for example by predicting the next time-step and updating afterwards, or by setting themselves challenges and learning from attempting them.

  2. Further development of symbolic AI. While this has in the past proved brittle, the idea of integrating symbolic representations into neural networks has great promise.

  3. Drawing inspiration from humans, in particular from cognitive and developmental psychology, from how we develop commonsense knowledge, and from our understanding of narrative.

Generalisation without systematicity—Lake and Baroni, 2018

Lake and Baroni observe that human language and thought feature “systematic compositionality”: we are able to combine known components in novel ways to produce arbitrarily many new ideas. To test neural networks on this, they introduce SCAN, a language consisting of commands such as “jump around left twice and walk opposite right thrice”. While they found that RNNs were able to generalise well on new strings similar in form to previous strings, performance dropped sharply in other cases. For example, the best result dropped from 99.9% to 20.8% when the test examples were longer than any training example, even though they were constructed using the same compositional rules. Also, when a command such as “jump” had only been seen by itself in training, RNNs were almost entirely incapable of understanding instructions such as “turn right and jump”. The overall conclusion: neural networks don’t extract systematic rules from training data, and so can’t generalise compositionally anything like as well as humans can. This is similar to the result of a project I recently carried out, in which I found that capsule networks trained to recognise transformed inputs, such as rotated digits and colour-negated digits, still couldn’t recognise rotated, colour-negated digits: they were simply not learning general rules which could be composed together.
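For a sense of how the harder splits are constructed, here is a toy version (my own simplified grammar, not the actual SCAN dataset) of the “add primitive” experiment: the model sees “jump” only in isolation during training, and is then tested on compositions of “jump” with modifiers it has only ever seen attached to other verbs.

```python
# Toy "add primitive" train/test split in the style of SCAN (simplified grammar).
from itertools import product

primitives = ["walk", "run", "look", "jump"]
modifiers = ["", " twice", " thrice", " left", " right", " opposite left"]

all_commands = [p + m for p, m in product(primitives, modifiers)]

# Training: everything except composed 'jump' commands (bare "jump" is kept).
train = [c for c in all_commands if not (c.startswith("jump") and c != "jump")]
# Test: only the held-out compositions of 'jump'.
test = [c for c in all_commands if c.startswith("jump") and c != "jump"]

print(len(train), "training commands, e.g.", train[:3])
print(len(test), "held-out compositions of 'jump', e.g.", test[:3])
```

A learner with systematic rules would treat “jump twice” exactly like “walk twice” with the verb swapped; Lake and Baroni’s result is that RNNs trained on splits like this mostly do not.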

Deep reinforcement learning doesn’t work yet—Irpan, 2018

Irpan runs through a number of reasons to be skeptical about using deep learning for RL problems. For one thing, deep RL is still very data-inefficient: DeepMind’s Rainbow DQN takes around 83 hours of gameplay to reach human-level performance on an Atari game. By contrast, humans can pick one up within a minute or two. He also points out that other RL methods often work better than deep RL, particularly model-based ones which can utilise domain-specific knowledge.

Another issue with RL in general is that designing reward functions is difficult. This is a theme in AI safety—specifically when it comes to reward functions which encapsulate human values—but there are plenty of existing examples of reward hacking on much simpler tasks. One important consideration is the tradeoff between shaped and sparse rewards. Sparse rewards only occur at the goal state, and so can be fairly precise, but are usually too difficult to reach directly. Shaped rewards give positive feedback more frequently, but are easier to hack. And even when shaped rewards are designed carefully, RL agents often get stuck in local optima. This is particularly prevalent in multi-agent systems, where each agent can overfit to the behaviour of the others.
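As a concrete (and entirely made-up) illustration of that tradeoff, here are the two styles of reward for a 10×10 gridworld where the agent must reach the corner cell: the sparse version says exactly what we want but gives no signal until the goal is reached, while the shaped version gives signal everywhere but can be partially satisfied by an agent that merely loiters near the goal.

```python
# Sparse vs. shaped reward for a toy gridworld (illustrative only).
import numpy as np

GOAL = np.array([9, 9])

def sparse_reward(pos):
    # +1 only in the goal state; zero everywhere else. Precise, but hard to reach by exploration.
    return 1.0 if np.array_equal(pos, GOAL) else 0.0

def shaped_reward(pos):
    # Dense feedback: reward increases as the agent gets closer to the goal. Easier to learn from,
    # but also easier to exploit without actually completing the task.
    return -float(np.linalg.norm(pos - GOAL))

print(sparse_reward(np.array([3, 4])), shaped_reward(np.array([3, 4])))
print(sparse_reward(GOAL), shaped_reward(GOAL))
```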

Lastly, RL is unstable in a way that supervised learning isn’t. Even successful implementations often fail to find a decent solution 20 or 30% of the time, depending on the random seed with which they are initialised. In fact, there are very few real-world success stories featuring RL. Yet achieving superhuman performance on a wide range of tasks is a matter of when, not if, and so I think Amara’s law applies: we overestimate the effects RL will have in the short run, but underestimate its effects in the long run.