Brains and backprop: a key timeline crux

[Cross­posted from my blog]

The Se­cret Sauce Question

Hu­man brains still out­perform deep learn­ing al­gorithms in a wide va­ri­ety of tasks, such as play­ing soc­cer or know­ing that it’s a bad idea to drive off a cliff with­out hav­ing to try first (for more for­mal ex­am­ples, see Lake et al., 2017; Hin­ton, 2017; LeCun, 2018; Ir­pan, 2018). This fact can be taken as ev­i­dence for two differ­ent hy­pothe­ses:

  1. In or­der to de­velop hu­man-level AI, we have to de­velop en­tirely new learn­ing al­gorithms. At the mo­ment, AI is a deep con­cep­tual prob­lem.

  2. In or­der to de­velop hu­man-level AI, we ba­si­cally just have to im­prove cur­rent deep learn­ing al­gorithms (and their hard­ware) a lot. At the mo­ment, AI is an en­g­ineer­ing prob­lem.

The ques­tion of which of these views is right I call “the se­cret sauce ques­tion”.

The se­cret sauce ques­tion seems like one of the most im­por­tant con­sid­er­a­tions in es­ti­mat­ing how long there is left un­til the de­vel­op­ment of hu­man-level ar­tifi­cial in­tel­li­gence (“timelines”). If some­thing like 2) is true, timelines are ar­guably sub­stan­tially shorter than if some­thing like 1) is true [1].

However, it initially seems difficult to arbitrate between these two vague, high-level views. It appears as though an answer requires complicated inside views stemming from deep and wide knowledge of current technical AI research. This is partly true. Yet this post proposes that there might also be a single, concrete discovery capable of settling the secret sauce question: does the human brain learn using gradient descent, by implementing backpropagation?

The im­por­tance of backpropagation

Underlying the success of modern deep learning is a single algorithm: gradient descent with backpropagation of error (LeCun et al., 2015). In fact, the majority of research is not focused on finding better algorithms, but rather on finding better cost functions to descend using this algorithm (Marblestone et al., 2016). Yet, in stark contrast to this success, since the 1980s the key objection of neuroscientists to deep learning has been that backpropagation is not biologically plausible (Crick, 1989; Stork, 1989).

As a result, the question of whether the brain implements backpropagation provides critical evidence on the secret sauce question. If the brain does not use it, and still outperforms deep learning while running on the energy of a laptop and training on several orders of magnitude fewer training examples than parameters, this suggests that a deep conceptual advance is necessary to build human-level artificial intelligence. There’s some other remarkable algorithm out there, and evolution found it. But if the brain does use backprop, then the reason deep learning works so well is that it’s somehow on the right track. Human researchers and evolution converged on a common solution to the problem of optimising large networks of neuron-like units. (These arguments assume that if a solution is biologically plausible and the best solution available, then it would have evolved.)

Ac­tu­ally, the situ­a­tion is a bit more nu­anced than this, and I think it can be clar­ified by dis­t­in­guish­ing be­tween al­gorithms that are:

Biolog­i­cally ac­tual: What the brain ac­tu­ally does.

Biolog­i­cally plau­si­ble: What the brain might have done, while still be­ing re­stricted by evolu­tion­ary se­lec­tion pres­sure to­wards en­ergy effi­ciency etc.

For ex­am­ple, hu­mans walk with legs, but it seems pos­si­ble that evolu­tion might have given us wings or fins in­stead, as those solu­tions work for other an­i­mals. How­ever, evolu­tion could not have given us wheels, as that re­quires a sep­a­rable axle and wheel, and it’s un­clear what an evolu­tion­ary path to an or­ganism with two sep­a­rable parts looks like (ex­clud­ing sym­biotic re­la­tion­ships).

Biolog­i­cally pos­si­ble: What is tech­ni­cally pos­si­ble to do with col­lec­tions of cells, re­gard­less of its rel­a­tive evolu­tion­ary ad­van­tage.

For ex­am­ple, even though evolv­ing wheels is im­plau­si­ble, there might be no in­her­ent prob­lem with an or­ganism hav­ing wheels (cre­ated by “God”, say), in the way in which there’s an in­her­ent prob­lem with an or­ganism’s ax­ons send­ing ac­tion po­ten­tials faster than the speed of light.

I think this leads to the fol­low­ing con­clu­sions:

Na­ture of back­prop: Im­pli­ca­tion for timelines

Biolog­i­cally im­pos­si­ble: Un­clear, there might be mul­ti­ple “se­cret sauces”

Biolog­i­cally pos­si­ble, but not plau­si­ble: Same as above

Biolog­i­cally plau­si­ble, but not ac­tual: Timelines are long, there’s likely a “se­cret sauce”

Biolog­i­cally ac­tual: Timelines are short, there’s likely no “se­cret sauce”

In cases where evolu­tion could not in­vent back­prop any­way, it’s hard to com­pare things. That is con­sis­tent both with back­prop not be­ing the right way to go and with it be­ing bet­ter than what­ever evolu­tion did.

It might be objected that this question doesn’t really matter, since if neuroscientists found out that the brain does backprop, they would not thereby have created any new algorithm, but merely given stronger evidence for the workability of previous algorithms. Deep learning researchers wouldn’t find this any more useful than Usain Bolt would find it useful to know that his starting stance during the sprint countdown is optimal: he’s been using it for years anyway, and is mostly just eager to go back to the gym.

How­ever, this ar­gu­ment seems mis­taken.

On the one hand, just because it’s not useful to deep learning practitioners does not mean it’s not useful to others trying to estimate the timelines of technological development (such as policy-makers or charitable foundations).

On the other hand, I think this knowl­edge is very prac­ti­cally use­ful for deep learn­ing prac­ti­tion­ers. Ac­cord­ing to my cur­rent mod­els, the field seems unique in com­bin­ing the fol­low­ing fea­tures:

  • Long iter­a­tion loops (on the or­der of GPU-weeks to GPU-years) for test­ing new ideas.

  • High de­pen­dence of perfor­mance on hy­per­pa­ram­e­ters, such that the right al­gorithm with slightly off hy­per­pa­ram­e­ters will not work at all.

  • High de­pen­dence of perfor­mance on the amount of com­pute ac­cessible, such that the differ­ences be­tween enough and al­most enough are step-like, or qual­i­ta­tive rather than quan­ti­ta­tive. Too lit­tle com­pute and the al­gorithm just doesn’t work at all.

  • Lack of a unified set of first principles for understanding the problems, and instead a collection of effective heuristics.

This is an environment where it is critically important to develop strong priors on what should work, and to stick with those in the face of countless fruitless tests. Indeed, LeCun, Hinton and Bengio seem to have persevered for decades before the AI community stopped thinking they were crazy. (This is similar in some interesting ways to the state of astronomy and physics before Newton. I’ve blogged about this before here.) There’s an asymmetry such that even though training a very powerful architecture can be quick (on the order of a GPU-day), iterating over architectures to figure out which ones to train fully in the first place can be incredibly costly. As such, knowing whether gradient descent with backprop is or is not the way to go would enable more efficient allocation of research time (though mostly so in case backprop is not the way to go, as the majority of current researchers assume it anyway).

Ap­pendix: Brief the­o­ret­i­cal background

This sec­tion de­scribes what back­prop­a­ga­tion is, why neu­ro­scien­tists have claimed it is im­plau­si­ble, and why some deep learn­ing re­searchers think those neu­ro­scien­tists are wrong. The lat­ter ar­gu­ments are ba­si­cally sum­marised from this talk by Hin­ton.

Multi-layer networks with access to an error signal face the so-called “credit assignment problem”. The error of the computation will only be available at the output: a child pronouncing a word erroneously, a rodent tasting an unexpectedly nauseating liquid, a monkey mistaking a stick for a snake. However, in order for the network to improve its representations and avoid making the same mistake in the future, it has to know which representations to “blame” for the mistake. Is the monkey too prone to think long things are snakes? Or is it bad at discriminating the textures of wood and skin? Or is it bad at telling eyes from eye-sized bumps? And so forth. This problem is exacerbated by the fact that neural network models often have tens or hundreds of thousands of parameters, not to mention the human brain, which is estimated to have on the order of 10^14 synapses. Backpropagation proposes to solve this problem by observing that the maths of gradient descent work out such that one can essentially send the error signal from the output, back through the network towards the input, modulating it by the strength of the connections along the way. (A complementary perspective on backprop is that it is just an efficient way of computing derivatives in large computational graphs; see e.g. Olah, 2015.)
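To make the "send the error back, modulated by connection strength" idea concrete, here is a minimal sketch in plain Python. The network shape (three inputs, four tanh hidden units, one linear output), the squared-error loss and all function names are assumptions chosen for illustration, not anything taken from the cited papers. The finite-difference check at the end is only there to confirm that the backward pass really computes the gradient.

```python
import math
import random

def forward(w1, w2, x):
    """Two-layer network: tanh hidden layer, linear scalar output."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    y = sum(w * hj for w, hj in zip(w2, h))
    return h, y

def loss(w1, w2, x, target):
    """Squared error, observed only at the output."""
    _, y = forward(w1, w2, x)
    return 0.5 * (y - target) ** 2

def backprop(w1, w2, x, target):
    """Return (grad_w1, grad_w2) of the squared error via backpropagation."""
    h, y = forward(w1, w2, x)
    err = y - target
    # Output weights: error times the hidden activity feeding each weight.
    grad_w2 = [err * hj for hj in h]
    # Credit assignment: the error reaches hidden unit j scaled by the
    # forward weight w2[j] and by the local slope tanh'(.) = 1 - h^2.
    delta_h = [err * w2[j] * (1.0 - h[j] ** 2) for j in range(len(h))]
    grad_w1 = [[d * xi for xi in x] for d in delta_h]
    return grad_w1, grad_w2

def numerical_grad_w1(w1, w2, x, target, eps=1e-6):
    """Central finite differences, used only to check backprop."""
    grads = []
    for i in range(len(w1)):
        row = []
        for k in range(len(w1[i])):
            saved = w1[i][k]
            w1[i][k] = saved + eps
            up = loss(w1, w2, x, target)
            w1[i][k] = saved - eps
            down = loss(w1, w2, x, target)
            w1[i][k] = saved
            row.append((up - down) / (2 * eps))
        grads.append(row)
    return grads

rng = random.Random(0)
w1 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
w2 = [rng.uniform(-1, 1) for _ in range(4)]
x, target = [0.5, -0.2, 0.1], 1.0

grad_w1, grad_w2 = backprop(w1, w2, x, target)
check = numerical_grad_w1(w1, w2, x, target)
max_diff = max(abs(a - b) for ra, rb in zip(grad_w1, check) for a, b in zip(ra, rb))
```

The line computing `delta_h` is the part neuroscientists object to: each hidden unit needs to know the forward weight `w2[j]` of its own outgoing connection in order to receive its share of the blame.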

Now why do some neu­ro­scien­tists have a prob­lem with this?

Ob­jec­tion 1:

Most learn­ing in the brain is un­su­per­vised, with­out any er­ror sig­nal similar to those used in su­per­vised learn­ing.

Hin­ton’s re­ply:

There are at least three ways of do­ing back­prop­a­ga­tion with­out an ex­ter­nal su­per­vi­sion sig­nal:

1. Try to re­con­struct the origi­nal in­put (us­ing e.g. auto-en­coders), and thereby de­velop rep­re­sen­ta­tions sen­si­tive to the statis­tics of the in­put do­main

2. Use the broader con­text of the in­put to train lo­cal features

For ex­am­ple, in the sen­tence “She scromed him with the fry­ing pan”, we can in­fer that the sen­tence as a whole doesn’t sound very pleas­ant, and use that to up­date our rep­re­sen­ta­tion of the novel word “scrom”

3. Learn a gen­er­a­tive model that as­signs high prob­a­bil­ity to the in­put (e.g. us­ing vari­a­tional auto-en­coders or the wake-sleep al­gorithm from the 1990’s)

Ben­gio and col­leagues (2017) have also done in­ter­est­ing work on this, partly re­viv­ing en­ergy-min­imis­ing Hopfield net­works from the 1980’s

Ob­jec­tion 2:

Neurons communicate using binary spikes, rather than real values (this was among the earliest objections to backprop).

Hin­ton’s re­ply:

First, one can just send spikes stochastically and use the expected spike rate (e.g. with a Poisson rate, which is somewhat close to what real neurons do, although there are important differences; see e.g. Ma et al., 2006; Pouget et al., 2003).
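As a rough illustration of how binary spikes can carry a real value, the sketch below emits one spike per time step with a fixed probability and recovers the value by averaging. Drawing an independent Bernoulli spike per time bin is a simplifying assumption standing in for a Poisson process, and the function names are hypothetical.

```python
import random

def spike_train(rate, n_steps, rng):
    """Emit a binary spike in each time step with probability `rate` (0 to 1)."""
    return [1 if rng.random() < rate else 0 for _ in range(n_steps)]

def estimated_rate(spikes):
    """A downstream reader recovers the real value by averaging the spikes."""
    return sum(spikes) / len(spikes)

rng = random.Random(42)
true_rate = 0.3
spikes = spike_train(true_rate, 10_000, rng)
estimate = estimated_rate(spikes)
```

Each individual signal is binary, yet the average over many time steps approaches the underlying real value, with noise that shrinks as the window grows.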

Se­cond, this might make evolu­tion­ary sense, as the stochas­tic­ity acts as a reg­u­laris­ing mechanism mak­ing the net­work more ro­bust to overfit­ting. This be­havi­our is in fact where Hin­ton got the idea for the drop-out al­gorithm (which has been very pop­u­lar, though it re­cently seems to have been largely re­placed by batch nor­mal­i­sa­tion).
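For readers unfamiliar with dropout, a minimal sketch follows. It uses the "inverted" variant, which rescales the surviving units during training; that choice, the keep probability of 0.5 and the activation values are all assumptions for illustration, and may differ from the original formulation.

```python
import random

def dropout(activations, keep_prob, rng):
    """Inverted dropout: silence each unit with probability 1 - keep_prob
    and rescale the survivors so the expected activation is unchanged."""
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(0)
acts = [0.2, 1.0, -0.5, 0.8]
n_trials = 20_000

# Average many noisy passes to show the expectation matches the clean values.
mean = [0.0] * len(acts)
for _ in range(n_trials):
    for i, v in enumerate(dropout(acts, 0.5, rng)):
        mean[i] += v / n_trials
```

Any single pass is noisy, much like a stochastically spiking neuron, but the average over passes stays close to the original activations.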

Ob­jec­tion 3:

Single neurons cannot represent two distinct kinds of quantities, as would be required to do backprop (the presence of features and gradients for training).

Hin­ton’s re­ply:

This is in fact pos­si­ble. One can use the tem­po­ral deriva­tive of the neu­ronal ac­tivity to rep­re­sent gra­di­ents.

(There is interesting neuropsychological evidence supporting the idea that the temporal derivative of a neuron cannot be used to represent changes in that feature, and that different populations of neurons are required to represent the presence and the change of a feature. Patients with certain brain damage seem able to recognise that a moving car occupies different locations at two points in time, without ever being able to detect the car changing position.)

Ob­jec­tion 4:

Cor­ti­cal con­nec­tions only trans­mit in­for­ma­tion in one di­rec­tion (from soma to synapse), and the kinds of back­pro­jec­tions that ex­ist are far from the perfectly sym­met­ric ones used for back­prop.

Hin­ton’s re­ply:

This objection led Hinton to abandon the idea that the brain could do backpropagation for a decade, until “a miracle appeared”. Lillicrap and colleagues at DeepMind (2016) found that a network propagating gradients back through random and fixed feedback weights in the hidden layer can match the performance of one using ordinary backprop, given a mechanism for normalisation and under the assumption that the weights preserve the sign of the gradients. This is a remarkable and surprising result, and indicates that backprop is still poorly understood. (See also follow-up work by Liao et al., 2016.)
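The idea of sending the error back through fixed random weights can be sketched in a few lines of plain Python. This is a toy illustration only, not the setup of the cited papers: the regression task, the network size, the learning rate and every name below are assumptions, and the sketch enforces neither normalisation nor any sign constraint. Its purpose is to show where the feedback weights enter the update.

```python
import math
import random

def forward(w1, w2, x):
    """tanh hidden layer followed by a linear scalar output."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    y = sum(w * hj for w, hj in zip(w2, h))
    return h, y

def hidden_delta(feedback, err, h):
    """Error delivered to each hidden unit through the `feedback` weights.
    Passing feedback = w2 gives ordinary backprop; a fixed random vector
    gives the random-feedback variant."""
    return [err * feedback[j] * (1.0 - h[j] * h[j]) for j in range(len(h))]

def train(samples, feedback, lr=0.05, epochs=300, seed=0):
    """Per-sample updates; returns the mean loss of each epoch."""
    rng = random.Random(seed)
    n_hidden, n_in = len(feedback), len(samples[0][0])
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
    w2 = [rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
    history = []
    for _ in range(epochs):
        total = 0.0
        for x, target in samples:
            h, y = forward(w1, w2, x)
            err = y - target
            total += 0.5 * err * err
            delta = hidden_delta(feedback, err, h)  # uses feedback, never w2
            for j in range(n_hidden):
                w2[j] -= lr * err * h[j]            # output layer: exact gradient
                for k in range(n_in):
                    w1[j][k] -= lr * delta[j] * x[k]
        history.append(total / len(samples))
    return history

# Hypothetical toy task: target = 0.5 * x1 - 0.3 * x2 on a 5 x 5 grid.
grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
samples = [([a, b], 0.5 * a - 0.3 * b) for a in grid for b in grid]

rng = random.Random(1)
feedback = [rng.uniform(-0.5, 0.5) for _ in range(4)]  # random, fixed for all of training
history = train(samples, feedback)
print(f"mean loss: first epoch {history[0]:.4f}, last epoch {history[-1]:.4f}")
```

The only difference from ordinary backprop is the argument passed to `hidden_delta`. On a problem this small one would expect the loss to fall anyway, because the forward weights tend to drift into rough agreement with the fixed feedback weights, but this sketch is not evidence about how the method behaves at scale.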

[1] One possible argument for this is that in a larger number of plausible worlds, if 1) is true and conceptual advances are necessary, then building superintelligence will turn into an engineering problem once those advances have been made. Hence 1) requires strictly more resources than 2).

Dis­cus­sion questions

I’d en­courage dis­cus­sion on:

Whether the brain does back­prop (ob­ject-level dis­cus­sion on the work of Lillicrap, Hin­ton, Ben­gio, Liao and oth­ers)?

Whether it’s ac­tu­ally im­por­tant for the se­cret sauce ques­tion to know whether the brain does back­prop?

To keep things fo­cused and man­age­able, it seems rea­son­able to dis­en­courage dis­cus­sion of what other se­cret sauces there might be.