Gary Marcus vs Cortical Uniformity

Background / context

I wrote about cortical uniformity last year in Human Instincts, Symbol Grounding, and the Blank Slate Neocortex. (Other lesswrong discussion includes Alex Zhu recently and Jacob Cannell in 2015.) Here was my description (lightly edited, and omitting several footnotes that were in the original):

Instead of saying that the human brain has a vision processing algorithm, motor control algorithm, language algorithm, planning algorithm, and so on, in "Common Cortical Algorithm" (CCA) theory we say that (to a first approximation) we have a massive amount of "general-purpose neocortical tissue", and if you dump visual information into that tissue, it does visual processing, and if you connect that tissue to motor control pathways, it does motor control, etc.

CCA theory, as I'm using the term, is a simplified model. There are almost definitely a couple of caveats to it:

  1. There are sorta "hyperparameters" on the generic learning algorithm which seem to be set differently in different parts of the neocortex. For example, some areas of the cortex have a higher or lower density of particular neuron types. There are other examples too. I don't think this significantly undermines the usefulness or correctness of CCA theory, as long as these changes really are akin to hyperparameters, as opposed to specifying fundamentally different algorithms. So my reading of the evidence is that if you had, say, motor nerves coming out of visual cortex tissue, that tissue could do motor control, but it wouldn't do it quite as well as the motor cortex does.

  2. There is almost definitely a gross wiring diagram hardcoded in the genome (i.e., a set of connections between different neocortical regions, and between those regions and other parts of the brain). These connections later get refined and edited during learning. Again, we can ask how much the existence of this innate gross wiring diagram undermines CCA theory. How complicated is the wiring diagram? Is it millions of connections among thousands of tiny regions, or just tens of connections among a few regions? Would the brain work at all if you started with a random wiring diagram? I don't know for sure, but for various reasons, my current belief is that this initial gross wiring diagram is not carrying much of the weight of human intelligence, and thus that this point is not a significant problem for the usefulness of CCA theory. (This is a loose statement; of course it depends on what questions you're asking.) I think of it more like: if it's biologically important to learn a concept space that's built out of associations between information sources X, Y, and Z, well, you just dump those three information streams into the same part of the cortex, and then the CCA will take it from there, and it will reliably build this concept space. So once you have the CCA nailed down, it kinda feels to me like you're most of the way there....

Marcus et al.'s challenge

Now, when I was researching that post last year, I had read one book chapter opposed to cortical uniformity and another book chapter in favor of cortical uniformity, which were a good start, but I've been keeping my eye out for more on the topic. And I just found one! In 2014 Gary Marcus, Adam Marblestone, and Thomas Dean wrote a little commentary in Science Magazine called The Atoms of Neural Computation, with a case against cortical uniformity.

Out of the various things they wrote, one stands out as the most substantive and serious criticism: They throw down a gauntlet in their FAQ, with a table of 10 fundamentally different calculations that they think the neocortex does. Can one common cortical algorithm really subsume or replace all those different things?

Well, I accept the challenge!!

But first, I'd better say something about what the common cortical algorithm is and does, with the caveat that nobody knows all the details, and certainly not me. (The following paragraph is mostly influenced by reading a bunch of stuff by Dileep George & Jeff Hawkins, along with miscellaneous other books and papers that I've happened across in my totally random and incomplete neuroscience and AI self-education.)

The common cortical algorithm (according to me, and leaving out lots of aspects that aren't essential for this post) is an algorithm that builds a bunch of generative models, each of which consists of predictions that other generative models are on or off, and/or predictions that input channels (coming from outside the neocortex: vision, hunger, etc.) are on or off. ("It's symbols all the way down.") All the predictions are attached to confidence values, and both the predictions and confidence values are, in general, functions of time (or of other parameters ... again, I'm glossing over details here). The generative models are compositional, because if two of them make disjoint and/or consistent predictions, you can create a new model that simply predicts that both of those two component models are active simultaneously. For example, we can snap together a "purple" generative model and a "jar" generative model to get a "purple jar" generative model. Anyway, we explore the space of generative models, performing a search with a figure-of-merit that kinda mixes self-supervised learning, model predictive control, and Bayesian(ish) priors. Among other things, this search process involves something at least vaguely analogous to message-passing in a probabilistic graphical model.
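To make the "snap together" idea a bit more concrete, here's a minimal toy sketch in Python (my own illustration, not anything from the neuroscience literature, and with made-up feature names): a generative model is just a bag of predictions with confidence values, and composing two models means merging their predictions.

```python
# Toy sketch of compositional generative models (illustrative only).
# A "model" predicts that certain other models / input channels are
# active, each prediction tagged with a confidence in [0, 1].

from dataclasses import dataclass, field

@dataclass
class GenerativeModel:
    name: str
    predictions: dict = field(default_factory=dict)  # feature -> confidence

    def compose(self, other: "GenerativeModel") -> "GenerativeModel":
        """Snap two models together: the composite predicts that both
        component models are active, keeping the stronger confidence
        wherever their (consistent) predictions overlap."""
        merged = dict(self.predictions)
        for feature, conf in other.predictions.items():
            merged[feature] = max(merged.get(feature, 0.0), conf)
        return GenerativeModel(f"{self.name}+{other.name}", merged)

purple = GenerativeModel("purple", {"color:purple": 0.9})
jar = GenerativeModel("jar", {"cylindrical": 0.8, "graspable": 0.7})
purple_jar = purple.compose(jar)
print(purple_jar.name, purple_jar.predictions)
```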

OK, now let's dive into the Marcus et al. FAQ list:

  • Marcus et al.'s computation 1: "Rapid perceptual classification", potentially involving "Receptive fields, pooling and local contrast normalization" in the "Visual system"

I think that "rapid perceptual classification" naturally comes out of the cortical algorithm, not only in the visual system but also everywhere else.

In terms of "rapid", it's worth noting that (1) many of the "rapid" responses that humans produce are not done by the neocortex at all, and (2) the cortical message-passing algorithm supposedly involves both faster, less-accurate neural pathways (which prime the most promising generative models) and slower, more-accurate pathways (which, for example, properly do the "explaining away" calculation).
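Here's a toy sketch of that two-pass idea (purely illustrative, with made-up scoring functions, not a model of the actual circuitry): a cheap pass primes a few candidate models, and a slower, more careful pass re-scores only those candidates.

```python
# Illustrative two-stage recognition sketch: a fast, less-accurate
# pass primes the k most promising models, and a slower, more-accurate
# pass re-scores just those primed candidates.

def recognize(observation, models, fast_score, slow_score, k=3):
    primed = sorted(models, key=lambda m: fast_score(m, observation),
                    reverse=True)[:k]           # fast pass: prime candidates
    return max(primed, key=lambda m: slow_score(m, observation))  # slow pass

# Hypothetical usage with toy scoring functions:
models = ["cat", "cart", "dog"]
fast = lambda m, o: len(set(m) & set(o))        # crude character overlap
slow = lambda m, o: int(m == o)                 # exact match only
print(recognize("cat", models, fast, slow, k=2))   # -> "cat"
```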

  • Marcus et al.'s computation 2: "Complex spatiotemporal pattern recognition", potentially involving "Bayesian belief propagation" in "Sensory hierarchies"

The message-passing algorithm I mentioned above is either Bayesian belief propagation or something approximating it. Contra Marcus et al., Bayesian belief propagation is not just for spatiotemporal pattern recognition in the traditional sense; for example, to figure out what we're looking at, the Bayesian analysis incorporates not only the spatiotemporal pattern of visual input data, but also semantic priors from our other senses and world-model. Thus if we see a word with a smudged letter in the middle, we "see" the smudge as the correct letter, even when the same smudge by itself would be ambiguous.
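As a toy illustration of that smudged-letter point (made-up numbers, and just a single Bayes update rather than full belief propagation): the pixel evidence alone is ambiguous between two letters, but the lexical prior from the surrounding word tips the posterior decisively.

```python
# Toy "smudged letter" Bayes update (illustrative numbers only).
# Reading "ju_p" where the smudge looks equally like "m" or "n":
# the visual likelihood is ambiguous, but the prior from the
# surrounding word ("jump" is a word, "junp" isn't) breaks the tie.

likelihood = {"m": 0.5, "n": 0.5}    # the smudge alone is ambiguous
prior      = {"m": 0.95, "n": 0.05}  # lexical/semantic context

unnormalized = {x: likelihood[x] * prior[x] for x in likelihood}
total = sum(unnormalized.values())
posterior = {x: round(p / total, 3) for x, p in unnormalized.items()}
print(posterior)   # -> {'m': 0.95, 'n': 0.05}: we "see" an m
```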

  • Marcus et al.'s computation 3: "Learning efficient coding of inputs", potentially involving "Sparse coding" in "Sensory and other systems"

I think that not just sensory inputs but every feedforward connection in the neocortex (most of which are neocortex-to-neocortex) has to be re-encoded into the data format that the neocortex knows what to do with: different possible forward inputs correspond to the stimulation of different sparse subsets of a pool of receiving neurons, where the sparsity is relatively uniform, and where all the receiving neurons in the pool are stimulated a similar fraction of the time (for efficient use of computational resources). Jeff Hawkins has a nice algorithm for this re-encoding process, and again, I would put this (or something like it) as an interfacing ingredient on every feedforward connection in the neocortex.
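Here's a crude sketch of that kind of re-encoding (a simple k-winners-take-all over random projections; this is only a stand-in for Hawkins-style sparse coding, which additionally adapts the connections so that every neuron in the pool ends up firing about equally often):

```python
import numpy as np

# Crude sketch of re-encoding a feedforward input into a sparse code:
# each receiving neuron gets a random projection of the input, and
# only the top-k most strongly driven neurons fire.

rng = np.random.default_rng(0)

def sparse_encode(x, weights, k):
    drive = weights @ x                        # each neuron's total input
    winners = np.argsort(drive)[-k:]           # k most strongly driven
    code = np.zeros(weights.shape[0], dtype=int)
    code[winners] = 1                          # sparse binary output
    return code

n_in, n_out, k = 100, 400, 8                   # ~2% of neurons active
weights = rng.random((n_out, n_in))
x = rng.random(n_in)
print(sparse_encode(x, weights, k).sum())      # -> 8 active neurons
```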

  • Marcus et al.'s computation 4: "Working memory", potentially involving "Continuous or discrete attractor states in networks" in "Prefrontal cortex"

To me, the obvious explanation is that active generative models fade away gradually when they stop being used, rather than turning off abruptly. Maybe that's wrong, or there's more to it than that; I haven't really looked into it.
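Just to spell out the "fade away gradually" picture (pure speculation on my part, matching the hand-wavy description above): a model's activation decays each timestep unless it gets re-used, so recently active models linger for a while as a kind of working memory.

```python
# Toy sketch of the "fade away gradually" idea (speculative and
# illustrative only): activations decay unless the model is re-used.

def step(activations, used, decay=0.8):
    return {name: (1.0 if name in used else act * decay)
            for name, act in activations.items()}

acts = {"phone_number": 1.0, "lunch_plan": 1.0}
for t in range(5):
    acts = step(acts, used={"lunch_plan"})    # only one model gets re-used
print(acts)   # phone_number has faded to ~0.33, lunch_plan is still at 1.0
```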

  • Marcus et al.'s computation 5: "Decision making", potentially involving "Reinforcement learning of action-selection policies in PFC/BG system" and "winner-take-all networks" in "prefrontal cortex"

I didn't talk about neural implementations in my post on how generative models are selected, but I think reinforcement learning (process (e) in that post) is implemented in the basal ganglia. As far as I understand, the basal ganglia just kinda listens broadly across the whole frontal lobe of the neocortex (the home of planning and motor control), memorizes associations between arbitrary neocortical patterns and the rewards that follow them, and then gives a confidence-boost to whatever active neocortical pattern is anticipated to give the highest reward.

Winner-take-all is a combination of that basal ganglia mechanism and the fact that generative models suppress each other when they make contradictory predictions.
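Here's a toy sketch of that combination (my own illustration, with made-up patterns and rewards; the real circuitry is far more involved): learned values give a confidence boost to active patterns, and among mutually contradictory candidates the most strongly boosted one wins.

```python
# Toy sketch of the basal-ganglia "confidence boost" plus
# winner-take-all via mutual suppression (illustrative only).

value = {}   # learned association: neocortical pattern -> expected reward

def update_value(pattern, reward, lr=0.5):
    old = value.get(pattern, 0.0)
    value[pattern] = old + lr * (reward - old)   # simple running estimate

def select(candidates):
    """candidates: mutually exclusive patterns -> raw confidence."""
    boosted = {p: conf + value.get(p, 0.0) for p, conf in candidates.items()}
    return max(boosted, key=boosted.get)         # winner-take-all

# Hypothetical example: two incompatible motor plans competing.
update_value("reach_left", reward=1.0)
update_value("reach_right", reward=0.2)
print(select({"reach_left": 0.5, "reach_right": 0.6}))   # -> "reach_left"
```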

  • Marcus et al.'s computation 6: "Routing of information flow", potentially involving "Context-dependent tuning of activity in recurrent network dynamics, shifter circuits, oscillatory coupling, modulating excitation / inhibition balance during signal propagation", "common across many cortical areas"

Routing of information flow is a core part of the algorithm: whatever generative models are active, they know where to send their predictions (their message-passing messages).

I think it's more complicated than that in practice thanks to a biological limitation: I think the parts of the brain that work together need to be time-synchronized for some of the algorithms to work properly, but time-synchronization is impossible across the whole brain at once because the signals are so slow. So there might be some complicated neural machinery to dynamically synchronize different subregions of the cortex when appropriate for the current information-routing needs. I'm not sure. But anyway, that's really an implementation detail, from a high-level-algorithm perspective.

As usual, it's possible that there's more to "routing of information flow" that I don't know about.

  • Marcus et al.'s computation 7: "Gain control", potentially involving "Divisive normalization", "common across many cortical areas"

I assume that divisive normalization is part of the common cortical algorithm; I hear it's been observed all over the neocortex and even hippocampus, although I haven't really looked into it. Maybe it's even implicit in that Jeff Hawkins feedforward-connection-interface algorithm I mentioned above, but I haven't checked.
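For concreteness, here's the standard textbook form of divisive normalization (in the spirit of Carandini & Heeger) with a tiny numerical demo; I'm not claiming this exact equation is what the cortex implements.

```python
import numpy as np

# Standard divisive normalization (textbook form): each response is
# divided by a pooled sum over the population,
#   r_i = x_i**n / (sigma**n + sum_j x_j**n)

def divisive_normalization(x, sigma=1.0, n=2.0):
    x = np.asarray(x, dtype=float)
    return x**n / (sigma**n + np.sum(x**n))

print(divisive_normalization([1.0, 2.0, 4.0]))
print(divisive_normalization([10.0, 20.0, 40.0]))
# Scaling all the inputs by 10x barely changes the normalized pattern:
# the overall gain is controlled while the relative structure is kept.
```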

  • Marcus et al.'s computation 8: "Sequencing of events over time", potentially involving "Feed-forward cascades" in "language and motor areas" and "serial working memory" in "prefrontal cortex"

I think that every part of the cortex can learn sequences; as I mentioned, that's part of the data structure for each of the countless generative models built by the cortical algorithm.
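As a minimal illustration of sequence learning (purely a toy, not a claim about the actual cortical data structure): count which element tends to follow which, and predict the most likely successor.

```python
from collections import defaultdict

# Toy sequence learner: transition counts plus most-likely-successor
# prediction (illustrative only).

transitions = defaultdict(lambda: defaultdict(int))

def observe(sequence):
    for a, b in zip(sequence, sequence[1:]):
        transitions[a][b] += 1

def predict_next(current):
    followers = transitions[current]
    return max(followers, key=followers.get) if followers else None

observe(["do", "re", "mi", "fa"])
observe(["do", "re", "mi", "so"])
print(predict_next("re"))   # -> "mi"
```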

Despite what Marcus implies (and despite the impression we might get from ImageNet-solving CNNs), I think the time dimension is very important even for vision. There are a couple of reasons to think that, but maybe the simplest is the fact that humans can learn the "appearance" of an inherently dynamic thing (e.g. a splash) just as easily as we can learn the appearance of a static image. I don't think it's a separate mechanism.

(Incidentally, I started to do a deep dive into vision, to see whether it really needs any specific processing different from the common cortical algorithm as I understand it. In particular, the Dileep George neocortex-inspired vision model has a lot of vision-specific stuff, but (1) some of it is stuff that could have been learned from scratch, but they put it in manually for their convenience (this claim is in the paper, actually), and (2) some of it is stuff that fits into the category I'm calling "innate gross wiring diagram" in that block-quote at the top, and (3) some of it is just them doing a couple of things a little bit differently from how the brain does it, I think. So I wound up feeling like everything seems to fit together pretty well within the CCA framework, but I dunno, I'm still hazy on a number of details, and it's easy to go wrong speculating about complicated algorithms that I'm not actually coding up and testing.)

  • Marcus et al.'s computation 9: "Representation and transformation of variables", potentially involving "population coding" or a variant in "motor cortex and higher cortical areas"

Population coding fits right in as a core part of the common cortical algorithm as I understand it, and as such, I think it is used throughout the cortex. The original FAQ table also mentions something about dot products here, which is totally consistent with some of the gory details of (my current conception of) the common cortical algorithm. That's beyond the scope of this article.
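For readers who haven't seen it, here's the textbook population-coding picture (population-vector decoding in the spirit of Georgopoulos et al., not a claim about the exact cortical readout): each neuron has a preferred direction, and the encoded variable is recovered from a firing-rate-weighted sum of preferred directions.

```python
import numpy as np

# Textbook population-vector example: cosine-tuned neurons encode a
# direction, and a weighted vector sum of preferred directions
# decodes it (illustrative, with made-up parameters).

rng = np.random.default_rng(1)
n = 200
preferred = rng.uniform(0, 2 * np.pi, n)       # preferred directions
true_direction = np.pi / 3                     # the encoded variable

# Half-rectified cosine tuning curves plus a bit of noise:
rates = np.maximum(0, np.cos(preferred - true_direction)) \
        + 0.05 * rng.standard_normal(n)

# Decode: firing-rate-weighted sum of preferred direction vectors.
vec = np.array([np.sum(rates * np.cos(preferred)),
                np.sum(rates * np.sin(preferred))])
decoded = np.arctan2(vec[1], vec[0]) % (2 * np.pi)
print(round(true_direction, 3), round(decoded, 3))   # close to each other
```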

  • Marcus et al.'s computation 10: "Variable binding", potentially involving "Indirection" in "PFC / BG loops" or "Dynamically partitionable autoassociative networks" or "Holographic reduced representations" in "higher cortical areas"

They clarify later that by "variable binding" they mean "the transitory or permanent tying together of two bits of information: a variable (such as an X or Y in algebra, or a placeholder like subject or verb in a sentence) and an arbitrary instantiation of that variable (say, a single number, symbol, vector, or word)."

I say, no problem! Let's go with a language example.

I'm not a linguist (as will be obvious), but let's take the sentence "You jump". There is a "you" generative model which (among other things) makes a strong prediction that the "noun" generative model is also active. There is a "jump" generative model which (among other things) makes a strong prediction that the "verb" generative model is also active. Yet another generative model predicts that there will be a sentence in which a noun will be followed by a verb, with the noun being the subject. So you can snap all of these ingredients together into a larger generative model, "You jump". There you have it!
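Here's that story as a toy role-filler binding sketch (my own illustration of the above, not a real parser, and all the model contents are made up): a sentence-frame model has role slots, and binding just means composing it with word models that predict the matching roles.

```python
# Toy sketch of "variable binding" as composition of generative
# models: a sentence frame has role slots (the variables), and each
# word model predicts its own grammatical role (the instantiation).

word_models = {
    "you":  {"word": "you",  "role": "noun"},
    "jump": {"word": "jump", "role": "verb"},
}

def bind_sentence(frame_roles, words):
    """Compose a sentence frame (ordered role slots) with word models
    whose predicted roles match, yielding a bound structure."""
    bound = {}
    for slot_role, word in zip(frame_roles, words):
        assert word_models[word]["role"] == slot_role, "inconsistent models"
        bound[slot_role] = word       # the slot (variable) gets its filler
    return bound

# Frame: a noun (the subject) followed by a verb.
print(bind_sentence(["noun", "verb"], ["you", "jump"]))
# -> {'noun': 'you', 'verb': 'jump'}
```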

Again, I haven't thought about it in any depth. At the very least, there are about a zillion other generative models involved in this process that I'm leaving out. But the question is, are there aspects of language that can't be learned by this kind of algorithm?

Well, some weak, indirect evidence that this kind of algorithm can learn language is the startup Gamalon, which tries to do natural language processing using probabilistic programming with some kind of compositional generative model, and it works great. (Or so they say!) Here's their CEO Ben Vigoda describing the technology on YouTube, and don't miss their fun probabilistic-programming drawing demo starting at 29:00. It's weak evidence because I very much doubt that Gamalon uses exactly the same data structures and search algorithms as the neocortex; only vaguely similar ones, I think. (But I feel strongly that it's way more similar to the neocortex than a Transformer or RNN is, at least in the ways that matter.)

Conclusion

So, having read the Marcus et al. paper and a few of its references, I really wasn't moved at all away from my previous opinion: I still think the Common Cortical Algorithm / Cortical Uniformity hypothesis is basically right, modulo the caveats I mentioned at the top. (That said, I wasn't 100% confident about that hypothesis before, and I'm still not.) If anyone finds the Marcus et al. paper more convincing than I did, I'd love to talk about it!
