Gary Marcus vs Cortical Uniformity
Background / context
I wrote about cortical uniformity last year in Human Instincts, Symbol Grounding, and the Blank Slate Neocortex. (Other LessWrong discussion includes Alex Zhu recently and Jacob Cannell in 2015.) Here was my description (lightly edited, and omitting several footnotes that were in the original):
Instead of saying that the human brain has a vision processing algorithm, motor control algorithm, language algorithm, planning algorithm, and so on, in “Common Cortical Algorithm” (CCA) theory we say that (to a first approximation) we have a massive amount of “general-purpose neocortical tissue”, and if you dump visual information into that tissue, it does visual processing, and if you connect that tissue to motor control pathways, it does motor control, etc.
CCA theory, as I’m using the term, is a simplified model. There are almost definitely a couple caveats to it:
There are sorta “hyperparameters” on the generic learning algorithm which seem to be set differently in different parts of the neocortex. For example, some areas of the cortex have higher or lower density of particular neuron types. There are other examples too. I don’t think this significantly undermines the usefulness or correctness of CCA theory, as long as these changes really are akin to hyperparameters, as opposed to specifying fundamentally different algorithms. So my reading of the evidence is that if you put, say, motor nerves coming out of visual cortex tissue, the tissue could do motor control, but it wouldn’t do it quite as well as the motor cortex does.
There is almost definitely a gross wiring diagram hardcoded in the genome, i.e., a set of connections among different neocortical regions, and between those regions and other parts of the brain. These connections later get refined and edited during learning. Again, we can ask how much the existence of this innate gross wiring diagram undermines CCA theory. How complicated is the wiring diagram? Is it millions of connections among thousands of tiny regions, or just tens of connections among a few regions? Would the brain work at all if you started with a random wiring diagram? I don’t know for sure, but for various reasons, my current belief is that this initial gross wiring diagram is not carrying much of the weight of human intelligence, and thus that this point is not a significant problem for the usefulness of CCA theory. (This is a loose statement; of course it depends on what questions you’re asking.) I think of it more like: if it’s biologically important to learn a concept space that’s built out of associations between information sources X, Y, and Z, well, you just dump those three information streams into the same part of the cortex, and then the CCA will take it from there, and it will reliably build this concept space. So once you have the CCA nailed down, it kinda feels to me like you’re most of the way there....
Marcus et al.’s challenge
Now, when I was researching that post last year, I had read one book chapter opposed to cortical uniformity and another book chapter in favor of cortical uniformity, which were a good start, but I’ve been keeping my eye out for more on the topic. And I just found one! In 2014 Gary Marcus, Adam Marblestone, and Thomas Dean wrote a little commentary in Science Magazine called The Atoms of Neural Computation, with a case against cortical uniformity.
Out of the various things they wrote, one stands out as the most substantive and serious criticism: They throw down a gauntlet in their FAQ, with a table of 10 fundamentally different calculations that they think the neocortex does. Can one common cortical algorithm really subsume or replace all those different things?
Well, I accept the challenge!!
But first, I’d better say something about what the common cortical algorithm is and does, with the caveat that nobody knows all the details, and certainly not me. (The following paragraph is mostly influenced by reading a bunch of stuff by Dileep George & Jeff Hawkins, along with miscellaneous other books and papers that I’ve happened across in my totally random and incomplete neuroscience and AI self-education.)
The common cortical algorithm (according to me, and leaving out lots of aspects that aren’t essential for this post) is an algorithm that builds a bunch of generative models, each of which consists of predictions that other generative models are on or off, and/or predictions that input channels (coming from outside the neocortex—vision, hunger, etc.) are on or off. (“It’s symbols all the way down.”) All the predictions are attached to confidence values, and both the predictions and confidence values are, in general, functions of time (or of other parameters … again, I’m glossing over details here). The generative models are compositional, because if two of them make disjoint and/or consistent predictions, you can create a new model that simply predicts that both of those two component models are active simultaneously. For example, we can snap together a “purple” generative model and a “jar” generative model to get a “purple jar” generative model. Anyway, we explore the space of generative models, performing a search with a figure-of-merit that kinda mixes self-supervised learning, model predictive control, and Bayesian(ish) priors. Among other things, this search process involves something at least vaguely analogous to message-passing in a probabilistic graphical model.
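To make the "snap together" step concrete, here is a toy sketch in Python. Everything in it is an illustrative assumption rather than a claim about real neural data structures: the `GenerativeModel` class, the `compose` function, the string names for input channels, and the confidence numbers are all made up for this example. It only shows the compositional rule described above: two models can be combined when their predictions are disjoint or consistent, and the combined model simply predicts that both components are on.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

# A prediction says a target (another model, or an input channel) is on or off,
# with a confidence in [0, 1]. Time-dependence is left out of this sketch.
Prediction = Tuple[bool, float]


@dataclass
class GenerativeModel:
    name: str
    predictions: Dict[str, Prediction] = field(default_factory=dict)


def consistent(a: GenerativeModel, b: GenerativeModel) -> bool:
    """True if a and b never predict opposite states for the same target."""
    for target, (on, _conf) in a.predictions.items():
        if target in b.predictions and b.predictions[target][0] != on:
            return False
    return True


def compose(a: GenerativeModel, b: GenerativeModel) -> Optional[GenerativeModel]:
    """Build a model that predicts both components are active, if they are compatible."""
    if not consistent(a, b):
        return None
    # Confidence 1.0 is an arbitrary placeholder for this sketch.
    return GenerativeModel(f"{a.name} {b.name}", {a.name: (True, 1.0), b.name: (True, 1.0)})


purple = GenerativeModel("purple", {"color:purple": (True, 0.9), "color:green": (False, 0.9)})
green = GenerativeModel("green", {"color:green": (True, 0.9), "color:purple": (False, 0.9)})
jar = GenerativeModel("jar", {"shape:jar": (True, 0.8)})

purple_jar = compose(purple, jar)      # disjoint predictions, so this works
purple_green = compose(purple, green)  # contradictory predictions, so this is None
```

The search over which models to build and keep (the self-supervised learning, model predictive control, and priors mentioned above) is deliberately not modeled here.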
OK, now let’s dive into the Marcus et al. FAQ list:
Marcus et al.’s computation 1: “Rapid perceptual classification”, potentially involving “Receptive fields, pooling and local contrast normalization” in the “Visual system”
I think that “rapid perceptual classification” naturally comes out of the cortical algorithm, not only in the visual system but also everywhere else.
In terms of “rapid”, it’s worth noting that (1) many of the “rapid” responses that humans do are not done by the neocortex, and (2) the cortical message-passing algorithm supposedly involves both faster, less-accurate neural pathways (which prime the most promising generative models) and slower, more-accurate pathways (which, for example, properly do the “explaining away” calculation).
Marcus et al.’s computation 2: “Complex spatiotemporal pattern recognition”, potentially involving “Bayesian belief propagation” in “Sensory hierarchies”
The message-passing algorithm I mentioned above is either Bayesian belief propagation or something approximating it. Contra Marcus et al., Bayesian belief propagation is not just for spatiotemporal pattern recognition in the traditional sense; for example, to figure out what we’re looking at, the Bayesian analysis incorporates not only the spatiotemporal pattern of visual input data, but also semantic priors from our other senses and world-model. Thus if we see a word with a smudged letter in the middle, we “see” the smudge as the correct letter, even when the same smudge by itself would be ambiguous.
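The smudged-letter example can be illustrated with a single Bayes update. This is a minimal sketch, not full belief propagation over a graph, and all the numbers are invented: the smudge is assumed to look equally like "a" and "u", and the surrounding word "w_ter" is assumed to supply a strong prior for "a". The function name `posterior` is also just for illustration.

```python
def posterior(likelihood, prior):
    """Bayes' rule over a small discrete set of hypotheses."""
    unnormalized = {h: likelihood[h] * prior[h] for h in likelihood}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}


# How well each letter explains the smudged pixels (assumed: equally well).
smudge_likelihood = {"a": 0.5, "u": 0.5}

# Prior from the surrounding letters "w_ter" (assumed numbers).
context_prior = {"a": 0.99, "u": 0.01}

# Prior when the smudge is seen by itself, with no context.
flat_prior = {"a": 0.5, "u": 0.5}

in_word = posterior(smudge_likelihood, context_prior)
alone = posterior(smudge_likelihood, flat_prior)
```

With these made-up inputs, the smudge alone stays ambiguous, while the same smudge inside the word is read as "a" with high confidence, because the semantic prior does the disambiguating.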
Marcus et al.’s computation 3: “Learning efficient coding of inputs”, potentially involving “Sparse coding” in “Sensory and other systems”
I think that not just sensory inputs but every feedforward connection in the neocortex (most of which are neocortex-to-neocortex) has to be re-encoded into the data format that the neocortex knows what to do with. That is, different possible forward inputs correspond to stimulation of different sparse subsets out of a pool of receiving neurons, where the sparsity is relatively uniform, all the receiving neurons in the pool are stimulated a similar fraction of the time (for efficient use of computational resources), the different subsets have low overlap, etc. Jeff Hawkins has a nice algorithm for this re-encoding process, and again, I would put this (or something like it) as an interfacing ingredient on every feedforward connection in the neocortex.
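As a rough picture of what "re-encode into a sparse subset of a pool of receiving neurons" could mean, here is a toy k-winners-take-all encoder. To be clear, this is not Jeff Hawkins's algorithm: it is a fixed random projection followed by picking the top `k` cells, and `make_encoder`, `n_cells`, and `k` are hypothetical names chosen for this sketch. It only enforces the "uniform sparsity" property. Making every cell fire a similar fraction of the time and keeping overlaps low would need extra machinery (e.g. learning and some form of usage balancing) that is left out.

```python
import random


def make_encoder(n_inputs, n_cells, k, seed=0):
    """Return a function mapping an input vector to a set of exactly k active cells."""
    rng = random.Random(seed)
    # Fixed random weights from each input to each receiving cell (an assumption;
    # a real system would presumably learn these).
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_inputs)] for _ in range(n_cells)]

    def encode(x):
        scores = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
        ranked = sorted(range(n_cells), key=lambda i: scores[i], reverse=True)
        return frozenset(ranked[:k])  # only the k best-matching cells become active

    return encode


encode = make_encoder(n_inputs=8, n_cells=100, k=5)
code_a = encode([1, 0, 0, 1, 0, 1, 0, 0])
code_b = encode([0, 1, 1, 0, 1, 0, 1, 0])
```

Whatever the input, exactly 5 of the 100 cells are active, so downstream tissue always sees the same sparsity level.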
Marcus et al.’s computation 4: “Working memory”, potentially involving “Continuous or discrete attractor states in networks” in “Prefrontal cortex”
To me, the obvious explanation is that active generative models fade away gradually when they stop being used, rather than turning off abruptly. Maybe that’s wrong, or there’s more to it than that; I haven’t really looked into it.
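The "fade away gradually" idea can be sketched as a simple exponential decay of each model's activation. This is only an illustration of the guess above, not a model of prefrontal cortex: the function `step`, the `decay` rate, and the `floor` cutoff are all assumptions invented for the example.

```python
def step(activations, in_use, decay=0.8, floor=0.05):
    """Advance one time step: models in use are refreshed, the rest fade."""
    updated = {}
    for name, level in activations.items():
        level = 1.0 if name in in_use else level * decay
        if level >= floor:  # below the floor, the model counts as switched off
            updated[name] = level
    return updated


state = {"phone number": 1.0}

# Three steps without rehearsing the number: still active, but weaker.
for _ in range(3):
    state = step(state, in_use=set())
after_three = dict(state)

# Many more steps without rehearsal: the model eventually drops out.
for _ in range(20):
    state = step(state, in_use=set())
after_many = dict(state)
```

In this picture, "holding something in working memory" just means the model has not yet decayed below the cutoff, and rehearsal is whatever keeps putting it back in the `in_use` set.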
Marcus et al.’s computation 5: “Decision making”, potentially involving “Reinforcement learning of action-selection policies in PFC/BG system” and “winner-take-all networks” in “prefrontal cortex”
I didn’t talk about neural implementations in my post on how generative models are selected, but I think reinforcement learning (process (e) in that post) is implemented in the basal ganglia. As far as I understand, the basal ganglia just kinda listens broadly across the whole frontal lobe of the neocortex (the home of planning and motor control), and memorizes associations between arbitrary neocortical patterns and associated rewards, and then it can give a confidence-boost to whatever active neocortical pattern is anticipated to give the highest reward.
Winner-take-all is a combination of that basal ganglia mechanism, and the fact that generative models suppress each other when they make contradictory predictions.
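Here is a toy version of that story, with the caveat that every name and number is made up for illustration (`update_reward`, `select`, the learning rate, the size of the boost). One function memorizes pattern-to-reward associations with a running average, and the other boosts the active pattern with the highest anticipated reward and then lets mutually contradictory patterns compete so that only one survives.

```python
def update_reward(expected_reward, pattern, reward, lr=0.1):
    """Nudge the remembered reward for a pattern toward the reward just observed."""
    old = expected_reward.get(pattern, 0.0)
    expected_reward[pattern] = old + lr * (reward - old)


def select(confidences, expected_reward, boost=0.5):
    """Boost the most promising active pattern, then pick a single winner."""
    best = max(confidences, key=lambda p: expected_reward.get(p, 0.0))
    boosted = dict(confidences)
    boosted[best] += boost
    # The candidates are assumed to make contradictory predictions, so they
    # suppress each other and only the most confident one stays active.
    return max(boosted, key=boosted.get)


confidences = {"reach for cookie": 0.6, "reach for pen": 0.7}
expected_reward = {"reach for cookie": 1.0, "reach for pen": 0.1}

with_boost = select(confidences, expected_reward)
without_boost = select(confidences, expected_reward, boost=0.0)
```

Without the reward-based boost, the slightly more confident "reach for pen" wins; with it, "reach for cookie" wins.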
Marcus et al.’s computation 6: “Routing of information flow”, potentially involving “Context-dependent tuning of activity in recurrent network dynamics, shifter circuits, oscillatory coupling, modulating excitation / inhibition balance during signal propagation”, “common across many cortical areas”
Routing of information flow is a core part of the algorithm: whatever generative models are active, they know where to send their predictions (their message-passing messages).
I think it’s more complicated than that in practice thanks to a biological limitation: I think the parts of the brain that work together need to be time-synchronized for some of the algorithms to work properly, but time-synchronization is impossible across the whole brain at once because the signals are so slow. So there might be some complicated neural machinery to dynamically synchronize different subregions of the cortex when appropriate for the current information-routing needs. I’m not sure. But anyway, that’s really an implementation detail, from a high-level-algorithm perspective.
As usual, it’s possible that there’s more to “routing of information flow” that I don’t know about.
Marcus et al.’s computation 7: “Gain control”, potentially involving “Divisive normalization”, “common across many cortical areas”
I assume that divisive normalization is part of the common cortical algorithm; I hear it’s been observed all over the neocortex and even hippocampus, although I haven’t really looked into it. Maybe it’s even implicit in that Jeff Hawkins feedforward-connection-interface algorithm I mentioned above, but I haven’t checked.
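For readers who haven't seen it, divisive normalization means each neuron's response is divided by the pooled activity of its neighbors. Below is a minimal sketch of one common textbook form, with inputs assumed to be non-negative. The parameter names `sigma` and `n` and their default values are assumptions for this example, and nothing here is specific to how the cortex or the Hawkins algorithm implements it.

```python
def divisive_normalization(drives, sigma=1.0, n=2.0):
    """Divide each (non-negative) input drive by the pooled drive of the whole group."""
    pool = sigma ** n + sum(d ** n for d in drives)
    return [d ** n / pool for d in drives]


weak = divisive_normalization([1.0, 2.0, 3.0])
strong = divisive_normalization([10.0, 20.0, 30.0])  # same pattern, 10x the intensity
```

The point of the gain control is visible in the two calls: the relative pattern of responses is the same for the weak and the strong input, while the outputs stay bounded (their sum is always below 1) no matter how intense the input gets.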
Marcus et al.’s computation 8: “Sequencing of events over time”, potentially involving “Feed-forward cascades” in “language and motor areas” and “serial working memory” in “prefrontal cortex”
I think that every part of the cortex can learn sequences; as I mentioned, that’s part of the data structure for each of the countless generative models built by the cortical algorithm.
Contrary to what Marcus implies, I think the time dimension is very important even for vision, despite the impression we might get from ImageNet-solving CNNs. There are a couple reasons to think that, but maybe the simplest is the fact that humans can learn the “appearance” of an inherently dynamic thing (e.g. a splash) just as easily as we can learn the appearance of a static image. I don’t think it’s a separate mechanism.
(Incidentally, I started to do a deep dive into vision, to see whether it really needs any specific processing different than the common cortical algorithm as I understand it. In particular, the Dileep George neocortex-inspired vision model has a lot of vision-specific stuff, but (1) some of it is stuff that could have been learned from scratch, but they put it in manually for their convenience (this claim is in the paper, actually), and (2) some of it is stuff that fits into the category I’m calling “innate gross wiring diagram” in that block-quote at the top, and (3) some of it is just them doing a couple things a little bit different from how the brain does it, I think. So I wound up feeling like everything seems to fit together pretty well within the CCA framework, but I dunno, I’m still hazy on a number of details, and it’s easy to go wrong speculating about complicated algorithms that I’m not actually coding up and testing.)
Marcus et al.’s computation 9: “Representation and transformation of variables”, potentially involving “population coding” or a variant in “motor cortex and higher cortical areas”
Population coding fits right in as a core part of the common cortical algorithm as I understand it, and as such, I think it is used throughout the cortex.
Marcus et al.’s computation 10: “Variable binding”, potentially involving “Indirection” in “PFC / BG loops” or “Dynamically partitionable autoassociative networks” or “Holographic reduced representations” in “higher cortical areas”
They clarify later that by “variable binding” they mean “the transitory or permanent tying together of two bits of information: a variable (such as an X or Y in algebra, or a placeholder like subject or verb in a sentence) and an arbitrary instantiation of that variable (say, a single number, symbol, vector, or word).”
I say, no problem! Let’s go with a language example.
I’m not a linguist (as will be obvious), but let’s take the sentence “You jump”. There is a “you” generative model which (among other things) makes a strong prediction that the “noun” generative model is also active. There is a “jump” generative model which (among other things) makes a strong prediction that the “verb” generative model is also active. Yet another generative model predicts that there will be a sentence in which a noun will be followed by a verb, with the noun being the subject. So you can snap all of these ingredients together into a larger generative model, “You jump”. There you have it!
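A toy sketch of that binding step follows. It is a drastic simplification with invented names (`CATEGORY`, `TEMPLATE`, `bind`): each word model is reduced to the category it predicts, the "noun followed by verb" sentence model is reduced to a list of slots, and binding is just checking that each word's prediction fits the slot it lands in.

```python
# Each word model predicts (among other things) that a category model is active.
CATEGORY = {"you": "noun", "dog": "noun", "jump": "verb", "run": "verb"}

# The sentence model: a noun followed by a verb, with the noun as the subject.
TEMPLATE = [("subject", "noun"), ("action", "verb")]


def bind(words):
    """Tie each word to a slot in the sentence model, or return None if it doesn't fit."""
    if len(words) != len(TEMPLATE):
        return None
    binding = {}
    for word, (slot, category) in zip(words, TEMPLATE):
        word = word.lower()
        if CATEGORY.get(word) != category:
            return None  # the word's prediction contradicts the slot's prediction
        binding[slot] = word
    return binding


you_jump = bind(["You", "jump"])
jump_you = bind(["jump", "You"])
```

Here `you_jump` comes out as a mapping from the variables ("subject", "action") to their instantiations ("you", "jump"), while the scrambled order fails to bind to this particular sentence model.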
Again, I haven’t thought about it in any depth. At the very least, there are about a zillion other generative models involved in this process that I’m leaving out. But the question is, are there aspects of language that can’t be learned by this kind of algorithm?
Well, some weak, indirect evidence that this kind of algorithm can learn language is the startup Gamalon, which tries to do natural language processing using probabilistic programming with some kind of compositional generative model, and it works great. (Or so they say!) Here’s their CEO Ben Vigoda describing the technology on YouTube, and don’t miss their fun probabilistic-programming drawing demo starting at 29:00. It’s weak evidence because I very much doubt that Gamalon uses exactly the same data structures and search algorithms as the neocortex; they’re only vaguely similar, I think. (But I feel strongly that it’s way more similar to the neocortex than a Transformer or RNN is, at least in the ways that matter.)
So, having read the Marcus et al. paper and a few of its references, it really didn’t move me at all away from my previous opinion: I still think the Common Cortical Algorithm / Cortical Uniformity hypothesis is basically right, modulo the caveats I mentioned at the top. (That said, I wasn’t 100% confident about that hypothesis before, and I’m still not.) If anyone finds the Marcus et al. paper more convincing than I did, I’d love to talk about it!