So unfortunately this is one of those arguments that rapidly descends into which prior you should apply and how you should update on what evidence, but.
Your entire post basically hinges on this point, and I find it unconvincing. Bionets are very strange beasts that cannot even implement backprop in the way we’re used to; it’s not remotely obvious that we would recognize known algorithms even if they were what the cortex amounted to. I will confess that I’m not a professional neuroscientist, but Beren Millidge is and he’s written that “it is very clear that ML models have basically cracked many of the secrets of the cortex”. He knows more about neuroscience than I’m going to on any reasonable timescale, so I’m happy to defer to him.
Even if this weren’t true, we have other evidence from deep learning to suggest that something like it is true in spirit. We now have several different architectures that reach parity with, but do not substantially exceed, the transformer: RWKV (RNN), xLSTM, Mamba, Based, etc. This implies they have a shared bottleneck and that most gains are from scaling. I will admit this is a subject with a lot of uncertainty, so I could be wrong, but I really think there’s a cognitive bias here: people look at the deep learning transformer language model stack, which in the grand scheme of things really is very simple, and feel like it doesn’t satisfy their expectation for a “simple core of intelligence”, because the blank spot in their map, their ignorance of the function of the brain (but probably not the actual function of the brain!), is simpler than the manifest known mechanisms of self attention, multi-layer perceptron, backprop and gradient descent on a large pile of raw unsorted sense data and compute. Because they’re expecting the evidence from a particular direction, they say “well, this deep learning thing is a hack, it doesn’t count even if it produces things that are basically sapient by any classic sci-fi definition” and go on doing what, from the standpoint of an unbiased observer, are epistemically wild mental gymnastics.
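To give a concrete sense of how simple that manifest mechanism is, here is a minimal sketch of a single pre-norm transformer block (self attention plus a multi-layer perceptron) in PyTorch. The dimensions are illustrative and the causal mask is omitted, so treat it as a toy, not any particular production model:

```python
# A minimal sketch of the "simple core": one pre-norm transformer block.
# Hyperparameters are illustrative; real language models just stack many of
# these and train them with backprop + gradient descent on piles of data.
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        # Self-attention with a residual connection (causal mask omitted)...
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # ...followed by a position-wise MLP with a residual connection.
        return x + self.mlp(self.ln2(x))

# Usage: a batch of 4 sequences, 16 tokens each, embedded in 512 dimensions.
x = torch.randn(4, 16, 512)
print(TransformerBlock()(x).shape)  # torch.Size([4, 16, 512])
```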
I think we can clearly conclude that the cortex doesn’t do what NNs do, because the cortex is incapable of learning conditioned responses; that’s an uncontested fiefdom of the cerebellum, while for NNs learning a conditioned response is the simplest thing to do. It also crushes the hypothesis of the Hebbian rule. I think the majority of people in the neurobiology neighbourhood haven’t properly updated on this fact.
We now have several different architectures that reach parity with, but do not substantially exceed, the transformer: RWKV (RNN), xLSTM, Mamba, Based, etc. This implies they have a shared bottleneck and that most gains are from scaling.
It can also imply that the shared bottleneck is a property of the overall approach.
is simpler than the manifest known mechanisms of self attention, multi-layer perceptron, backprop and gradient descent
I don’t know where you get “simpler”. The description of each thing you mentioned can fit in, what, a paragraph, a page? I don’t think Steven expects the description of a “simple core of intelligence” to be shorter than a paragraph describing backprop.
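For a sense of that scale, here is what the paragraph-sized description of backprop looks like when written out by hand: a toy two-layer network in NumPy with made-up dimensions and random data, purely for illustration.

```python
# Backprop for a tiny two-layer network, derived by hand in NumPy.
# Toy dimensions and random data; the point is how short the rule is.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 10))                 # 32 examples, 10 features
Y = rng.normal(size=(32, 1))                  # regression targets
W1 = 0.1 * rng.normal(size=(10, 16))          # hidden weights
W2 = 0.1 * rng.normal(size=(16, 1))           # output weights

for step in range(1000):
    # Forward pass.
    h = np.tanh(X @ W1)
    y_hat = h @ W2
    err = y_hat - Y                           # d(loss)/d(y_hat) for 0.5*MSE

    # Backward pass: the chain rule, layer by layer.
    dW2 = h.T @ err
    dh = err @ W2.T
    dW1 = X.T @ (dh * (1 - h ** 2))           # tanh'(z) = 1 - tanh(z)^2

    # Gradient descent update.
    W1 -= 1e-2 * dW1 / len(X)
    W2 -= 1e-2 * dW2 / len(X)
```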
Beren Millidge is and he’s written that “it is very clear that ML models have basically cracked many of the secrets of the cortex”
I guess if you look at the brain at a sufficiently coarse-grained level, you would discover that lots of parts of the brain perform something like generalized linear regression. That would be less a fact about the brain and more a fact about reality: generalized linear dependencies are everywhere, and it’s useful to learn them. It’s reasonable to think that the brain also learns what a transformer learns. That doesn’t mean it’s the only thing the brain learns.
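As a toy example of what “generalized linear” means here, a logistic regression (a GLM with a sigmoid inverse link) fit by gradient descent on synthetic data; the weights and data are invented for illustration.

```python
# Logistic regression, i.e. a generalized linear model: a linear score
# pushed through an inverse link (sigmoid), fit by gradient descent.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])                 # invented "ground truth"
p_true = 1 / (1 + np.exp(-(X @ true_w)))
y = (rng.uniform(size=200) < p_true).astype(float)  # Bernoulli samples

w = np.zeros(3)
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w)))                  # inverse link
    w -= 0.1 * X.T @ (p - y) / len(y)               # gradient of the log-loss

print(w)  # roughly recovers true_w, up to sampling noise
```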
Sure, “The Cerebellum Is The Seat of Classical Conditioning.” But I’m not sure it’s the only one. Delay eyeblink conditioning is cerebellar-dependent, which we know from lesion studies. That does not generalize to all conditioned responses:
Trace eyeblink conditioning requires the hippocampus and medial prefrontal cortex in addition to the cerebellum (Takehara 2003).
Fear conditioning is driven by the amygdala, not the cerebellum.
Hebbian plasticity isn’t crushed by cerebellar learning, either. Cerebellar long-term depression is a timing-sensitive variant of Hebb’s rule (van Beugen et al. 2013).
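To make the distinction concrete, here is a toy sketch of a plain Hebbian update next to a timing-sensitive (STDP-style) variant. The constants and the exponential window are invented for illustration, not a model of cerebellar LTD:

```python
# Plain Hebb vs. a timing-sensitive variant, as toy weight-update rules.
import numpy as np

def hebbian_update(w, pre, post, lr=0.01):
    """Classic Hebb: weights grow wherever pre- and post-activity co-occur."""
    return w + lr * np.outer(post, pre)

def timing_sensitive_update(w, dt, lr=0.01, tau=20.0):
    """STDP-style rule: the sign and size of the change depend on the relative
    timing dt = t_post - t_pre (ms), not just on co-activity."""
    if dt > 0:   # pre fired before post -> strengthen
        return w + lr * np.exp(-dt / tau)
    else:        # post fired before (or with) pre -> weaken
        return w - lr * np.exp(dt / tau)
```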
what! big if true. what papers originated this claim for you?
Here are lots of links.
What? This isn’t my understanding at all, and a quick check with an LLM also disputes this.