Firstly, my claim about human learning: the update for a neuron depends only on signals from the neurons one step out [1]; the signals from more distant neurons are screened off by the signals from nearer ones. Compare to backprop, where the update on a weight depends not only on the activations of the next few layers, but on the activations of every layer down to the final output.
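To make the backprop half of that contrast concrete, here is a minimal sketch. It assumes a toy chain of scalar linear units (`h[i+1] = weights[i] * h[i]`) with a squared-error loss at the output; the function names and numbers are made up for illustration and are not a model of any real network. Two chains differ only in their last weight, so every signal near the first weight is identical, yet the backprop gradient on the first weight differs.

```python
def forward(weights, x):
    """Activations of a scalar linear chain: h[i+1] = weights[i] * h[i]."""
    h = [x]
    for w in weights:
        h.append(w * h[-1])
    return h


def backprop_grad_first_weight(weights, x, target):
    """d(loss)/d(weights[0]) for loss = 0.5 * (output - target)**2."""
    h = forward(weights, x)
    grad = h[-1] - target              # d(loss)/d(output)
    for w in reversed(weights[1:]):
        grad *= w                      # chain rule back through every later layer
    return grad * h[0]                 # finally, the input to the first weight


# Hypothetical toy values: the two chains differ only in the LAST weight.
chain_a = [0.5, 2.0, 2.0, 1.0]
chain_b = [0.5, 2.0, 2.0, 3.0]

# Signals up to three steps from the first weight are the same in both chains...
near_a = forward(chain_a, 1.0)[:4]
near_b = forward(chain_b, 1.0)[:4]

# ...but the backprop update on the first weight is not.
grad_a = backprop_grad_first_weight(chain_a, 1.0, 0.0)
grad_b = backprop_grad_first_weight(chain_b, 1.0, 0.0)
```

In this linear toy the distant dependence enters through the later weights and the output error; with nonlinear units it would also enter through the later activations. Any rule that reads only the nearby signals would, by construction, assign the same update in both cases.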
Therefore in humans, neurons will be in approximate local equilibria with each other (low local predictive error on average across activations). Frequent-enough predictive errors somewhere [2] will cause a change to cascade slowly: at first the nearest neurons change to approach a new local equilibrium; when the error recurs, the next-nearest neurons change to match the new signals from the nearest ones, while the nearest change a little more to match the second occurrence of the new signals; and so on.
This means that low-salience infrequent events will never penetrate far enough upstream to correct cognitive patterns that contribute maladaptively to their predictive error, which leaves a lot of slack for un-optimized cognitive patterns far away from sensory stimuli.
[1] Or perhaps a small number of steps; maybe the brain has some sneaky tricks. But certainly not neurons twenty steps away.
[2] For an important one-time event, the brain has a trick using memory and attention: it keeps the stimuli looping in attention for long enough to write the essence of it to memory, and then accesses it enough from memory to propagate important changes to other relevant brain areas.
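The cascade described above can be sketched as a toy simulation. Everything here is an assumption made for illustration: a one-dimensional chain of nodes, scalar signals, a fixed learning rate `lr`, and a rule where each node moves toward the value its nearer-to-sensory neighbor had before the current event. It ignores the memory-and-attention trick from [2] and is not meant as a model of real neurons.

```python
def run_events(depth, n_events, signal=1.0, lr=0.5):
    """Node values after n_events repeats of the same new stimulus.

    Node 0 is the sensory node; node i only ever sees node i - 1.
    """
    x = [0.0] * (depth + 1)        # old equilibrium: every node sits at 0
    for _ in range(n_events):
        x[0] = signal              # the surprising stimulus arrives
        prev = list(x)             # snapshot: nodes react to pre-update neighbors
        for i in range(1, depth + 1):
            # Purely local rule: close part of the gap to the nearer neighbor.
            x[i] = prev[i] + lr * (prev[i - 1] - prev[i])
    return x


def deepest_changed(x):
    """Index of the deepest node that has left the old equilibrium, if any."""
    changed = [i for i, v in enumerate(x) if v != 0.0]
    return max(changed) if changed else None


rare = run_events(depth=20, n_events=3)       # an infrequent event
frequent = run_events(depth=20, n_events=20)  # a frequent one
```

Under these assumptions a node at depth d cannot move at all until the stimulus has occurred at least d times, and even then the change that arrives is strongly attenuated. That is the sense in which rare, low-salience events leave distant patterns untouched.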
OK, thanks. So the deep layers in the human brain still learn, just slowly / less data-efficiently, compared to the deep layers in LLMs.
Doesn’t this prove too much, though? It sounds like you are arguing that, in general, the deeper neurons in human brains need more datapoints of experience to learn anything than the deeper neurons in LLMs do. That is, it sounds like you are saying that backprop is just a superior learning algorithm, one that propagates updates to all the deep weights more quickly than the more local process the brain uses.
But in practice humans seem to be more data-efficient than LLMs.
One difference is that we’re not just doing feedforward learning; one of my aforementioned hypotheses (attention [1] causes cognitive patterns far from sensory data to interact with each other, improving their coherence) points at a way that learning can effectively progress even if the connection to immediate sensory prediction grows tenuous for rare stimuli.
That’s an example of a way we could be more sample-efficient than a feedforward learner, even if the latter ends up with some parts more ruthlessly optimized within their context.
[1] Human attention, not to be confused with transformer attention.