I have not dug into the math in the paper yet, but the surprising thing from my current perspective is: backprop is basically for supervised learning, while Hebbian learning is basically for unsupervised learning. In particular, Hebbian learning has been touted as an (inefficient but biologically plausible) algorithm for PCA. How can you chain a bunch of PCAs together and get gradient descent?
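For concreteness, here is a minimal numpy sketch (mine, not the paper's) of the "Hebbian learning does PCA" claim, using Oja's rule, a stabilized Hebbian update:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data whose dominant direction of variance is the first axis.
X = rng.normal(size=(2000, 3)) * np.array([2.0, 1.0, 0.5])

w = rng.normal(size=3)
w /= np.linalg.norm(w)
eta = 0.01

# Oja's rule: the Hebbian term (eta * y * x) plus a decay term
# (-eta * y^2 * w) that keeps the weight vector from blowing up.
for _ in range(3):  # a few passes over the data
    for x in X:
        y = w @ x
        w += eta * y * (x - y * w)

# w converges (up to sign) to the top principal component,
# here approximately the first coordinate axis.
print(abs(w[0]))
```

The update for each weight uses only the pre- and post-synaptic activity, which is the biological-plausibility selling point.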
Aside from that, here’s what I understood from the paper so far.
By predictive coding, they basically mean: take the structure of the computation graph (eg, the structure of the NN) and interpret it as a Gaussian bayes net instead.
The ability to calculate learning updates using only local information then follows from the general fact that bayes nets let you efficiently compute gradient descent (and some other update rules, such as the EM algorithm) using only local information, so you don't have to perform automatic differentiation across the whole network.
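To illustrate the locality point with a toy example of my own (not the paper's exact formulation): if you write the network's negative log-likelihood as a sum of squared local prediction errors, then the gradient for any weight matrix involves only the error and activity at its own layer, with no global backward pass:

```python
import numpy as np

rng = np.random.default_rng(1)

# A two-layer "Gaussian bayes net" view of a linear network:
# each layer's value is predicted by the one below, and the
# energy (negative log-likelihood) is a sum of squared local errors.
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x0 = rng.normal(size=3)   # input (clamped)
x1 = rng.normal(size=4)   # latent layer value
x2 = rng.normal(size=2)   # output (clamped to a target)

def energy(W1, W2):
    e1 = x1 - W1 @ x0     # local prediction error at layer 1
    e2 = x2 - W2 @ x1     # local prediction error at layer 2
    return 0.5 * (e1 @ e1 + e2 @ e2)

# The gradient w.r.t. W2 is purely local: the error at layer 2
# times the activity of layer 1.
e2 = x2 - W2 @ x1
local_grad = -np.outer(e2, x1)

# Check against a numerical gradient of the full energy.
num_grad = np.zeros_like(W2)
eps = 1e-6
for i in range(W2.shape[0]):
    for j in range(W2.shape[1]):
        Wp = W2.copy(); Wp[i, j] += eps
        Wm = W2.copy(); Wm[i, j] -= eps
        num_grad[i, j] = (energy(W1, Wp) - energy(W1, Wm)) / (2 * eps)

print(np.allclose(local_grad, num_grad, atol=1e-5))
```

The catch, of course, is that the latent values like x1 have to be settled by an iterative inference procedure, which is where the slowdown relative to backprop comes from.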
So the local computation of gradient descent isn’t surprising: it’s standard for graphical models, it’s just unusual for NNs. This is one reason why graphical models might be a better model of the brain than artificial neural networks.
The contribution of this paper is the nice correspondence between Gaussian bayes nets and NN backprop. I’m not really sure this should be exciting. It’s not like it’s useful for anything. If we were really excited about local learning rules, well, we already had some.
Maybe the tremendous success of backprop lends some fresh credibility to bayes nets due to this correspondence. IE, maybe we are supposed to make an inference like: “I know backprop on NNs can be super effective, so I draw the lesson that learning for bayes nets (at least Gaussian bayes nets) can also be super effective, at a 100x slowdown.” But this should have already been plausible, I claim. The machine learning community didn’t really put bayes nets and NNs side by side and find bayes nets horribly lacking in learning capacity. Rather, I think the 100x slowdown was the primary motivator for preferring NNs: bayes nets eliminate the need for an extra automatic differentiation step, but at the cost of a more expensive inference algorithm.
In particular, someone might take this as evidence that the brain uses Gaussian bayes nets in particular, because we now know Gaussian bayes nets approximate backprop, and we know backprop is super effective. I think this would be a mistaken inference: I don’t think this provides much evidence that Gaussian bayes nets are especially intelligent compared to other Bayes nets.
On the other hand, the simplicity of the math for the Gaussian case does provide some evidence. It seems more plausible that the brain uses Gaussian bayes nets than, say, particle filters.
the surprising thing from my current perspective is: backprop is basically for supervised learning, while Hebbian learning is basically for unsupervised learning
That’s either a poetic analogy or a factual error. For example, autoencoders belong to unsupervised learning and are trained through backprop.
I don’t think my reasoning was particularly strong there, but the point is less “how can you use gradient descent, a supervised-learning tool, to get unsupervised stuff????” and more “how can you use Hebbian learning, an unsupervised-learning tool, to get supervised stuff????”
Autoencoders transform unsupervised learning into supervised learning in a specific way (by framing “understand the structure of the data” as “be able to reconstruct the data from a smaller representation”).
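A minimal sketch of that reframing (my own toy code, linear layers only): the "label" is just the input itself, so ordinary gradient descent on reconstruction error ends up doing unsupervised dimensionality reduction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Data that lives near a 2-D subspace of a 5-D space.
Z = rng.normal(size=(500, 2))
X = Z @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(500, 5))

# A linear autoencoder: encode 5 -> 2, decode 2 -> 5. The
# "supervised" target is the input itself.
We = rng.normal(size=(5, 2)) * 0.1
Wd = rng.normal(size=(2, 5)) * 0.1
eta = 0.01

def loss(We, Wd):
    R = X @ We @ Wd - X
    return (R * R).mean()

first = loss(We, Wd)
for _ in range(500):
    H = X @ We                      # code (smaller representation)
    R = H @ Wd - X                  # reconstruction error
    gWd = 2 * H.T @ R / len(X)      # gradient through the decoder
    gWe = 2 * X.T @ (R @ Wd.T) / len(X)  # chained through the encoder
    We -= eta * gWe
    Wd -= eta * gWd

print(first, loss(We, Wd))
```

After training, the encoder has learned (a basis for) the 2-D subspace the data lives in, which is exactly the kind of structure PCA would find.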
But the reverse is much less common. EG, it would be a little weird to apply clustering (an unsupervised learning technique) to a supervised task. It would be surprising to find out that doing so was actually equivalent to some pre-existing supervised learning tool. (But perhaps not as surprising as I was making it out to be, here.)
“how can you use Hebbian learning, an unsupervised-learning tool, to get supervised stuff????”
Thanks for the clarification. I guess the key insight for this is: they’re both Turing complete.
“be able to reconstruct the data from a smaller representation”
Doesn’t this sound like the thalamus includes a smaller representation than the cortices?
it would be a little weird to apply clustering (an unsupervised learning technique) to a supervised task.
Actually, this is one form of feature engineering. I’m confident you can find many examples on Kaggle! Yes, you’re most probably right that this is telling us something important, just as it’s telling us something important that, in some sense, all NP-complete problems are arguably the same problem.
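For instance, a minimal sketch (assuming scikit-learn): feeding k-means centroid distances to a logistic regression lets a linear model handle data it could not separate from the raw coordinates alone:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Two-moons data: not linearly separable, so plain logistic
# regression struggles on the raw coordinates.
X, y = make_moons(n_samples=1000, noise=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

baseline = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)

# Feature engineering with clustering: distances to k-means
# centroids give the linear model a nonlinear view of the data.
km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(X_tr)
clf = LogisticRegression(max_iter=1000).fit(km.transform(X_tr), y_tr)
engineered = clf.score(km.transform(X_te), y_te)

print(baseline, engineered)
```

So the unsupervised step is doing real work for the supervised task, which fits the point above.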