One specific instance of the feature-learning mystery is that MLPs can learn sparse parities, i.e. the output is the XOR of some k bits of an n-bit input, and an MLP learns this in close to O(n^k) time, which is essentially the computational limit here. In this paper they give a very nice intuition (Section 4.1) for why, even in a network with a single layer (and a ReLU on top of it), the gradients will contain some information about the solution.
TLDR: the gradient of “ReLU of the sum of incoming activations”, if we take all incoming weights to be one (that’s the example they study), is just a majority function. An interesting property of the majority function is that the contribution of k-wise feature interactions to it (its degree-k Fourier coefficients) decays exponentially with k. The coefficient on the target parity is therefore tiny but nonzero, and that is what makes the gradients informative: the signal can be picked out with on the order of n^k samples, matching the limit above.
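To make the decay concrete, here is a small brute-force check (my own sketch, not code from the paper): for a small odd n we can enumerate all inputs and compute the Fourier coefficient of the majority function on a degree-k parity directly.

```python
# Brute-force check: the Fourier coefficient of the majority function on a
# degree-k parity shrinks rapidly with k. Minimal sketch; n and the choice
# S = {first k coordinates} are illustrative.
from itertools import product
from math import prod

n = 9  # small enough to enumerate all 2^n inputs exactly; odd, so no ties

def maj(x):
    """Majority of +/-1 bits."""
    return 1 if sum(x) > 0 else -1

for k in range(1, n + 1, 2):  # even-degree coefficients of Maj are exactly 0
    # Fourier coefficient: E_x[ Maj(x) * prod_{i in S} x_i ] with S = first k bits
    coef = sum(maj(x) * prod(x[:k]) for x in product((-1, 1), repeat=n)) / 2**n
    print(f"k={k}: coefficient on a degree-{k} parity = {coef:+.6f}")
```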
There was also an interesting paper about the limitations of gradient optimization, which I can’t find now. One of the examples there was a task that is basically an XOR of k simple image-classification problems (something about detecting the angle of a line in an image). They show that without an intermediate target (i.e. a target for each single classification task rather than only the XOR of all of them) learning stalls around k=4, while with intermediate targets it can go up to k=20. Which is to say that even when there is compact structure in the problem, neural networks (and maybe any gradient-based models) will not always be able to find it quickly.
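I don’t remember the exact setup, but here is a hedged toy reconstruction of the intermediate-target idea in PyTorch (the subtasks below are random linear classifications rather than the paper’s image tasks, and all names are my own): the final label is the XOR of k subtask labels, and `use_intermediate` toggles whether each subtask head also gets its own supervision.

```python
import torch
import torch.nn as nn

k, d = 4, 16                      # k subtasks, each with a d-dim input block
torch.manual_seed(0)

subnets = nn.ModuleList(nn.Sequential(nn.Linear(d, 32), nn.ReLU(),
                                      nn.Linear(32, 1)) for _ in range(k))
opt = torch.optim.Adam(subnets.parameters(), lr=1e-3)

# Toy subtasks: the label of block i is the sign of a fixed random direction w_i.
w = torch.randn(k, d)

def batch(bs=256):
    x = torch.randn(bs, k, d)
    y_sub = (torch.einsum('bkd,kd->bk', x, w) > 0).float()  # per-subtask labels
    y = y_sub.sum(dim=1).remainder(2)                        # XOR of subtask labels
    return x, y_sub, y

def train_step(use_intermediate):
    x, y_sub, y = batch()
    logits = torch.cat([subnets[i](x[:, i]) for i in range(k)], dim=1)
    p_sub = torch.sigmoid(logits)                 # P(subtask label = 1)
    # P(XOR = 1) via the standard product formula for independent bits
    p_xor = 0.5 - 0.5 * (1 - 2 * p_sub).prod(dim=1)
    loss = nn.functional.binary_cross_entropy(p_xor.clamp(1e-6, 1 - 1e-6), y)
    if use_intermediate:  # extra supervision on each subtask head
        loss = loss + nn.functional.binary_cross_entropy(p_sub, y_sub)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

for step in range(2000):
    train_step(use_intermediate=True)   # flip to False to compare end-to-end
```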
Thank you for bringing this up; the story of learning dynamics is a fascinating one, and something I didn’t get the time to treat properly in the post. There really is (at least in theory) quite a substantial gap between Bayesian learning (what Solomonoff induction uses) and SGD learning.
A favorite example of mine is the task of learning pseudorandom functions: these cannot be learned in polynomial time, since otherwise you could distinguish pseudorandom numbers from truly random numbers and break cryptography. Because SGD training runs in polynomial time, it is therefore impossible for SGD training to find a pseudorandom function easily. Bayesian learning (which does not run in polynomial time), on the other hand, would find the pseudorandom function quite quickly, because you can literally hardcode a pseudorandom generator into the weights of a sufficiently large neural network (Bayes is sort of “omnipotent” over the entire parameter space).
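To make the task concrete, here is a minimal sketch using HMAC-SHA256 as a stand-in PRF (the argument only needs some pseudorandom function; this particular choice is mine). Any polynomial-time learner that predicted these labels noticeably better than chance would double as a distinguisher between the PRF and a truly random function.

```python
import hmac, hashlib, os

key = os.urandom(16)  # the hidden "weights" that Bayes could hardcode

def label(x: bytes) -> int:
    """Label = first bit of PRF_key(x), with HMAC-SHA256 as the stand-in PRF."""
    return hmac.new(key, x, hashlib.sha256).digest()[0] & 1

# A training set for the learner: inputs are arbitrary, and the labels look
# random to anyone who does not know `key`.
data = [(x, label(x)) for x in (os.urandom(8) for _ in range(1000))]
```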
As you mention, learning parities is another great example of a task that is easy for Bayes but hard for SGD (though the reason is different from the pseudorandom-function case). There is a rich literature here, including the paper you linked by Barak, which is a favorite of mine. Another personal favorite is the later paper on leap complexity, which shows how the structure of these computational obstacles can give SGD learning a “hysteresis” or history-dependence effect: learning earlier things shapes what you learn later, something that doesn’t happen at all for Bayes.
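For a sense of what that history dependence looks like, here is a tiny illustration (my own, details simplified) of the staircase-style targets studied in that line of work: each monomial extends the previous one by a single coordinate, so SGD can climb the staircase term by term, whereas the top monomial on its own would be a hard degree-4 parity.

```python
from math import prod

def staircase(x):
    """f(x) = x1 + x1 x2 + x1 x2 x3 + x1 x2 x3 x4, with x in {-1, 1}^4.
    Learning the early low-degree terms is what later makes the
    high-degree terms reachable."""
    return sum(prod(x[:i]) for i in range(1, 5))
```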