Thank you for bringing this up. The story of learning dynamics is a fascinating one, and something I didn’t get the time to treat properly in the post. There really is (at least in theory) quite a substantial gap between Bayesian learning (what Solomonoff induction uses) and SGD learning.
A favorite example of mine is the task of learning pseudorandom functions: these cannot be learned in polynomial time, or else you could distinguish the outputs of a pseudorandom function from truly random outputs and break the underlying cryptography. Because SGD training runs in polynomial time, it is therefore impossible for SGD training to learn a pseudorandom function efficiently. Bayesian learning (which does not run in polynomial time), on the other hand, would find the pseudorandom function quite quickly, because you can literally hardcode a pseudorandom generator into the weights of a sufficiently large neural network (Bayes is sort of “omnipotent” over the entire parameter space).
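If a concrete picture helps, here is a minimal sketch of what I mean (my own illustration; the input width, key, architecture, and learning rate are all just illustrative choices). It uses keyed SHA-256 as a stand-in for a pseudorandom function and trains a small MLP on it with SGD; test accuracy is expected to sit at chance no matter how long you train:

```python
# Sketch only: SGD on a cryptographic stand-in for a pseudorandom function.
# The labels come from SHA-256 with a secret key, so a polynomial-time learner
# that generalized here would be distinguishing the PRF from random.
import hashlib
import torch
import torch.nn as nn

torch.manual_seed(0)
N_BITS, KEY = 32, b"secret-key"  # illustrative input width and secret key

def prf_label(bits: torch.Tensor) -> float:
    """First output bit of SHA-256(key || input bits) -- the PRF stand-in."""
    msg = KEY + bytes(int(b) for b in bits.tolist())
    return float(hashlib.sha256(msg).digest()[0] & 1)

def make_split(n: int):
    x = torch.randint(0, 2, (n, N_BITS)).float()
    y = torch.tensor([prf_label(row) for row in x])
    return x, y

x_train, y_train = make_split(20_000)
x_test, y_test = make_split(2_000)

model = nn.Sequential(nn.Linear(N_BITS, 512), nn.ReLU(), nn.Linear(512, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.BCEWithLogitsLoss()

for epoch in range(200):
    for i in range(0, len(x_train), 256):
        xb, yb = x_train[i:i + 256], y_train[i:i + 256]
        opt.zero_grad()
        loss_fn(model(xb).squeeze(-1), yb).backward()
        opt.step()

with torch.no_grad():
    acc = ((model(x_test).squeeze(-1) > 0) == y_test.bool()).float().mean()
print(f"test accuracy: {acc:.3f}")  # expected to hover around 0.5 (chance)
```

(The network can partially memorize the training set, but generalization to held-out inputs is what the hardness argument rules out.)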
As you mention, learning parities is another great example of a task that is easy for Bayes but hard for SGD (though the reason is different from the pseudorandom-function case). There is a rich literature here, including the paper you linked by Barak et al., which is a favorite of mine. Another personal favorite is the later paper on leap complexity, which shows how the structure of these computational obstacles can give SGD learning a “hysteresis” or “history-dependent” effect: learning earlier things shapes what you learn later, something that doesn’t happen at all for Bayes.
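For the parity story, here is a similarly hedged sketch of the k-sparse parity setup (the specific n, k, network size, and learning rate are illustrative choices of mine, not taken from either paper). Run online with fresh samples each step, accuracy typically sits at a chance-level plateau for a long stretch before the relevant coordinates are found:

```python
# Sketch only: online SGD on a k-sparse parity over n bits.
# The label is the product of k hidden +/-1 coordinates; the interesting
# phenomenon is the long chance-level plateau before learning kicks in.
import torch
import torch.nn as nn

torch.manual_seed(0)
n, k, BATCH = 30, 3, 256
relevant = torch.randperm(n)[:k]  # hidden subset defining the parity

def sample(batch: int):
    x = torch.randint(0, 2, (batch, n)).float() * 2 - 1  # +/-1 inputs
    y = x[:, relevant].prod(dim=1)                        # k-sparse parity in {-1,+1}
    return x, (y + 1) / 2                                 # labels in {0,1}

model = nn.Sequential(nn.Linear(n, 256), nn.ReLU(), nn.Linear(256, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(50_000):
    x, y = sample(BATCH)
    opt.zero_grad()
    loss_fn(model(x).squeeze(-1), y).backward()
    opt.step()
    if step % 5_000 == 0:
        with torch.no_grad():
            xt, yt = sample(4_096)
            acc = ((model(xt).squeeze(-1) > 0) == yt.bool()).float().mean()
        print(f"step {step:6d}  accuracy {acc:.3f}")  # long plateau near 0.5
```

A Bayesian learner with a prior over which k coordinates are relevant has no such plateau, which is the gap the leap-complexity work formalizes.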