But in practice, SGD is extremely good at optimizing NNs, and the local optima issue isn’t a huge problem.
That’s not even true. In practice it’s the best we’ve got, but it’s still terrible in most interesting settings (or else you could solve NP-hard problems in practice by encoding them as loss functions, which you can’t).
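To make the local-optima point concrete, here is a toy sketch (my own example, not from the discussion): plain gradient descent on a non-convex 1-D loss converges to whichever basin its starting point sits in, with no guarantee of global optimality.

```python
def f(x):
    # Double-well loss: global minimum near x ~ -1.04, local minimum near x ~ 0.96.
    return (x * x - 1) ** 2 + 0.3 * x

def grad_f(x):
    return 4 * x * (x * x - 1) + 0.3

def gradient_descent(x, lr=0.01, steps=2000):
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

x_bad = gradient_descent(0.8)    # starts in the shallow basin
x_good = gradient_descent(-0.8)  # starts in the deep basin

print(x_bad, f(x_bad))    # stuck near x ~ 0.96, loss ~ 0.29
print(x_good, f(x_good))  # near x ~ -1.04, loss ~ -0.31
assert f(x_bad) > f(x_good)  # same algorithm, strictly worse solution
```

The same failure mode scales up: nothing in the update rule knows a better basin exists elsewhere.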
As to why we can have decent machine learning and not AGI, I don’t know.
It’s because the neural net algorithms are not even close to finding the optimal neural net in complex situations.
Approximating SI (Solomonoff induction) isn’t sufficient, for one; you also need to act on the models you find.
That’s trivial to do. It’s not the problem here.
Everything approximates Bayesian inference, it’s just a matter of how ideal the approximation is.
This might be true in some sense, but not in a meaningful one. PAC learning, for instance, is fundamentally non-Bayesian: it gives distribution-free, worst-case guarantees and involves no prior at all. Saying that PAC learning approximates Bayesian inference is the same as saying that Bayesian inference approximates PAC learning. It’s not a very meaningful statement.
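For concreteness, here is the standard sample-complexity bound for a finite hypothesis class in the realizable PAC setting (a textbook result, not specific to this thread): m ≥ (1/ε)(ln|H| + ln(1/δ)) samples suffice, and no prior over hypotheses appears anywhere in the statement.

```python
import math

def pac_sample_bound(hypothesis_count, epsilon, delta):
    """Samples sufficient so that, with probability >= 1 - delta, every
    hypothesis consistent with the data has true error <= epsilon.
    Standard bound for a finite class in the realizable PAC setting;
    note that no prior over hypotheses is involved."""
    return math.ceil((math.log(hypothesis_count) + math.log(1 / delta)) / epsilon)

# e.g. 2**20 hypotheses, 5% error tolerance, 99% confidence:
print(pac_sample_bound(2 ** 20, 0.05, 0.01))  # 370
```

The guarantee holds for whatever distribution generated the data, which is exactly the sense in which the framework is non-Bayesian.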
People on LW tend to be hard-core Bayesians who have never even heard of PAC learning, which is an entire branch of learning theory. I find it rather strange.
People here seem to really like Solomonoff induction, but I don’t think it’s all that relevant to learning in practice due to computational complexity.
Solomonoff induction is not computable. Trying to “approximate” it, in the sense of producing hypotheses close to the ones it would find, is probably not computable either.
If you replace Solomonoff induction with induction over programs that halt quickly, or with induction over boolean circuits, it becomes computable, but it is still NP-hard. Again, approximating this is probably also NP-hard, depending on your definition of approximation.
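A hedged sketch of what computable, time-bounded induction over programs looks like (the instruction set is my own toy choice, purely illustrative): enumerate all op-sequences in order of length and return the shortest one consistent with the data. Every program halts quickly, so this is computable, but the search examines 3^k candidates at length k, which is why the exact problem stays intractable.

```python
from itertools import product

# Toy program space: a "program" is a sequence of unary ops applied
# left to right. All programs halt quickly, so induction over this
# space is computable -- but brute force still costs 3**k at length k.
OPS = {
    "inc": lambda x: x + 1,
    "dbl": lambda x: 2 * x,
    "sq": lambda x: x * x,
}

def run(program, x):
    for op in program:
        x = OPS[op](x)
    return x

def induce(examples, max_len=6):
    """Return the shortest program consistent with all (input, output) pairs."""
    for k in range(1, max_len + 1):
        for program in product(OPS, repeat=k):  # 3**k candidates
            if all(run(program, x) == y for x, y in examples):
                return program
    return None

# Data generated by f(x) = 2*(x + 1):
print(induce([(1, 4), (2, 6), (3, 8)]))  # ('inc', 'dbl')
```

Shortest-first enumeration is the Occam-style part; the exponential candidate count is the part that NP-hardness says we shouldn't expect to eliminate in the worst case.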
Next, if you replace boolean circuits with neural nets, it is still hard to find the best neural net to fit the data. MCMC and gradient descent only find local optima. I mean, the fact that neural nets didn’t give us strong AI back in the 70s demonstrates that they are not doing anything close to Solomonoff induction.
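One deliberately rigged illustration of gradient descent's local behavior (my construction, not anyone's real training setup): a 2-hidden-unit tanh network on XOR with both hidden units initialized identically. Full-batch gradient descent then applies the same update to both units at every step, so they remain clones forever; the net is effectively a single monotone ridge unit, which provably cannot represent XOR, and the loss never reaches zero.

```python
import math

# Rigged demo: a 2-2-1 tanh net on XOR whose two hidden units start
# identical. Symmetric initial weights get symmetric gradients, so
# full-batch gradient descent never breaks the tie.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
Y = [0, 1, 1, 0]

w = [[0.5, 0.5], [0.5, 0.5]]  # hidden weights: unit 0 == unit 1
b = [0.5, 0.5]                # hidden biases
v = [0.5, 0.5]                # output weights
c = 0.5                       # output bias

lr = 0.1
for _ in range(2000):
    gw, gb, gv, gc = [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0], [0.0, 0.0], 0.0
    loss = 0.0
    for (x0, x1), y in zip(X, Y):
        h = [math.tanh(w[j][0] * x0 + w[j][1] * x1 + b[j]) for j in range(2)]
        out = v[0] * h[0] + v[1] * h[1] + c
        err = out - y
        loss += err * err / len(X)
        d = 2 * err / len(X)
        gc += d
        for j in range(2):
            gv[j] += d * h[j]
            da = d * v[j] * (1 - h[j] * h[j])  # backprop through tanh
            gw[j][0] += da * x0
            gw[j][1] += da * x1
            gb[j] += da
    for j in range(2):
        w[j][0] -= lr * gw[j][0]
        w[j][1] -= lr * gw[j][1]
        b[j] -= lr * gb[j]
        v[j] -= lr * gv[j]
    c -= lr * gc

print(w[0] == w[1], b[0] == b[1])  # True True: the symmetry never breaks
print(round(loss, 3))              # stuck well above zero
```

A global-optimum finder would just pick asymmetric weights and drive the loss to zero; gradient descent from this start never can.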
It’s not even clear that a learning program must approximate Bayesian inference. There are things like PAC learning that don’t do that at all.