It’s not relevant to predictions about how well the learning algorithm generalises. And that’s the vastly more important factor for general capabilities.
Quite tangential to your point, but the problem with the universal approximation theorem is not just “it doesn’t address generalization”; it doesn’t even fulfill its stated purpose: it doesn’t answer the question of why neural networks can space-efficiently approximate real-world functions, even with arbitrarily many training samples. The statement “given arbitrary resources, a neural network can approximate any function” is actually kind of trivial: it’s true not only of polynomials, sinusoids, etc., but even of a literal interpolated lookup table (if you have an astronomical space budget). It turns out the universal approximation theorem requires exponentially many neurons (in the input dimension) to work, which is far too many to be practical; in fact, that’s the same amount of resources a lookup table would cost. This is fine if you want to approximate a 2D function or something, but it goes nowhere toward explaining why, like, even a space-efficient MNIST classifier is possible. The interesting question is: why can neural networks efficiently approximate the functions we see in practice?
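To make the “exponential in the input dimension” point concrete, here’s a back-of-the-envelope sketch (the grid resolution and dimensions are my own illustrative assumptions, not numbers from the argument above): a lookup table that grids each input dimension into k bins needs k^d entries, and the classic universal-approximation constructions live in the same exponential regime.

```python
# Illustrative sketch: cost of an interpolated lookup table that grids each
# input dimension into k bins. The k ** d entry count is exponential in the
# input dimension d.
def grid_lookup_table_size(d, k=10):
    return k ** d

print(f"d=2:   {grid_lookup_table_size(2):,} entries")  # 100: fine for a 2D function
# For MNIST-sized inputs (784 dimensions), the table is astronomically large.
print(f"d=784: a number with {len(str(grid_lookup_table_size(784)))} digits")
```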
(It’s a bit out of scope to fully dig into this, but I think a more sensible answer is something in the direction of “well, anything you can do efficiently on a computer, you can do efficiently in a neural network”—i.e. you can always encode polynomial-size Boolean circuits into a polynomial-size neural network. Though there are some subtleties here that make this a little more complicated than that.)
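A minimal sketch of the “circuits embed into networks” direction (my own illustration, not a construction from the comment above): with ReLU activations and {0,1} inputs, each Boolean gate can be computed exactly by a constant number of neurons, so an n-gate circuit becomes an O(n)-neuron network.

```python
# Sketch: Boolean gates as small ReLU sub-networks on {0, 1}-valued inputs.
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def AND(x1, x2):
    # One ReLU neuron: outputs 1 only when both inputs are 1.
    return relu(x1 + x2 - 1.0)

def OR(x1, x2):
    # Two ReLU neurons computing min(x1 + x2, 1).
    return relu(x1 + x2) - relu(x1 + x2 - 1.0)

def NOT(x1):
    # Negation on {0, 1} inputs is just an affine map.
    return 1.0 - x1

# Example: XOR assembled from the gates above, i.e. a small two-layer ReLU network.
def XOR(x1, x2):
    return AND(OR(x1, x2), NOT(AND(x1, x2)))

for a in (0.0, 1.0):
    for b in (0.0, 1.0):
        print(int(a), int(b), "->", int(XOR(a, b)))
```

The subtleties mentioned above (e.g. handling real-valued rather than Boolean inputs, and what “efficiently” means for the weights) are exactly the parts this toy sketch glosses over.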