It wasn’t obvious beforehand that this would work! You’d expect that if your function approximator has more parameters than you have example input–output pairs, it would overfit, implementing a complicated function that reproduced the example input–output pairs but output crazy nonsense for other choices of \(x\): the more expressive function approximator proving useless for lack of evidence to pin down the correct approximation.
And that is what we see for function approximators with only slightly more parameters than example input–output pairs, but for sufficiently large function approximators, the trend reverses and “generalization” improves—the more expressive function approximator proving useful after all, as it admits algorithmically simpler functions that fit the example pairs.
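Here’s a minimal sketch of that reversal (the “double descent” curve) in a setting where it’s easy to reproduce: minimum-norm least squares on random ReLU features, sweeping the parameter count through the interpolation threshold. The sizes, seed, and noise level are illustrative choices of mine, not from any particular experiment; test error typically peaks when the feature count is near the number of training points and falls again well past it.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 40, 500, 20

# Ground truth: a linear signal, with Gaussian noise on the training labels.
beta = rng.normal(size=d)
X_train = rng.normal(size=(n_train, d))
X_test = rng.normal(size=(n_test, d))
y_train = X_train @ beta + 0.5 * rng.normal(size=n_train)
y_test = X_test @ beta

for n_features in [10, 20, 40, 80, 320, 1280]:
    # Random ReLU features, fit by minimum-norm least squares (pinv),
    # which is the solution gradient descent from zero converges to.
    W = rng.normal(size=(d, n_features)) / np.sqrt(d)
    phi_train = np.maximum(X_train @ W, 0.0)
    phi_test = np.maximum(X_test @ W, 0.0)
    coef = np.linalg.pinv(phi_train) @ y_train
    mse = np.mean((phi_test @ coef - y_test) ** 2)
    print(f"{n_features:5d} features: test MSE {mse:8.3f}")
```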
One way I think of this that makes it more intuitive:
You can understand mean squared error as implying that your regression equation has a Gaussian-distributed error term \(\varepsilon\):
\[Y = \beta X + \varepsilon\]
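Concretely, minimizing mean squared error is just maximum likelihood under that Gaussian assumption: the negative log-likelihood of the data is

\[-\log p(y \mid x, \beta) = \frac{n}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - \beta x_i\right)^2,\]

which, for fixed \(\sigma\), is minimized exactly where the mean squared error is.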
In simple statistical models, this \(\varepsilon\) latent is not explicitly estimated, but instead eliminated by pumping up the sample size so much that it can be averaged away. This only works when you have far fewer parameters than data points.
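A toy illustration of that regime (my numbers): with a single parameter and growing \(n\), OLS recovers \(\beta\) even though every individual observation carries a large \(\varepsilon\).

```python
import numpy as np

rng = np.random.default_rng(1)
beta = 2.0
for n in [10, 100, 10_000]:
    x = rng.normal(size=n)
    y = beta * x + rng.normal(size=n)   # Y = beta*X + epsilon; epsilon never observed
    beta_hat = (x @ y) / (x @ x)        # one-parameter OLS: the noise averages away
    print(f"n={n:6d}: beta_hat = {beta_hat:.3f}")
```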
Overparameterized models, on the other hand, use the “texture” of each datapoint (rare, “high-frequency” “noise” characteristics that let you pick it out relatively uniquely from the dataset) to identify it, and shove the \(\varepsilon\) term into that “texture”. (You can see this explicitly in 1D regression problems, where overparameterized models show a sharp deviation from the trend line at each data point.)
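A stylized 1D version of this (my construction, so take the details as illustrative): minimum-norm regression on one cheap global feature plus many narrow Gaussian bumps. The trend is cheapest to express through the global feature, while each datapoint’s \(\varepsilon\) gets absorbed by its own near-private bumps, producing exactly those localized spikes at the training points.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 15
x = np.sort(rng.uniform(-3, 3, size=n))
y = x + rng.normal(scale=0.5, size=n)   # linear trend plus epsilon

centers = np.linspace(-3, 3, 400)       # far more features than datapoints

def phi(t, width=0.05):
    # One global linear feature plus narrow Gaussian bumps; each bump can
    # act as a near-private "texture" feature for a single datapoint.
    bumps = np.exp(-(t[:, None] - centers[None, :]) ** 2 / (2 * width ** 2))
    return np.concatenate([t[:, None], bumps], axis=1)

coef = np.linalg.pinv(phi(x)) @ y       # minimum-norm interpolating fit
grid = np.linspace(-3, 3, 1000)
pred = phi(grid) @ coef                 # plot (grid, pred) to see the spikes

# The fit passes through every noisy point but tracks roughly the bare
# trend between them: each point's epsilon lives in a localized spike.
print("max |train residual|:", np.abs(phi(x) @ coef - y).max())
```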
I think this works under quite general conditions. I vaguely remember a paper showing that if you train an NN on random labels, it “learns to memorize” in some sense (I can’t remember what sense exactly, but I think it was something like: it could subsequently be trained more quickly on a different random labelling). I suspect this is because it learns to emphasize the “textures” of the datapoints in the training set, finding features that more uniquely distinguish datapoints, which carries over to future labellings.
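I can’t vouch for that paper’s exact protocol, but a toy version of the comparison might look like this (the architecture, sizes, and the “memorize one random labelling, then time the fit to a second” framing are my reconstruction, not the paper’s):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d, classes = 200, 50, 10
X = torch.randn(n, d)

def make_mlp():
    # Small but overparameterized MLP: ~30k parameters for 200 datapoints.
    return nn.Sequential(nn.Linear(d, 512), nn.ReLU(), nn.Linear(512, classes))

def steps_to_fit(model, y, max_steps=5000):
    # Full-batch training until the labels are memorized exactly.
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for step in range(max_steps):
        logits = model(X)
        if (logits.argmax(dim=1) == y).all():
            return step
        opt.zero_grad()
        loss_fn(logits, y).backward()
        opt.step()
    return max_steps

pretrained = make_mlp()
steps_to_fit(pretrained, torch.randint(0, classes, (n,)))   # memorize labelling #1

y_new = torch.randint(0, classes, (n,))                     # a fresh random labelling
print("from scratch:         ", steps_to_fit(make_mlp(), y_new))
print("after memorizing once:", steps_to_fit(pretrained, y_new))
```

If the “texture” story is right, the pretrained network should need noticeably fewer steps on the second labelling.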
Real NNs obviously also learn real structure, not just \(\varepsilon\), but I assume they do some mixture where at least part of what they learn is just arbitrary features that distinguish unique datapoints.