I continue to think ‘neural nets just interpolate’ is a bad criticism. Taken literally, it’s obviously not true: nets are not exposed to anywhere near enough data points to interpolate their input space. On the other hand, if you think they are instead ‘interpolating’ in some implicit, higher-dimensional space which they project the data into, it’s not clear that this limits them in any meaningful way. This is especially true if the mapping to the implicit space is itself learned, as seems to be the case in neural networks.
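To make that concrete, here is a minimal sketch (my own toy example, plain numpy, XOR as the task; nothing here is from the original discussion) of what ‘interpolating’ in a learned space can mean: a tiny two-layer net learns a feature map, and a purely linear readout in that feature space fits data that no linear function of the raw inputs can fit.

```python
# Toy sketch: 'interpolation' in a learned feature space.
# A two-layer tanh net learns phi(x); the output is linear in phi(x),
# even though XOR is not linear in the raw 2-D inputs.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 0.])  # XOR targets

W1 = rng.normal(size=(2, 8))   # learned map into an 8-D feature space
b1 = np.zeros(8)
w2 = rng.normal(size=8)        # linear readout in feature space
b2 = 0.0
lr = 0.5

for _ in range(5000):          # full-batch gradient descent on squared error
    h = np.tanh(X @ W1 + b1)   # phi(x): the learned features
    err = (h @ w2 + b2) - y
    dh = np.outer(err, w2) * (1 - h ** 2)
    W1 -= lr * X.T @ dh / len(X)
    b1 -= lr * dh.mean(axis=0)
    w2 -= lr * h.T @ err / len(X)
    b2 -= lr * err.mean()

print(np.round(np.tanh(X @ W1 + b1) @ w2 + b2, 2))  # roughly [0, 1, 1, 0]
```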
Regarding the ‘Rashomon effect’, I think it’s clear that neural nets have some way of selecting relatively low-complexity models: there are also infinitely many possible models with good performance on the training set but terrible performance on the test set, yet the models actually learned reliably perform well on the test set. Exactly how they do this is uncertain; other commenters have already pointed out that regularization is important, but the intrinsic properties of SGD and of the parameter-function mapping likely also play a key role. It’s an ongoing area of research.
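One well-understood toy instance of that kind of implicit selection (a linear example I’m adding for illustration, not the full neural-net story): in overparameterized linear regression there are infinitely many weight vectors that fit the training data exactly, yet plain gradient descent started from zero converges to the minimum-norm one.

```python
# Sketch: implicit bias of gradient descent in overparameterized least squares.
# Infinitely many w satisfy X @ w = y when d > n, but GD from zero init
# converges to the minimum-norm interpolating solution (the pseudoinverse one).
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                         # more parameters than data points
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w = np.zeros(d)
lr = 0.01
for _ in range(5000):
    w -= lr * X.T @ (X @ w - y) / n    # gradient of mean squared error

w_min_norm = np.linalg.pinv(X) @ y     # minimum-norm solution for comparison

print(np.abs(X @ w - y).max())         # ~0: the training data are fit exactly
print(np.linalg.norm(w - w_min_norm))  # ~0: GD picked the minimum-norm fit
```

Which notion of ‘low complexity’ actually operates in a deep net is much less clear, but this is the flavour of result people have in mind.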
It used to be thought that SGD sought out “flat minima” in the loss (minima with low curvature) which result in simpler models in terms of how compressible they are, but further studies have shown this isn’t really true.[11]
The paper you cited does not show this. Instead, its authors construct some (rather unnatural) nets that sit at sharp minima yet have good generalization properties. This is completely consistent with flat minima having good generalization properties, and with SGD seeking out flat minima.
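For anyone less familiar with the terminology, ‘flat’ vs. ‘sharp’ here refers to the curvature of the loss around a minimum. A minimal illustrative sketch (a made-up 1-D loss with one flat and one sharp global minimum; this is not the cited paper’s construction):

```python
# Sketch: quantifying 'flatness' as the curvature of the loss at a minimum,
# estimated with a central finite-difference second derivative.
import numpy as np

def loss(w):
    # Two global minima with loss 0: a flat basin at w = -2, a sharp one at w = +2.
    return np.minimum(0.1 * (w + 2.0) ** 2, 10.0 * (w - 2.0) ** 2)

def curvature(w, eps=1e-3):
    return (loss(w + eps) - 2 * loss(w) + loss(w - eps)) / eps ** 2

print(curvature(-2.0))  # ~0.2  -> flat minimum (low curvature)
print(curvature(+2.0))  # ~20   -> sharp minimum (high curvature)
```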
Yeah, you’re right I was being sloppy. I just crossed it out.