Thanks for the very comprehensive review on generalisation!
As a historical note / broader context, the worry about model class over-expressivity has been around since the early days of machine learning. There was a mistrust of large blackbox models like random forests and SVMs and their unusually low test or even cross-validation loss, citing the models' ability to fit noise. Breiman's frank 2001 commentary, “Statistical Modeling: The Two Cultures”, touches on this among other worries about ML models. The success of ML has turned this worry into the generalisation puzzle, with Zhang et al. (2017) being a call to arms once DL greatly exacerbated the scale and urgency of the problem.
I agree with the post that approximation, generalisation, and optimisation are very much entangled in this puzzle. The question should be why a model + training algorithm combo is likely to select generalising solutions among the many other solutions with a good training fit, and not just whether the best-fit model generalises, whether a particular optimisation algorithm can find it, or how simple the model class is.
SLT gives the answer that models with very singular solutions + Bayesian training will generalise very well. This still leaves open the questions of why DL models on “natural” datasets have very singular solutions, and how much the Bayesian results tell us about DL models with SGD-type training.
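To make the SLT claim concrete, the headline asymptotics I have in mind are Watanabe's; the notation below is the standard one from that literature, my paraphrase rather than anything from the post:

```latex
% Watanabe's headline asymptotics (standard SLT notation):
%   G_n     = Bayes generalisation error after n samples
%   F_n     = Bayes free energy, L_n(w_0) = training loss at an optimal parameter w_0
%   \lambda = real log canonical threshold (RLCT) of the model at that solution
\[
  \mathbb{E}[G_n] = \frac{\lambda}{n} + o\!\left(\frac{1}{n}\right),
  \qquad
  F_n = n\,L_n(w_0) + \lambda \log n + O(\log\log n).
\]
% For regular models \lambda = d/2 (d = number of parameters); "very singular" solutions have
% much smaller \lambda, which is the sense in which they generalise better under Bayesian training.
```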
Naive optimism: hopefully progress towards a strong resolution of the generalisation puzzle gives us enough understanding to gain some control over what kinds of solutions are learned. And one day we can ask for more than generalisation, like “generalise and be safe”.
> As a historical note / broader context, the worry about model class over-expressivity has been around since the early days of machine learning. There was a mistrust of large blackbox models like random forests and SVMs and their unusually low test or even cross-validation loss, citing the models' ability to fit noise. Breiman's frank 2001 commentary, “Statistical Modeling: The Two Cultures”, touches on this among other worries about ML models. The success of ML has turned this worry into the generalisation puzzle, with Zhang et al. (2017) being a call to arms once DL greatly exacerbated the scale and urgency of the problem.
Yeah, it surprises me that Zhang et al. (2017) has had the impact it did when, as you point out, the ideas have been around for so long. Deep learning theorists like Telgarsky point to it as a clear turning point.
> Naive optimism: hopefully progress towards a strong resolution of the generalisation puzzle gives us enough understanding to gain some control over what kinds of solutions are learned. And one day we can ask for more than generalisation, like “generalise and be safe”.

This I can stand behind.
Coming back years later to say: people in 2016 (when the Zhang et al. paper was first released) already knew that neural networks were expressive (the work demonstrating neural networks with very high VC dimension dates from the late 90s and early 2000s).
The hope at the time was not that neural networks themselves lacked expressivity, but that some combination of neural networks + SGD, or neural networks + weight decay, or something else people were doing on top of neural networks, induced a strong prior against being able to fit random data points. The other important bit of context is that (as this review post demonstrates) a lot of work at the time was interested in constructing uniform bounds that held regardless of what the learned hypothesis was, as long as the hypothesis was representable by the neural network + SGD / weight decay / smooth data manifold class.
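To be concrete about what "uniform" means here, a representative textbook form of such a bound (the Rademacher-complexity version, in standard notation rather than anything specific to this post):

```latex
% Uniform convergence via Rademacher complexity, for a loss taking values in [0, 1]:
% with probability at least 1 - \delta over an i.i.d. sample of size n, simultaneously for every h in H,
\[
  L(h) \le \hat{L}_n(h) + 2\,\mathcal{R}_n(\mathcal{H}) + \sqrt{\frac{\log(1/\delta)}{2n}},
\]
% where \mathcal{R}_n(\mathcal{H}) is the Rademacher complexity of the induced loss class.
% Because the inequality holds for all h at once, it constrains whatever hypothesis training returns;
% but if the class is rich enough to fit random labels, \mathcal{R}_n(\mathcal{H}) is near-maximal
% and the bound becomes vacuous.
```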
By demonstrating that standard neural network training could learn incredibly overfit hypotheses, Zhang et al. showed that whatever the class of hypotheses learnable by deep learning was, it included things for which you could not get useful uniform generalization bounds, thus invalidating the dominant approach in the field at the time.
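For readers who haven't seen the paper, the core experiment is cheap to reproduce in miniature. Here is a rough sketch, assuming PyTorch and a small synthetic dataset in place of the CIFAR-10 setup Zhang et al. actually used:

```python
# Miniature version of the Zhang et al. random-label experiment:
# a small MLP driven to near-zero training error on labels that are pure noise.
# (Synthetic Gaussian inputs stand in for CIFAR-10; this is an illustration, not their exact setup.)
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d, k = 512, 64, 10
X = torch.randn(n, d)              # random inputs
y = torch.randint(0, k, (n,))      # labels assigned uniformly at random

model = nn.Sequential(
    nn.Linear(d, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, k),
)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

# Full-batch SGD until the random labels are memorized.
for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()

train_acc = (model(X).argmax(dim=1) == y).float().mean().item()
print(f"final training loss {loss.item():.4f}, training accuracy {train_acc:.3f}")
# Training accuracy approaches 1.0 even though the labels carry no signal, so any bound that is
# uniform over everything this model class (plus SGD) can fit has to be vacuous.
```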