Coming back years later to say: people in 2016 (when the Zhang et al. paper was first released) already knew that neural networks were expressive (the work demonstrating neural networks with very high VC dimension dates to the late 90s and early 2000s).
The hope at the time was not that neural networks themselves lacked expressivity, but that some combination of neural networks with SGD, weight decay, or whatever else people layered on top of them induced a strong prior against fitting random labels. The other important bit of context is that (as this review post demonstrates) a lot of work at the time focused on constructing uniform bounds that held regardless of what the true hypothesis was, as long as it was representable by the neural network + SGD/weight decay/smooth data manifold class.
By demonstrating that standard neural network training could learn wildly overfit hypotheses (perfectly memorizing randomly labeled data), Zhang et al. showed that whatever the class of hypotheses learnable by deep learning is, it includes hypotheses for which no nontrivial uniform generalization bound is possible, thus invalidating the dominant approach in the field at the time.
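The random-label phenomenon is easy to reproduce at toy scale. The sketch below is not Zhang et al.'s actual setup (they trained large convnets on CIFAR-10/ImageNet with shuffled labels); it is a minimal numpy illustration of the same point: an over-parameterized one-hidden-layer MLP trained by ordinary full-batch gradient descent drives training error to zero on data whose labels are pure noise, so its training accuracy says nothing uniform about its generalization.

```python
import numpy as np

# Toy version of the random-label experiment: 32 random points in R^10
# with completely random binary labels, memorized by a 128-unit MLP.
rng = np.random.default_rng(0)
n, d, h = 32, 10, 128
X = rng.standard_normal((n, d))
y = rng.integers(0, 2, size=n).astype(float)  # labels carry no signal at all

# He-style initialization for a ReLU hidden layer, sigmoid output.
W1 = rng.standard_normal((d, h)) * np.sqrt(2.0 / d)
b1 = np.zeros(h)
W2 = rng.standard_normal((h, 1)) * np.sqrt(2.0 / h)
b2 = np.zeros(1)

lr = 0.5
for _ in range(5000):
    # forward pass: ReLU hidden layer, sigmoid output
    z = X @ W1 + b1
    a = np.maximum(z, 0.0)
    p = 1.0 / (1.0 + np.exp(-(a @ W2 + b2).ravel()))
    # backward pass: gradient of mean binary cross-entropy w.r.t. logits
    g = (p - y)[:, None] / n
    gW2, gb2 = a.T @ g, g.sum(0)
    gz = (g @ W2.T) * (z > 0)
    gW1, gb1 = X.T @ gz, gz.sum(0)
    # plain full-batch gradient descent step
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

train_acc = float(((p > 0.5) == (y > 0.5)).mean())
print(train_acc)
```

With 128 hidden units against 32 constraints the network is heavily over-parameterized, so gradient descent typically reaches perfect training accuracy on the noise labels; any bound that depended only on the hypothesis class and the training error would have to cover this memorizing solution too.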