The other paper that killed deep learning theory
Yesterday, I wrote about the state of deep learning theory circa 2016,[1] as well as the bombshell 2016 paper by Zhang et al. that arguably signaled its demise. Today, I cover the aftermath, and the 2019 paper that devastated deep learning theory again.
As a brief summary, I argued that the rise of deep learning posed an existential challenge to the dominant theoretical paradigm of statistical learning theory, because the hypothesis class of neural networks is enormous by classical complexity measures. The response from the field was to attempt to quantify other ways in which the hypothesis class of neural networks in practice was simple, using alternative metrics of complexity. Zhang et al. 2016 showed that standard neural network architectures trained with standard training methods could memorize large quantities of randomly labelled data, demonstrating that no such argument could explain the generalization properties of neural networks.
Today we’re going to look at the aftermath: how did the field of deep learning theory react to this paper? What were the attempts to get around this result using data-dependent generalization bounds? And why did Nagarajan and Kolter’s humbly titled Uniform convergence may be unable to explain generalization in deep learning serve as the proverbial final nail in the coffin of this line of work?
Let’s briefly return to what exactly the Zhang et al. paper showed. Yesterday, I wrote:
The authors’ results show that the same class of neural networks, trained with the same learning algorithm, can generalize when given true labels and memorize random ones. This shows that the hypothesis class of neural networks that are learnable with standard techniques cannot be simple in any useful sense, at least for complexity measures that depend only on properties of the hypothesis class and (data-independent) properties of the learning algorithm.
(emphasis added)
Notably, there was an important caveat to the results: what Zhang et al. showed was that there existed some datasets where neural networks could learn to overfit. This left open the possibility of finding data-dependent generalization bounds, based on properties of particular neural networks trained on particular datasets. For example, it remained a live possibility that a generalization bound could say, “if a trained neural network has small weight norm/is compressible/has large margin, and it was trained on sufficiently many data points, then its test error will be no more than epsilon higher than its training error.”
And that’s exactly what some of the researchers in this field did. Bartlett, Foster, and Telgarsky’s Spectrally-normalized margin bounds for neural networks (2017) introduced a notion of complexity based on the spectral norm and reference matrices (“spectral complexity”); they then used the spectral complexity and the margin of a learned neural network classifier to bound its generalization error. Concurrently, Neyshabur, Bhojanapalli, and Srebro’s A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks (2018) also used a combination of spectral norm and distance from initialization to argue for generalization, albeit in a Bayesian framework. Even Nagarajan and Kolter gestured toward their own data-dependent generalization bound in their 2017 workshop paper Generalization in Deep Networks: The Role of Distance from Initialization, where the complexity of a neural network is related to how far its weights have changed from the initialization.
Bartlett, Foster, and Telgarsky’s spectral-complexity and margin-based bound. This is not important for the piece, I’m just including it to show that there was serious math involved! Astute readers may notice that, assuming each entry in the dataset X is bounded, the scaling factor of error with respect to the number of data points n is still the classic 1/sqrt(n).
The common form of these bounds is that they bound the generalization gap in terms of some spectral-norm-derived complexity measure of the trained weights, divided by both the margin between the training data and the decision boundary and the square root of the number of datapoints. Neural networks trained on real labels tended to have lower spectral norms and larger margins between their decision boundaries and the training data. Thus, given enough data points, the bound would be small and the networks should generalize!
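The Bartlett–Foster–Telgarsky result, for instance, is roughly of the following shape (my paraphrase, with constants and log factors suppressed; here L-hat-gamma is the fraction of training points classified with margin below gamma, X is the matrix of training inputs, and R_A is their “spectral complexity” of the weight matrices A):

```latex
\Pr\left[\text{test misclassification}\right] \;\lesssim\; \hat{L}_\gamma \;+\; \tilde{O}\!\left(\frac{\lVert X \rVert_F \, R_{\mathcal{A}}}{\gamma\, n}\right) \;+\; \sqrt{\frac{\ln(1/\delta)}{n}}
```

Since the Frobenius norm of X grows like sqrt(n) when each entry is bounded, the middle term scales as 1/sqrt(n), matching the classic rate.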
An implicit assumption of all of these bounds is that (like most bounds from statistical learning theory) they demonstrate uniform convergence – they apply regardless of what the hypothesis is. More precisely, with high probability over possible training sets, the test error and train error are close for every hypothesis h in the hypothesis class simultaneously. Specifically, such a bound has the form:
With high probability over possible training sets S, for all h in the hypothesis class, we have |expected test error of hypothesis h − empirical error of h on S| ≤ (some bound involving the size of the training data and high-level properties of h).[2]
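Written out, such a two-sided uniform convergence bound has the schematic form (a generic template, not any specific theorem; C(h) stands for whatever data-dependent complexity measure the bound uses):

```latex
\Pr_{S \sim \mathcal{D}^n}\Big[\ \forall h \in \mathcal{H}:\ \big|L_{\mathcal{D}}(h) - \hat{L}_S(h)\big| \le \epsilon\big(n, C(h)\big)\ \Big] \;\ge\; 1 - \delta
```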
This type of bound is both very natural and very useful: a trained neural network is some h in the hypothesis class, so the bound immediately controls the gap between the training error we can measure and the test error we care about.
It is at this entire genre of bound that Nagarajan and Kolter take aim in their 2019 paper.
So, what did Nagarajan and Kolter’s 2019 paper show?
First, they show empirically that the bounds in the literature are not only vacuous, they scale in the wrong direction. Nagarajan and Kolter are able to easily train small, 5-layer neural networks on MNIST such that there’s a margin of at least 10 (in logit space) on 99% of the training data. As you might expect, in this setting, the test error of their trained networks decreases with the number of data points n, following an approximate power law.
The problem is that in this setting, the complexity measures proposed in the literature also follow power laws – in the wrong direction. Notably, the spectral norm of the learned weight matrices scales roughly linearly in n, and the distance from initialization likewise grows with n.
It doesn’t say so in the caption, but Nagarajan and Kolter’s figure 1 was a devastating rejoinder to the main approach taken in learning theory following Zhang et al. And this time, I’m certain this figure took fewer flops to produce than a single press of the integrated Claude button in the LW editor.
It’s worth emphasizing again how crushing this result is. Traditionally, in learning theory, the expectation is that the generalization error scales approximately as 1/sqrt(n): more data means better generalization. But if the complexity measure in a bound’s numerator itself grows polynomially with n, the bound can grow with the dataset size even as the actual test error falls. These bounds aren’t just loose; they get the direction of the data-dependence wrong.
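To make the arithmetic concrete: with a schematic bound of the form complexity / (margin · sqrt(n)), a complexity measure that grows linearly in n yields a bound that grows like sqrt(n). The constants below are made up for illustration; only the scaling matters.

```python
import math

def schematic_bound(n, complexity, margin=10.0):
    """Generic margin-bound shape: complexity / (margin * sqrt(n))."""
    return complexity / (margin * math.sqrt(n))

# Hypothetical complexity measure that grows linearly with dataset size,
# loosely modeled on what Nagarajan and Kolter observed for spectral norms.
complexity = lambda n: 0.5 * n

for n in [100, 400, 1600]:
    print(n, schematic_bound(n, complexity(n)))
# The "bound" doubles each time n quadruples -- it scales like sqrt(n),
# moving in the opposite direction from the actual test error.
```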
Second, they construct an overparameterized linear setting where (two-sided) uniform convergence bounds provably fail. The full construction is beyond the scope of this blog post, but I think the core idea is quite elegant; readers interested in the mathematical details are encouraged to read the paper themselves.
Consider fitting a linear classifier w on n training examples. The inputs to this classifier are K + D dimensional, where the first K dimensions contain a deterministic signal and the remaining D >> n dimensions are Gaussian noise, with zero mean and variance scaled to 1/D. After a single gradient step from zero, the classifier’s first K dimensions align with the signal, while its remaining D dimensions are the sum of the n label-weighted training noise vectors.
It’s easy to see why w would generalize: for each new datapoint, the signal in the first K dimensions will remain large (since it’s deterministic). In contrast, the D noise directions are sampled independently for each new datapoint, and the dot product of this fresh noise vector with the final D dimensions of our linear classifier concentrates around zero.[3]
To show that uniform convergence fails, the authors construct a natural “bad” dataset S′: take each training example, keep its label, and negate its noise components. Because the Gaussian noise distribution is symmetric, S′ is exactly as plausible a draw from the data distribution as S itself – yet the trained classifier misclassifies every point in it.
In more mathematical terms, our linear classifier trained on S has noise-block weights ≈ sum_i y_i z_i. For training point i, the noise dimensions contribute ≈ ||z_i||^2 ≈ 1 to its (correctly signed) score,[4] which dominates the small contribution from the signal dimensions. Negating the noise flips that contribution to ≈ −1, so the classifier gets every point of S′ wrong – even though it generalizes well to fresh data.
For interested readers, here’s the actual mathematical construction from Nagarajan and Kolter, as well as the statement of the formal theorem that gradient descent provably does well while any uniform convergence bound is provably vacuous.
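To see the mechanism concretely, here’s a small numerical sketch of my reading of the linear construction: one signal dimension plus D >> n Gaussian noise dimensions, a classifier obtained by a single gradient step from zero (w = sum_i y_i x_i), and a bad set S′ made by negating the noise components of the training points. All constants (n, D, signal strength) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, s = 50, 10_000, 0.1   # training points, noise dims, signal strength (illustrative)

# Each input is [signal, noise]: one deterministic signal coordinate y*s,
# plus D Gaussian noise coordinates with variance 1/D (so each noise vector has norm ~1).
y = rng.choice([-1.0, 1.0], size=n)
Z = rng.normal(0.0, np.sqrt(1.0 / D), size=(n, D))
X = np.concatenate([(y * s)[:, None], Z], axis=1)

# One gradient step from zero init on the linear loss -y*<w, x>: w = sum_i y_i x_i.
# The noise block of w is the sum of the label-weighted training noise vectors.
w = (y[:, None] * X).sum(axis=0)

def accuracy(inputs, labels):
    return float(np.mean(np.sign(inputs @ w) == labels))

# Fresh test points: same deterministic signal, independently drawn noise.
y_test = rng.choice([-1.0, 1.0], size=500)
Z_test = rng.normal(0.0, np.sqrt(1.0 / D), size=(500, D))
X_test = np.concatenate([(y_test * s)[:, None], Z_test], axis=1)

# "Bad" set S': the training points with their noise components negated.
# Under the symmetric Gaussian, S' is exactly as likely a draw as S.
X_bad = np.concatenate([(y * s)[:, None], -Z], axis=1)

print("train acc:", accuracy(X, y))            # memorized: 1.0
print("test acc: ", accuracy(X_test, y_test))  # signal dominates fresh noise: ~1.0
print("S' acc:   ", accuracy(X_bad, y))        # negated noise flips every point: 0.0
```

The classifier’s noise block memorizes sum_i y_i z_i, so each training point gets a noise “bonus” of ≈ ||z_i||^2 ≈ 1 that outweighs the signal term n·s^2 < 1; negating the noise turns the bonus into a penalty, while fresh test points only see an O(sqrt(n/D)) noise contribution.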
Third, they translate their theoretical overparameterized linear example into an empirical result on a shallow ReLU network. Nagarajan and Kolter construct a dataset where each data point lies on one of two 1000-dimensional hyperspheres, of radius 1 or radius 1.1.[7] They then train a 2-layer ReLU network with 100k hidden units on this dataset, until 99% of the training data is classified with margin 10. They find the standard result that the test error decreases with n, following an approximate power law.
They then construct a bad dataset S′ by taking their original training dataset S, projecting each data point onto the opposite hypersphere, and swapping its label. Their ReLU classifier consistently gets 0% accuracy on this projected dataset S′. This replicates the result from the overparameterized linear case, in a setting that is neither linear nor overparameterized in its inputs.
The authors examine 2D slices of the learned decision boundary to understand why the classifier simultaneously gets both high test accuracy and 0% accuracy on the projected dataset. Note that by construction, the true classification boundary lies between the two hyperspheres.
Near training data points, the decision boundary bulges away from the example, enough that the boundary actually encompasses both hyperspheres in the vicinity of the data point. This means that projecting a data point onto the other hypersphere doesn’t change the way the network classifies it. Since the true label is flipped by this projection, the network gets the projected data point wrong. As S’ projects all of the training datapoints to the opposite hypersphere, it makes sense that the network gets 0% accuracy on S’.
In contrast, when you look at random 2D slices, the decision boundary tends to lie between the two hyperspheres. The bulges induced by the training examples have “averaged out”. So for any new, randomly generated test datapoint, the network is likely to classify the example correctly.
Figure 2 from Nagarajan and Kolter 2019 shows two things. On the leftmost figure, they show that as the number of training datapoints increases, the test error decreases, but the error on the bad training dataset S’ remains constant at 1. In the two middle figures, they show that near training datapoints, the decision boundary bulges away from the train examples, leading to misclassifications in that area. However, in the rightmost figure, they show that near new, random datapoints, the decision boundary is likely to average out and lie between the two hyperspheres.
This leads the authors to speculate that this is a general phenomenon: SGD on neural networks learns classifiers that are simple on the macroscopic scale, but complex on the microscopic scale. The microscopic complexity is what stops uniform convergence bounds from working, while the macroscopic simplicity explains generalization. In the case of their ReLU example, the exact decision boundary bulges around the sampled training datapoints (microscopic complexity), leading to a vacuous uniform convergence bound; near new test datapoints, this complexity averages out (macroscopic simplicity), leading to good test performance.
The results from Nagarajan and Kolter are a devastating blow to post-Zhang et al. deep learning theory: not only did they demonstrate that the data-dependent bounds created by the field scaled in the wrong direction, they provided an overparameterized setting where the entire approach taken by statistical learning theory – uniform convergence bounds – provably did not work.
A more constructive way of thinking about Nagarajan and Kolter’s work is that it added further restrictions to what results could possibly explain neural network generalization. Namely, it showed that any such result needs to be algorithm- and data-dependent in a much stronger sense than the spectral-norm bounds. It needs to give up on bounding the worst-case generalization gap over all hypotheses in a class. And it needs to find some way to handle the complex microscopic structures induced by SGD on neural networks without losing sight of the macroscopic regularity that actually explains generalization.
But almost a decade later, we still don’t have those results. Maybe one of you reading this will write the paper that explains generalization. In the meantime, I’ll pull an academic move, and leave the task of coming up with better deep learning theory as an exercise to the reader.
- ^
To clarify, by “deep learning theory” I mean work “extending statistical learning theory to deep neural networks trained with SGD, in order to derive generalization bounds that would explain their behavior in practice” – that is, not any theoretical approach to deep learning, but specifically the attempts to construct a classical learning theory for deep learning.
- ^
The “high level properties of h” part is how they introduce data-dependency into their bound, in order to escape the Zhang et al. 2016 result.
- ^
Each coordinate of a training noise vector z_i (and of a fresh test point’s noise vector) is an independent, identically distributed zero-mean Gaussian with variance 1/D, so the variance of the dot product between two such vectors scales as 1/D. Summing over the n independent training noise vectors, the magnitude of the contribution from the noise dimensions scales approximately as sqrt(n/D), which is small since there are D >> n noise dimensions.
- ^
The math here is the same as in the previous footnote, with the fresh noise vector replaced by the training point’s own noise vector, whose dot product with itself concentrates around 1 rather than around 0.
- ^
Astute readers may notice that uniform convergence bounds only need to hold on a 1 − δ fraction of the samples from the distribution. This is where the naturalness of S′ comes in: it turns out it’s not possible to exclude all of the possible bad datasets S′ within the δ probability mass of allowed failures.
- ^
Astute readers may notice that this is in the opposite direction from the one usually considered in machine learning: instead of test error being large and train error being small, we have train error being large and test error being small. This example clearly shows that the standard two-sided uniform convergence-based generalization bounds from statistical learning theory – bounds on the absolute difference between the train and test error – can be vacuous. A natural question is, can we escape this by resorting to one-sided bounds, where we only upper bound the test error using the train error? In the appendix, Nagarajan and Kolter show that many of the so-called “one-sided” PAC-Bayes generalization bounds in the literature are lower-bounded by a two-sided uniform convergence bound. Later work has tried to derive one-sided generalization bounds that escape their argument, albeit without much practical success.
- ^
This setting was first introduced by Gilmer et al. 2018’s Adversarial Spheres to study adversarial examples.