The other paper that killed deep learning theory
Yesterday, I wrote about the state of deep learning theory circa 2016,[1] as well as the bombshell 2016 paper by Zhang et al. that arguably signaled its demise. Today, I cover the aftermath, and the 2019 paper that devastated deep learning theory again.
As a brief summary, I argued that the rise of deep learning posed an existential challenge to the dominant theoretical paradigm of statistical learning theory, because the hypothesis class of neural networks is enormous by classical complexity measures. The response from the field was to attempt to quantify other ways in which the hypothesis class of neural networks in practice was simple, using alternative metrics of complexity. Zhang et al. 2016 showed that standard neural network architectures trained with standard training methods could memorize large quantities of randomly labelled data, demonstrating that no such argument could explain the generalization properties of neural networks.
Today we’re going to look at the aftermath: how did the field of deep learning theory react to this paper? What were the attempts to get around this result using data-dependent generalization bounds? And why did Nagarajan and Kolter’s humbly titled Uniform convergence may be unable to explain generalization in deep learning serve as the proverbial final nail in the coffin of this line of work?
Let’s briefly return to what exactly the Zhang et al. paper showed. Yesterday, I wrote:
The authors’ results show that the same class of neural networks, trained with the same learning algorithm, can generalize when given true labels and memorize random ones. This shows that the hypothesis class of neural networks that are learnable with standard techniques cannot be simple in any useful sense, at least for complexity measures that depend only on properties of the hypothesis class and (data-independent) properties of the learning algorithm.
(emphasis added)
Notably, there was an important caveat to the results: what Zhang et al. showed was that there existed some datasets where neural networks could learn to overfit. This left open the possibility of finding data-dependent generalization bounds, based on properties of particular neural networks trained on particular datasets. For example, it remained a live possibility that a generalization bound could say, “if a trained neural network has small weight norm/is compressible/has large margin, and it was trained on sufficiently many data points, then its test error will be no more than epsilon higher than its training error.”
And that’s exactly what some of the researchers in this field did. Bartlett, Foster, and Telgarsky’s Spectrally-normalized margin bounds for neural networks (2017) introduced a notion of complexity based on the spectral norm and reference matrices (“spectral complexity”); they then used the spectral complexity and the margin of a learned neural network classifier to bound its generalization error. Concurrently, Neyshabur, Bhojanapalli, and Srebro’s A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks (2018) also used a combination of spectral norm and distance from initialization to argue for generalization, albeit in a Bayesian framework. Even Nagarajan and Kolter gestured toward their own data-dependent generalization bound in their 2017 workshop paper Generalization in Deep Networks: The Role of Distance from Initialization, where the complexity of a neural network is related to how far its weights have changed from the initialization.
Bartlett, Foster, and Telgarsky’s spectral-complexity and margin-based bound. This is not important for the piece, I’m just including it to show that there was serious math involved! Astute readers may notice that, assuming each entry in the dataset X is bounded, the scaling factor of error with respect to the number of data points n is still the classic 1/sqrt(n).
The common form of these bounds is that they bound the generalization gap in terms of some spectral-norm-derived complexity measure of the trained weights, divided by both the margin between the training data and the decision boundary and the square root of the number of datapoints. Neural networks trained on real labels tended to have lower spectral norms and larger margins between their decision boundaries and the training data. Thus, given enough data points, the bound would be small and the networks should generalize!
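The Bartlett–Foster–Telgarsky result, for instance, is roughly of the following shape (my paraphrase, with constants and log factors suppressed; here L-hat-gamma is the fraction of training points classified with margin below gamma, X is the matrix of training inputs, and R_A is their “spectral complexity” of the weight matrices A):

```latex
\Pr\left[\text{test misclassification}\right] \;\lesssim\; \hat{L}_\gamma \;+\; \tilde{O}\!\left(\frac{\lVert X \rVert_F \, R_{\mathcal{A}}}{\gamma\, n}\right) \;+\; \sqrt{\frac{\ln(1/\delta)}{n}}
```

Since the Frobenius norm of X grows like sqrt(n) when each entry is bounded, the middle term scales as 1/sqrt(n), matching the classic rate.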
An implicit assumption of all of these bounds is that (like most bounds from statistical learning theory) they demonstrate uniform convergence – they apply regardless of what the hypothesis is. More precisely, with high probability over possible training sets, the test error and train error are close for every hypothesis h in the hypothesis class simultaneously. Specifically, such a bound has the form:
With high probability over possible training sets S, for all h in the hypothesis class, we have |expected test error of hypothesis h − empirical error of h on S| ≤ (some bound involving the size of the training data and high-level properties of h).[2]
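Written out, such a two-sided uniform convergence bound has the schematic form (a generic template, not any specific theorem; C(h) stands for whatever data-dependent complexity measure the bound uses):

```latex
\Pr_{S \sim \mathcal{D}^n}\Big[\ \forall h \in \mathcal{H}:\ \big|L_{\mathcal{D}}(h) - \hat{L}_S(h)\big| \le \epsilon\big(n, C(h)\big)\ \Big] \;\ge\; 1 - \delta
```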
This type of bound is both very natural and very useful: a trained neural network is some h in the hypothesis class, so the bound immediately controls the gap between the training error we can measure and the test error we care about.
It is at this entire genre of bound that Nagarajan and Kolter take aim in their 2019 paper.
So, what did Nagarajan and Kolter’s 2019 paper show?
First, they show empirically that the bounds in the literature are not only vacuous, they scale in the wrong direction. Nagarajan and Kolter are able to easily train small, 5-layer neural networks on MNIST such that there’s a margin of at least 10 (in logit space) on 99% of the training data. As you might expect, in this setting, the test error of their trained networks decreases with the number of data points n, following an approximate power law.
The problem is that in this setting, the complexity measures proposed in the literature also follow power laws – in the wrong direction. Notably, the spectral norm of the learned weight matrices scales roughly linearly in n, and the distance from initialization likewise grows with n.
It doesn’t say so in the caption, but Nagarajan and Kolter’s figure 1 was a devastating rejoinder to the main approach taken in learning theory following Zhang et al. And this time, I’m certain this figure took fewer flops to produce than a single press of the integrated Claude button in the LW editor.
It’s worth emphasizing again how crushing this result is. Traditionally, in learning theory, the expectation is that the generalization error scales approximately as 1/sqrt(n): more data means better generalization. But if the complexity measure in a bound’s numerator itself grows polynomially with n, the bound can grow with the dataset size even as the actual test error falls. These bounds aren’t just loose; they get the direction of the data-dependence wrong.
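To make the arithmetic concrete: with a schematic bound of the form complexity / (margin · sqrt(n)), a complexity measure that grows linearly in n yields a bound that grows like sqrt(n). The constants below are made up for illustration; only the scaling matters.

```python
import math

def schematic_bound(n, complexity, margin=10.0):
    """Generic margin-bound shape: complexity / (margin * sqrt(n))."""
    return complexity / (margin * math.sqrt(n))

# Hypothetical complexity measure that grows linearly with dataset size,
# loosely modeled on what Nagarajan and Kolter observed for spectral norms.
complexity = lambda n: 0.5 * n

for n in [100, 400, 1600]:
    print(n, schematic_bound(n, complexity(n)))
# The "bound" doubles each time n quadruples -- it scales like sqrt(n),
# moving in the opposite direction from the actual test error.
```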
Second, they construct an overparameterized linear setting where (two-sided) uniform convergence bounds provably fail. The full construction is beyond the scope of this blog post, but I think the core idea is quite elegant; readers interested in the mathematical details are encouraged to read the paper themselves.
Consider fitting a linear classifier w on n training examples. The inputs to this classifier are K + D dimensional, where the first K dimensions contain a deterministic signal and the remaining D >> n dimensions are Gaussian noise, with zero mean and variance scaled to 1/D. After a single gradient step from zero, the classifier’s first K dimensions align with the signal, while its remaining D dimensions are the sum of the n label-weighted training noise vectors.
It’s easy to see why w would generalize: for each new datapoint, the signal in the first K dimensions will remain large (since it’s deterministic). In contrast, the D noise directions are sampled independently for each new datapoint, and the dot product of this fresh noise vector with the final D dimensions of our linear classifier concentrates around zero.[3]
To show that uniform convergence fails, the authors construct a natural “bad” dataset S′: take each training example, keep its label, and negate its noise components. Because the Gaussian noise distribution is symmetric, S′ is exactly as plausible a draw from the data distribution as S itself – yet the trained classifier misclassifies every point in it.
In more mathematical terms, our linear classifier trained on S has noise-block weights ≈ sum_i y_i z_i. For training point i, the noise dimensions contribute ≈ ||z_i||^2 ≈ 1 to its (correctly signed) score,[4] which dominates the small contribution from the signal dimensions. Negating the noise flips that contribution to ≈ −1, so the classifier gets every point of S′ wrong – even though it generalizes well to fresh data.
For interested readers, here’s the actual mathematical construction from Nagarajan and Kolter, as well as the statement of the formal theorem that gradient descent provably does well while any uniform convergence bound is provably vacuous.
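To see the mechanism concretely, here’s a small numerical sketch of my reading of the linear construction: one signal dimension plus D >> n Gaussian noise dimensions, a classifier obtained by a single gradient step from zero (w = sum_i y_i x_i), and a bad set S′ made by negating the noise components of the training points. All constants (n, D, signal strength) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, s = 50, 10_000, 0.1   # training points, noise dims, signal strength (illustrative)

# Each input is [signal, noise]: one deterministic signal coordinate y*s,
# plus D Gaussian noise coordinates with variance 1/D (so each noise vector has norm ~1).
y = rng.choice([-1.0, 1.0], size=n)
Z = rng.normal(0.0, np.sqrt(1.0 / D), size=(n, D))
X = np.concatenate([(y * s)[:, None], Z], axis=1)

# One gradient step from zero init on the linear loss -y*<w, x>: w = sum_i y_i x_i.
# The noise block of w is the sum of the label-weighted training noise vectors.
w = (y[:, None] * X).sum(axis=0)

def accuracy(inputs, labels):
    return float(np.mean(np.sign(inputs @ w) == labels))

# Fresh test points: same deterministic signal, independently drawn noise.
y_test = rng.choice([-1.0, 1.0], size=500)
Z_test = rng.normal(0.0, np.sqrt(1.0 / D), size=(500, D))
X_test = np.concatenate([(y_test * s)[:, None], Z_test], axis=1)

# "Bad" set S': the training points with their noise components negated.
# Under the symmetric Gaussian, S' is exactly as likely a draw as S.
X_bad = np.concatenate([(y * s)[:, None], -Z], axis=1)

print("train acc:", accuracy(X, y))            # memorized: 1.0
print("test acc: ", accuracy(X_test, y_test))  # signal dominates fresh noise: ~1.0
print("S' acc:   ", accuracy(X_bad, y))        # negated noise flips every point: 0.0
```

The classifier’s noise block memorizes sum_i y_i z_i, so each training point gets a noise “bonus” of ≈ ||z_i||^2 ≈ 1 that outweighs the signal term n·s^2 < 1; negating the noise turns the bonus into a penalty, while fresh test points only see an O(sqrt(n/D)) noise contribution.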
Third, they translate their theoretical overparameterized linear example into an empirical result on a shallow ReLU network. Nagarajan and Kolter construct a dataset where each data point lies on one of two 1000-dimensional hyperspheres, of radius 1 or radius 1.1.[7] They then train a 2-layer ReLU network with 100k hidden units on this dataset, until 99% of the training data is classified with margin 10. They find the standard result that the test error decreases with n, following an approximate power law.
They then construct a bad dataset S′ by taking their original training dataset S, projecting each data point onto the opposite hypersphere, and swapping its label. Their ReLU classifier consistently gets 0% accuracy on this projected dataset S′. This replicates the result from the overparameterized linear case, in a setting that is neither linear nor overparameterized in its inputs.
The authors examine 2D slices of the learned decision boundary to understand why the classifier simultaneously gets both high test accuracy and 0% accuracy on the projected dataset. Note that by construction, the true classification boundary lies between the two hyperspheres.
Near training data points, the decision boundary bulges away from the example, enough that the boundary actually encompasses both hyperspheres in the vicinity of the data point. This means that projecting a data point onto the other hypersphere doesn’t change the way the network classifies it. Since the true label is flipped by this projection, the network gets the projected data point wrong. As S’ projects all of the training datapoints to the opposite hypersphere, it makes sense that the network gets 0% accuracy on S’.
In contrast, when you look at random 2D slices, the decision boundary tends to lie between the two hyperspheres. The bulges induced by the training examples have “averaged out”. So for any new, randomly generated test datapoint, the network is likely to classify the example correctly.
Figure 2 from Nagarajan and Kolter 2019 shows two things. On the leftmost figure, they show that as the number of training datapoints increases, the test error decreases, but the error on the bad training dataset S’ remains constant at 1. In the two middle figures, they show that near training datapoints, the decision boundary bulges away from the train examples, leading to misclassifications in that area. However, in the rightmost figure, they show that near new, random datapoints, the decision boundary is likely to average out and lie between the two hyperspheres.
This leads the authors to speculate that this is a general phenomenon: SGD on neural networks learns classifiers that are simple on the macroscopic scale, but complex on the microscopic scale. The microscopic complexity is what stops uniform convergence bounds from working, while the macroscopic simplicity explains generalization. In the case of their ReLU example, the exact decision boundary bulges around the sampled training datapoints (microscopic complexity), leading to a vacuous uniform convergence bound; near new test datapoints, this complexity averages out (macroscopic simplicity), leading to good test performance.
The results from Nagarajan and Kolter are a devastating blow to post-Zhang et al. deep learning theory: not only did they demonstrate that the data-dependent bounds created by the field scaled in the wrong direction, they provided an overparameterized setting where the entire approach taken by statistical learning theory – uniform convergence bounds – provably did not work.
A more constructive way of thinking about Nagarajan and Kolter’s work is that it added further restrictions to what results could possibly explain neural network generalization. Namely, it showed that any such result needs to be algorithm- and data-dependent in a much stronger sense than the spectral-norm bounds. It needs to give up on bounding the worst-case generalization gap over all hypotheses in a class. And it needs to find some way to handle the complex microscopic structures induced by SGD on neural networks without losing sight of the macroscopic regularity that actually explains generalization.
But almost a decade later, we still don’t have those results. Maybe one of you reading this will write the paper that explains generalization. In the meantime, I’ll pull an academic move, and leave the task of coming up with better deep learning theory as an exercise to the reader.
- ^
To clarify, by “deep learning theory” I mean work “extending statistical learning theory to deep neural networks trained with SGD, in order to derive generalization bounds that would explain their behavior in practice” – that is, not any theoretical approach to deep learning, but specifically the attempts to construct a classical learning theory for deep learning.
- ^
The “high level properties of h” part is how they introduce data-dependency into their bound, in order to escape the Zhang et al. 2016 result.
- ^
Each coordinate of a training noise vector z_i (and of a fresh test point’s noise vector) is an independent, identically distributed zero-mean Gaussian with variance 1/D, so the variance of the dot product between two such vectors scales as 1/D. Summing over the n independent training noise vectors, the magnitude of the contribution from the noise dimensions scales approximately as sqrt(n/D), which is small since there are D >> n noise dimensions.
- ^
The math here is the same as in the previous footnote, with the fresh noise vector replaced by the training point’s own noise vector, whose dot product with itself concentrates around 1 rather than around 0.
- ^
Astute readers may notice that uniform convergence bounds only need to hold on a 1 − δ fraction of the samples from the distribution. This is where the naturalness of S′ comes in: it turns out it’s not possible to exclude all of the possible bad datasets S′ within the δ probability mass of allowed failures.
- ^
Astute readers may notice that this is in the opposite direction from the one usually considered in machine learning: instead of test error being large and train error being small, we have train error being large and test error being small. This example clearly shows that the standard two-sided uniform convergence-based generalization bounds from statistical learning theory – bounds on the absolute difference between the train and test error – can be vacuous. A natural question is, can we escape this by resorting to one-sided bounds, where we only upper bound the test error using the train error? In the appendix, Nagarajan and Kolter show that many of the so-called “one-sided” PAC-Bayes generalization bounds in the literature are lower-bounded by a two-sided uniform convergence bound. Later work has tried to derive one-sided generalization bounds that escape their argument, albeit without much practical success.
- ^
This setting was first introduced by Gilmer et al. 2018’s Adversarial Spheres to study adversarial examples.