Updating the Lottery Ticket Hypothesis

Epistemic status: not confident enough to bet against someone who’s likely to understand this stuff.

The lottery ticket hypothesis of neural network learning (as aptly described by Daniel Kokotajlo) roughly says:

When the network is randomly initialized, there is a sub-network that is already decent at the task. Then, when training happens, that sub-network is reinforced and all other sub-networks are dampened so as to not interfere.

This is a very simple, intuitive, and useful picture to have in mind, and the original paper presents interesting evidence for at least some form of the hypothesis. Unfortunately, the strongest forms of the hypothesis do not seem plausible—e.g. I doubt that today’s neural networks already contain dog-recognizing subcircuits at initialization. Modern neural networks are big, but not that big. (See this comment for some clarification of this claim.)

Meanwhile, a cluster of research has shown that large neural networks approximate certain Bayesian models, involving phrases like “neural tangent kernel (NTK)” or “Gaussian process (GP)”. Mingard et al. show that these models explain the large majority of the good performance we see from large neural networks in practice. This view also implies a version of the lottery ticket hypothesis, but it has different implications for what the “lottery tickets” are. They’re not subcircuits of the initial net, but rather subcircuits of the parameter tangent space of the initial net.

This post will sketch out what that means.

Let’s start with the jargon: what’s the “parameter tangent space” of a neural net? Think of the network as a function $f$ with two kinds of inputs: parameters $\theta$, and data inputs $x$. During training, we try to adjust the parameters so that the function sends each data input to the corresponding data output, i.e. find $\theta$ for which $f(\theta, x_n) = y_n$ for all $n$. Each data point gives an equation which $\theta$ must satisfy, in order for that data input to be exactly mapped to its target output. If our initial parameters $\theta_0$ happen to be close enough to a solution to those equations, then we can (approximately) solve this using a linear approximation: we look for $\Delta\theta$ such that

$$y_n = f(\theta_0, x_n) + \Delta\theta \cdot \nabla_\theta f(\theta_0, x_n)$$

The right-hand-side of that equation is essentially the parameter tangent space. More precisely, (what I’m calling) the parameter tangent space at $\theta_0$ is the set of functions $f_{\Delta\theta}$ of the form

$$f_{\Delta\theta}(x) = f(\theta_0, x) + \Delta\theta \cdot \nabla_\theta f(\theta_0, x)$$

… for some $\Delta\theta$.

In other words: the parameter tangent space is the set of functions which can be written as linear approximations (with respect to the parameters) of the network.
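To make this concrete, here’s a minimal sketch in JAX (my own illustration, not from the original sources; the tiny MLP, its sizes, and all the names are made up for the example) of evaluating one member of the parameter tangent space: the linearization of the network around its initial parameters $\theta_0$ in the direction of a displacement $\Delta\theta$, computed with a Jacobian-vector product.

```python
import jax
import jax.numpy as jnp

def init_params(key, sizes=(2, 16, 1)):
    """Random initial parameters theta_0 for a tiny MLP (illustrative only)."""
    keys = jax.random.split(key, len(sizes) - 1)
    return [(jax.random.normal(k, (m, n)) / jnp.sqrt(m), jnp.zeros(n))
            for k, m, n in zip(keys, sizes[:-1], sizes[1:])]

def f(params, x):
    """The network, viewed as a function of (parameters, data input)."""
    for W, b in params[:-1]:
        x = jnp.tanh(x @ W + b)
    W, b = params[-1]
    return (x @ W + b).squeeze(-1)

def tangent_space_member(params0, delta, x):
    """f(theta_0, x) + delta . grad_theta f(theta_0, x), computed as a
    Jacobian-vector product so the full gradient is never materialized."""
    f_at_theta0, directional_deriv = jax.jvp(lambda p: f(p, x), (params0,), (delta,))
    return f_at_theta0 + directional_deriv

params0 = init_params(jax.random.PRNGKey(0))
# An arbitrary small displacement delta-theta (same pytree structure as params0).
delta = jax.tree_util.tree_map(lambda p: 0.01 * jnp.ones_like(p), params0)
x = jax.random.normal(jax.random.PRNGKey(1), (5, 2))

print(tangent_space_member(params0, delta, x))                 # the linearized network
print(f(jax.tree_util.tree_map(jnp.add, params0, delta), x))   # the actual network at theta_0 + delta
```

Varying $\Delta\theta$ sweeps out the whole tangent space; each choice of $\Delta\theta$ corresponds to one function in that set.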

The main empirical finding which led to the NTK/GP/Mingard et al. picture of neural nets is that, in practice, that linear approximation works quite well. As neural networks get large, their parameters change by only a very small amount during training, so the overall $\Delta\theta$ found during training is actually nearly a solution to the linearly-approximated equations.
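Here’s a rough, self-contained sketch of how one might check that claim on a toy problem (again my own illustration; the architecture, learning rate, step count, and target function are all arbitrary choices): train a small-but-wide MLP with plain gradient descent, then compare the trained network’s outputs against the tangent-space prediction at the same parameter displacement. How closely they agree depends on the width and step size.

```python
import jax
import jax.numpy as jnp

def init_params(key, sizes=(2, 512, 1)):
    """Random initial parameters theta_0 for a fairly wide MLP."""
    keys = jax.random.split(key, len(sizes) - 1)
    return [(jax.random.normal(k, (m, n)) / jnp.sqrt(m), jnp.zeros(n))
            for k, m, n in zip(keys, sizes[:-1], sizes[1:])]

def f(params, x):
    for W, b in params[:-1]:
        x = jnp.tanh(x @ W + b)
    W, b = params[-1]
    return (x @ W + b).squeeze(-1)

def loss(params, x, y):
    return jnp.mean((f(params, x) - y) ** 2)

params0 = init_params(jax.random.PRNGKey(0))
x = jax.random.normal(jax.random.PRNGKey(1), (10, 2))
y = jnp.sin(x[:, 0]) + x[:, 1]                     # an arbitrary regression target

@jax.jit
def gd_step(params):
    grads = jax.grad(loss)(params, x, y)
    return jax.tree_util.tree_map(lambda p, g: p - 3e-3 * g, params, grads)

params = params0
for _ in range(3000):                              # plain full-batch gradient descent
    params = gd_step(params)

# How far did the parameters actually move? (Typically a small fraction for wide nets.)
delta = jax.tree_util.tree_map(jnp.subtract, params, params0)
sq_norm = lambda tree: sum(jnp.sum(a ** 2) for a in jax.tree_util.tree_leaves(tree))
print(jnp.sqrt(sq_norm(delta) / sq_norm(params0)))

# Tangent-space prediction at the trained displacement:
# f(theta_0, x) + delta . grad_theta f(theta_0, x), via a Jacobian-vector product.
f0, directional_deriv = jax.jvp(lambda p: f(p, x), (params0,), (delta,))
print(f(params, x))            # the trained network's outputs
print(f0 + directional_deriv)  # the linearized network's outputs (close when wide enough)
```

If you shrink the width or crank up the learning rate, the two outputs drift apart; the NTK/GP picture is an approximation that gets better as nets get wider.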

Major upshot of all this: the space-of-models “searched over” during training is approximately just the parameter tangent space.

At initialization, we randomly choose $\theta_0$, and that determines the parameter tangent space—that’s our set of “lottery tickets”. The SGD training process then solves the equations—it picks out the lottery tickets which perfectly match the data. In practice, there will be many such lottery tickets—many solutions to the equations—because modern nets are extremely overparameterized. SGD effectively picks one of them at random (that’s one of the main results of the Mingard et al. work).
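To see where the “many solutions” come from, note that in the linearized picture training solves a linear system: stacking the equations $y_n = f(\theta_0, x_n) + \Delta\theta \cdot \nabla_\theta f(\theta_0, x_n)$ gives $J\,\Delta\theta = y - f(\theta_0, X)$, where $J$ is the Jacobian of the network’s outputs with respect to its parameters at $\theta_0$. With far more parameters than data points, that system is wildly underdetermined. A small sketch (my own illustration, with a random matrix standing in for $J$):

```python
import jax
import jax.numpy as jnp

k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
n_data, n_params = 20, 1000                       # heavily overparameterized
J = jax.random.normal(k1, (n_data, n_params))     # stand-in for the Jacobian at theta_0
residual = jax.random.normal(k2, (n_data,))       # stand-in for y - f(theta_0, X)

# One particular solution: the minimum-norm one returned by least squares.
delta_min_norm, *_ = jnp.linalg.lstsq(J, residual)

# Any null-space direction of J can be added without breaking the fit, so there is
# an (n_params - n_data)-dimensional family of displacements that all match the data.
v = jax.random.normal(k3, (n_params,))
v = v - J.T @ jnp.linalg.solve(J @ J.T, J @ v)    # project v onto the null space of J
another_delta = delta_min_norm + v

# Both should print True (up to float32 tolerance): two different "tickets", same fit.
print(jnp.allclose(J @ delta_min_norm, residual, atol=1e-3))
print(jnp.allclose(J @ another_delta, residual, atol=1e-3))
```

The minimum-norm choice here is just for illustration; which of the many solutions SGD actually ends up with is the part the Mingard et al. results address.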

Summary:

  • The “parameter tangent space” of a network is the set of functions which can be written as linear approximations (with respect to the parameters) of the network.

  • The parameter tangent space at the network’s randomly-chosen initial parameters is roughly the set of “lottery tickets”.

  • SGD (effectively) throws out any lottery tickets which don’t perfectly match the data, then randomly picks one of the remaining tickets.

Of course this brushes some things under the rug—e.g. different “lottery tickets” don’t have exactly the same weight, and different architectures may have different type signatures. But if you find the original lottery ticket hypothesis to be a useful mental model, then I expect this to generally be an upgrade to that mental model. It maintains most of the conceptual functionality, but is probably more realistic.

Thank you to Evan, Ajeya, Rohin, Edouard, and TurnTrout for a discussion which led to this post.