The work linked in this post was IMO the most important work done on understanding neural networks at the time it came out, and it has also significantly changed the way I think about optimization more generally.
That said, there’s a lot of “noise” in the linked papers; it takes some digging to see the key ideas and the data backing them up, and there’s a lot of space spent on things which IMO just aren’t that interesting. So, I’ll summarize the things which I consider central.
When optimizing an overparameterized system, there are many many different parameter settings which achieve optimality. Optima are not peaks, they’re ridges; there’s a whole surface on which optimal performance is achieved. In this regime, the key question is which of the many optima an optimized system actually converges to.
Here’s a kind-of-silly way to model it. First, we sample some random point in parameter space from the distribution P[θ]; in the neural network case, this is the parameter initialization. Then, we optimize: we find some new parameter values θ′ such that f(θ′) is maximized. But which of the many optimal θ′ values does our optimizer end up at? If we didn’t know anything about the details of the optimizer, one simple guess would be that θ′ is sampled from the initialization distribution, but updated on the point being optimal, i.e.
θ′ ∼ P[θ | f(θ) is maximal] = (1/Z) · I[f(θ) is maximal] · P[θ]
(where Z is a normalizing constant and I[⋅] is the indicator function)
… so the net effect of randomly initializing and then optimizing is equivalent to using the initialization distribution as a prior, doing a Bayesian update on θ′ being optimal, and then sampling from that posterior.
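To make the model concrete, here’s a minimal sketch of what “update the initialization prior on optimality” means. This is my own toy example, not something from the papers: the objective f, the Gaussian “initialization” prior, and the rejection tolerance are all stand-ins chosen for illustration. The objective is maximal on an entire line of parameter settings (a ridge), and the posterior is approximated by rejection sampling from the prior.

```python
# A toy illustration (not from the papers) of "prior updated on optimality".
# Assumed setup: the objective f is maximal on the whole line theta_1 + theta_2 = 1
# (a ridge of optima), and the "initialization prior" is a standard Gaussian.
import numpy as np

rng = np.random.default_rng(0)

def f(theta):
    # Maximal (value 0) everywhere on the line theta_1 + theta_2 = 1.
    return -(theta[..., 0] + theta[..., 1] - 1.0) ** 2

# Sample from the prior, then condition on (near-)optimality by rejection:
# keep only samples whose objective is within a small tolerance of the maximum.
prior_samples = rng.normal(size=(2_000_000, 2))
near_optimal = prior_samples[f(prior_samples) > -1e-4]

# Coordinate along the ridge (the one remaining degree of freedom at optimality).
along_ridge = (near_optimal[:, 0] - near_optimal[:, 1]) / 2.0

# The surviving samples are spread along the ridge in proportion to the prior's
# density there; that is all the posterior P[theta | f(theta) is maximal] says.
hist, edges = np.histogram(along_ridge, bins=20, range=(-2, 2), density=True)
print(np.round(hist, 3))
```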
The linked papers show that this kind-of-silly model is basically accurate. It didn’t have to be this way a priori; we could imagine that the specifics of SGD favored some points over others, so that the distribution of θ′ was not proportional to the prior. But that mostly doesn’t happen (and to the extent it does, it’s a relatively weak effect); the data shows that θ′ values are sampled roughly in proportion to their density in the prior, exactly as we’d expect from the Bayesian-update-on-optimality model.
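To show what that claim looks like in miniature, here’s another sketch (again my own toy setup, not the papers’ experiments): run plain gradient descent from many random initializations on the same kind of ridge-shaped loss, and compare where it lands against the Bayesian-update-on-optimality prediction.

```python
# A toy comparison (not the papers' experiments): where does gradient descent actually
# land on a ridge of optima, versus the Bayesian-update-on-optimality model?
# Same assumed setup as the sketch above: loss (theta_1 + theta_2 - 1)^2, Gaussian init.
import numpy as np

rng = np.random.default_rng(1)

def run_gd(thetas, lr=0.1, steps=200):
    # Vectorized full-batch gradient descent on all initializations at once.
    thetas = thetas.copy()
    for _ in range(steps):
        g = 2.0 * (thetas[:, 0] + thetas[:, 1] - 1.0)  # dL/dtheta_1 = dL/dtheta_2
        thetas -= lr * g[:, None]
    return thetas

# Where gradient descent ends up, starting from the initialization prior.
inits = rng.normal(size=(50_000, 2))
gd_optima = run_gd(inits)
gd_along_ridge = (gd_optima[:, 0] - gd_optima[:, 1]) / 2.0

# The Bayesian-update-on-optimality prediction: prior conditioned on optimality.
prior = rng.normal(size=(2_000_000, 2))
posterior = prior[(prior[:, 0] + prior[:, 1] - 1.0) ** 2 < 1e-4]
posterior_along_ridge = (posterior[:, 0] - posterior[:, 1]) / 2.0

# On this toy problem the two distributions over optima agree closely; the papers'
# finding is that roughly the same agreement holds for SGD-trained neural nets.
for label, xs in [("gradient descent", gd_along_ridge), ("posterior model ", posterior_along_ridge)]:
    hist, _ = np.histogram(xs, bins=10, range=(-2, 2), density=True)
    print(label, np.round(hist, 2))
```

In this linear toy case the agreement is exact (gradient descent preserves the along-ridge component of the initialization, and for an isotropic Gaussian that component is independent of the direction being conditioned on); the interesting empirical content of the papers is that roughly the same agreement shows up in messy, nonlinear, SGD-trained networks.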
One implication of this is that the good generalization of neural nets must come mostly from the prior, not from some bias in SGD, because bias in SGD mostly just doesn’t impact the distribution of optimized parameter values. The optimized parameter distribution is approximately determined by the initialization prior, so any generalization must come from that prior. And indeed, the papers also confirm that the observed generalization error lines up with what we’d expect from the Bayesian-update-on-optimality model.
For me, the most important update from this work has not been specific to neural nets. It’s about overparameterized optimization in general: we can think of overparameterized optimization as sampling from the initialization prior updated on optimality, i.e. P[θ|f(θ) is maximal]. This is a great approximation to work with analytically, and the papers here show that it is realistic for real complicated systems like SGD-trained neural nets.