Inductive biases stick around
This post is a follow-up to Understanding “Deep Double Descent”.
I was talking to Rohin at NeurIPS about my post on double descent, and he asked the very reasonable question of why exactly I think double descent is so important. I realized that I hadn’t fully explained that in my previous post, so the goal of this post is to further address the question of why you should care about double descent from an AI safety standpoint. This post assumes you’ve read my Understanding “Deep Double Descent” post, so if you haven’t already, you should read that one first.
Specifically, I think double descent demonstrates a result that is, in my opinion, very important yet counterintuitive: larger models can actually be simpler than smaller models. On its face, this sounds somewhat crazy: how can a model with more parameters be simpler? But in fact I think this is just a very straightforward consequence of double descent: in the double descent paradigm, larger models with zero training error generalize better than smaller models with zero training error because they do better on SGD’s inductive biases. And if you buy that SGD’s inductive biases are approximately simplicity, that means that larger models with zero training error are simpler than smaller models with zero training error.
Obviously, larger models do have more parameters than smaller ones, so if that’s your measure of simplicity, larger models will always be more complicated, but for other measures of simplicity that’s not necessarily the case. For example, it could hypothetically be the case that larger models have lower Kolmogorov complexity. Though I don’t actually think that’s true in the case of K-complexity, I think that’s only for the boring reason that model weights have a lot of noise. If you had a way of somehow only counting the “essential complexity,” I suspect larger models would actually have lower K-complexity.
Really, what I’m trying to do here is dispel what I see as the myth that, as ML models get more powerful, simplicity will stop mattering for them. In a Bayesian setting, it is a fact that the impact of your prior on your posterior (for those regions where your prior is non-zero) becomes negligible as you update on more and more data. I have sometimes heard it claimed that as a consequence of this result, as we move to doing machine learning with ever larger datasets and ever bigger models, the impact of our training processes’ inductive biases will become negligible. However, I think that’s quite wrong, and I think double descent does a good job of showing why, because all of the performance gains you get past the interpolation threshold are coming from your implicit prior. Thus, if you suspect modern ML to mostly be in that regime, what will matter in terms of which techniques beat out other techniques is how good they are at compressing their data into the “actually simplest” model that fits it.
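To make the Bayesian fact mentioned above concrete, here is a minimal sketch of a prior "washing out" in a Beta-Bernoulli model. Everything in it is an illustrative assumption rather than anything from the double descent literature: the two priors, the true rate of 0.7, and the idealized data (exactly the expected number of successes) are chosen only to show that two quite different priors yield nearly identical posterior means once there is enough data.

```python
# Sketch: Beta-Bernoulli updating under two different priors (illustrative only).

def posterior_mean(alpha, beta, heads, n):
    """Posterior mean of a Bernoulli parameter under a Beta(alpha, beta) prior."""
    return (alpha + heads) / (alpha + beta + n)

def prior_gap(n, true_rate=0.7):
    """Gap between posterior means under a uniform prior and a skewed prior."""
    heads = round(true_rate * n)  # idealized data: exactly the expected count
    uniform = posterior_mean(1, 1, heads, n)   # Beta(1, 1): uniform prior
    skewed = posterior_mean(1, 20, heads, n)   # Beta(1, 20): strongly favors low rates
    return abs(uniform - skewed)

gaps = {n: prior_gap(n) for n in (10, 100, 1000, 10000)}
for n, gap in gaps.items():
    print(f"n={n:>6}  gap between posterior means = {gap:.4f}")
```

The gap shrinks as n grows, which is the sense in which the prior stops mattering. Note that this is a fixed, small hypothesis class with more and more data, and that is exactly the condition that fails past the interpolation threshold, where many hypotheses fit the data equally well.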
Furthermore, even just from the simple Bayesian perspective, I suspect you can still get double descent. For example, suppose your training process looks like the following: you have some hypothesis class that keeps getting larger as you train and at each time step you select the best a posteriori hypothesis. I think that this setup will naturally yield a double descent for noisy data: first you get a “likelihood descent” as you get hypotheses with greater and greater likelihood, but then you start overfitting to noise in your data as you get close to the interpolation threshold. Past the interpolation threshold, however, you get a second “prior descent” where you’re selecting hypotheses with greater and greater prior probability rather than greater and greater likelihood. I think this is a good model for how modern machine learning works and what double descent is doing.
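The setup above can be sketched in a toy linear setting. This is a sketch under stated assumptions, not a reproduction of any experiment: the hypothesis class of size p is "linear functions of the first p columns of a fixed random feature matrix" (so the classes are nested), the labels are pure noise, and the prior is assumed to be a zero-mean Gaussian on the weights, so that a smaller weight norm means higher prior probability. All names (`features`, `labels`, `select_hypothesis`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, max_features = 20, 200

# Hypothetical data: noisy labels and a fixed bank of random features.
# Using the first p columns makes the hypothesis classes nested as p grows.
features = rng.normal(size=(n_train, max_features))
labels = rng.normal(size=n_train)

def select_hypothesis(p):
    """Least-squares fit on the first p features.

    For p >= n_train, np.linalg.lstsq returns the minimum-norm interpolating
    solution, which is the MAP limit under a zero-mean Gaussian prior on the
    weights as the assumed label noise goes to zero.
    """
    X = features[:, :p]
    w, *_ = np.linalg.lstsq(X, labels, rcond=None)
    train_mse = float(np.mean((X @ w - labels) ** 2))
    return train_mse, float(np.linalg.norm(w))

sizes = [5, 10, 20, 40, 80, 200]
results = {p: select_hypothesis(p) for p in sizes}
for p, (mse, norm) in results.items():
    print(f"p={p:>3}  train MSE={mse:.2e}  ||w||={norm:.3f}")
```

Below the interpolation threshold (p < 20 here), growing the class lowers the training error, which corresponds to the “likelihood descent”. At and past the threshold the training error is already essentially zero, so the only thing left to improve is the weight norm, and because the classes are nested it can only go down as p grows, which corresponds to the “prior descent”. The sketch only tracks these training-side quantities; it does not by itself show the test-error curve.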
All of this is only for models with zero training error, however—before you reach zero training error larger models can certainly have more essential complexity than smaller ones. That being said, if you don’t do very many steps of training then your inductive biases will also matter a lot because you haven’t updated that much on your data yet. In the double descent framework, the only region where your inductive biases don’t matter very much is right on the interpolation threshold—before the interpolation threshold or past it they should still be quite relevant.
Why does any of this matter from a safety perspective, though? Ever since I read Belkin et al. I’ve had double descent as part of my talk version of “Risks from Learned Optimization” because I think it addresses a pretty important part of the story for mesa-optimization. That is, mesa-optimizers are simple, compressed policies—but as ML moves to larger and larger models, why should that matter? The answer, I think, is that larger models can generalize better not just by fitting the data better, but also by being simpler.
Note that double descent happens even without explicit regularization, so the prior we’re talking about here is the implicit one imposed by the architecture you’ve chosen and the fact that you’re training it via SGD.
Which is exactly what you should expect if you think Occam’s razor is the right prior: if two hypotheses have the same likelihood but one generalizes better, according to Occam’s razor it must be because it’s simpler.