SGD’s Bias

There’s a common hypothesis that stochastic gradient descent (SGD) has some kind of built-in bias toward simpler models, or models which generalize better. (It’s essentially the opposite of the NTK/GP/Mingard et al. model.) There is also a rough intuitive argument to expect such a thing a priori: sub-structures which are only useful sometimes will be lost on the occasions when they’re not useful, whereas sub-structures which are consistently useful will be retained. Conceptually, it’s similar to the modularly varying goals model in biological evolution.

Mathematically, it’s not too hard to show that SGD does indeed have a bias, and we can even write it down explicitly given some not-too-unreasonable approximations. This post will walk through that derivation, and give some intuition for where the bias comes from.

First Idea: SGD Is Approximately Brownian

“Brownian” here refers to Brownian motion, i.e. a random walk. In other words, the path taken by SGD looks qualitatively sort of like this:

[Figure: 2D Brownian motion with drift up and to the right.]

The key idea is that each sample used to estimate the gradient is approximately independent (in the probability sense of the word), and the estimate is an average over samples, so the net effect of several steps is approximately normally distributed. That’s the defining feature of Brownian motion. (Alternatively, we can assume that steps are approximately independent and additive, which gets us to the same place with a little more generality but also a little more work.)
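To see that central-limit behavior directly, here’s a quick numpy sketch (the one-parameter loss and heavily skewed data distribution are toy choices of mine, purely for illustration): individual gradient samples are far from normal, but the net displacement over many steps becomes increasingly normal.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_grads(theta, n):
    # Toy loss L(theta, x) = (theta - x)^2 / 2 with x ~ Exponential(1):
    # the per-sample gradients (theta - x) are heavily skewed, i.e. very non-normal.
    return theta - rng.exponential(1.0, size=n)

def displacement(num_steps, batch_size=4, eps=0.01, theta0=1.0):
    # Net displacement of theta after several SGD steps.
    theta = theta0
    for _ in range(num_steps):
        theta -= eps * sample_grads(theta, batch_size).mean()
    return theta - theta0

for k in [1, 10, 100]:
    d = np.array([displacement(k) for _ in range(5_000)])
    z = (d - d.mean()) / d.std()
    print(f"{k:>3} steps: skew = {np.mean(z**3):+.2f}, excess kurtosis = {np.mean(z**4) - 3:+.2f}")
```

The skew and excess kurtosis shrink as steps accumulate; that’s the central limit theorem doing its work.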

Let’s put some math on that.

We use SGD to choose parameters $\theta$ to minimize $E_x[L(\theta, x)]$. Each step, we take $N$ independent samples of $x$, use them to estimate the gradient, then take a step:

$$\theta_{t+1} = \theta_t - \epsilon \frac{1}{N} \sum_{n=1}^{N} \nabla_\theta L(\theta_t, x_n)$$

… where $\epsilon$ scales the step size. This step is an approximately-normal random variable, constructed from the IID random variables $\nabla_\theta L(\theta_t, x_n)$. It has mean $-\epsilon E_x[\nabla_\theta L]$, and variance $\frac{\epsilon^2}{N} \mathrm{Cov}_x[\nabla_\theta L]$ (a covariance matrix, since $\theta$ is a vector).
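As a quick sanity check, here’s a minimal numpy sketch of the update (the quadratic per-sample loss and the Gaussian sampling distribution are toy choices of mine), verifying the step’s empirical mean and variance against the formulas above:

```python
import numpy as np

rng = np.random.default_rng(0)
eps, N, theta = 0.1, 8, 2.0

def grad_L(theta, x):
    # Per-sample gradient of the toy loss L(theta, x) = (theta - x)^2 / 2.
    return theta - x

def sgd_step(theta):
    x = rng.normal(loc=1.0, scale=3.0, size=N)  # N IID samples of x
    return -eps * grad_L(theta, x).mean()       # step = -eps * estimated gradient

steps = np.array([sgd_step(theta) for _ in range(200_000)])

# Predicted: mean = -eps * E[grad L] = -eps * (theta - 1)
#            var  = eps^2 / N * Var[grad L] = eps^2 / N * 3^2
print("mean:", steps.mean(), "vs predicted", -eps * (theta - 1.0))
print("var: ", steps.var(), "vs predicted", eps**2 / N * 9.0)
```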

To formally represent this as a Brownian motion, we declare that the amount-of-time which passes during each SGD step is $dt = \epsilon$, which we assume to be small (i.e. we’ll approximate things to first order in $dt$). Then SGD’s path can be represented as

$$d\theta = -E_x[\nabla_\theta L] \, dt + \sqrt{\frac{\epsilon}{N} \mathrm{Cov}_x[\nabla_\theta L]} \, dW$$

… where $W$ is a standard Brownian motion, the analogue of a standard normal variable, and the square root is a matrix square root. (If you haven’t worked with Brownian motion before, ignore that formula and keep reading.)
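If you’d rather see that correspondence numerically than formally, here’s an Euler–Maruyama discretization of the SDE run head-to-head against actual SGD on the same toy problem as above (again, all the specific numbers are illustrative assumptions of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
eps, N, T = 0.1, 8, 200    # step size, batch size, number of steps
dt = eps                   # one SGD step corresponds to dt = eps of "time"

def run_sgd(theta=2.0):
    for _ in range(T):
        x = rng.normal(1.0, 3.0, size=N)
        theta -= eps * (theta - x).mean()
    return theta

def run_sde(theta=2.0):
    # Euler-Maruyama for: d theta = -E[grad L] dt + sqrt(eps/N * Var[grad L]) dW
    for _ in range(T):
        drift = -(theta - 1.0)              # E[grad L] = theta - 1
        diffusion = np.sqrt(eps / N * 9.0)  # Var[grad L] = 9 for this toy loss
        theta += drift * dt + diffusion * np.sqrt(dt) * rng.normal()
    return theta

sgd = np.array([run_sgd() for _ in range(5_000)])
sde = np.array([run_sde() for _ in range(5_000)])
print(f"SGD final theta: mean {sgd.mean():.3f}, std {sgd.std():.3f}")
print(f"SDE final theta: mean {sde.mean():.3f}, std {sde.std():.3f}")
```

The two distributions of final $\theta$ should match closely, which is all the Brownian representation claims.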

Second Idea: Drift From High-Noise To Low-Noise Regions

Sometimes, the “noise” in a Brownian system is location-dependent. As an example, let’s consider the original use-case: a grain of pollen floats in water. It’s small enough to get randomly kicked around by water molecules, so its path is Brownian (and can be seen under an ordinary microscope). If the water has a temperature gradient, then the “noise” in the pollen-grain’s path will vary with its location.

When the grain of pollen is in a higher-noise region, it will be kicked around more, and move around faster, until eventually it moves into a lower-noise region. In the lower-noise region, it will be kicked around less, and move around slower, so it takes longer to leave the region. So, the pollen grain will tend to spend more time in lower-noise regions. With a noise gradient, this tendency to spend more time in lower-noise regions becomes a tendency to drift down the noise gradient.

[Figure: Brownian motion with variance proportional to distance from the origin, and no explicit drift. Notice that the process spends most of its time near the origin. When it does move away from the origin, it tends to drift back reasonably quickly.]
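To reproduce this behavior yourself, here’s a small numpy simulation in the same spirit (the specific diffusion matrix $D(X) = |X| I$ is my stand-in for “variance proportional to distance from the origin”):

```python
import numpy as np

rng = np.random.default_rng(0)
dt, T = 0.01, 200_000
x = np.array([5.0, 5.0])          # start well away from the origin
dists = np.empty(T)

for t in range(T):
    # Diffusion matrix D(x) = |x| * I: noise grows with distance from origin.
    sigma = np.sqrt(np.linalg.norm(x))
    x = x + sigma * np.sqrt(dt) * rng.normal(size=2)
    dists[t] = np.linalg.norm(x)

print("median distance from origin:", np.median(dists))
print("fraction of time within r=1:", (dists < 1).mean())
```

Despite having no explicit drift term, the walker spends the bulk of its time in the low-noise region near the origin.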

Mathematically, if $X$ is our pollen grain’s location, we can write its motion as

$$dX = u(X) \, dt + \sqrt{D(X)} \, dW$$

… for a location-dependent “drift” $u(X)$ and “diffusion matrix” $D(X)$ (larger in higher temperature regions), where $W$ is the standard Brownian motion again. Intuitively, the “drift” pushes $X$ along the direction $u$, and the “diffusion” controls how fast $X$ spreads out along each direction (at least, that’s how we think about it for constant drift and diffusion). The probability distribution $p(X)$ of $X$ evolves over time according to:

$$\partial_t p = -\sum_i \partial_i (u_i p) + \frac{1}{2} \sum_{ij} \partial_i \partial_j (D_{ij} p)$$

(This is the Fokker-Planck equation.)

Key thing to notice: we can re-write this as

$$\partial_t p = -\sum_i \partial_i \left( \left( u_i - \frac{1}{2} \sum_j \partial_j D_{ij} \right) p \right) + \frac{1}{2} \sum_{ij} \partial_i \left( D_{ij} \, \partial_j p \right)$$

… so $-\frac{1}{2} \sum_j \partial_j D_{ij}$ acts like a drift term, just like $u_i$. This noise-induced drift is nonzero only when there’s a noise gradient, i.e. when $\sum_j \partial_j D_{ij} \neq 0$.
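(For completeness: the rewrite is just the product rule applied to the diffusion term of the Fokker-Planck equation,

$$\frac{1}{2} \sum_{ij} \partial_i \partial_j \left( D_{ij} p \right) = \frac{1}{2} \sum_{ij} \partial_i \left( \left( \partial_j D_{ij} \right) p \right) + \frac{1}{2} \sum_{ij} \partial_i \left( D_{ij} \, \partial_j p \right)$$

… after which the first piece is absorbed into the drift term.)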

A very simple example of the math: suppose $D(X) = X_1 I$ (i.e. it’s an identity matrix scaled by $X_1$, the first component of $X$). Then $-\frac{1}{2} \sum_j \partial_j D_{ij} = -\frac{1}{2} \sum_j \partial_j (X_1 \delta_{ij}) = -\frac{1}{2} \delta_{i1}$. So, the diffusion-gradient-induced drift is constant and along the $-X_1$ direction: straight toward lower noise.
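One subtlety worth checking numerically: this “drift” lives in the distribution, not in the expectation. With no explicit drift term, each step has zero expected value, so the mean of $X$ stays put, but probability mass still piles up in the low-noise region. Here’s a quick numpy check of the 1D version of the example (diffusion $D(x) = x$; the clipping at $x = 0$ is my own crude discretization of the fact that the noise vanishes there):

```python
import numpy as np

rng = np.random.default_rng(0)
dt, T, M = 0.01, 2_000, 50_000
x = np.full(M, 5.0)               # many independent walkers, all starting at 5

for _ in range(T):
    # 1D version of the example: diffusion D(x) = x, no explicit drift.
    sigma = np.sqrt(np.maximum(x, 0.0))       # noise vanishes at x = 0
    x = np.maximum(x + sigma * np.sqrt(dt) * rng.normal(size=M), 0.0)

# Each step has zero expected value, so the mean barely moves -- but mass
# accumulates near x = 0, consistent with the effective drift of -1/2.
print(f"mean:   {x.mean():.3f}   (stays near the starting point, 5)")
print(f"median: {np.median(x):.3f}   (drifts into the low-noise region at 0)")
```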

Putting It Together

So:

  • SGD’s path is approximately-Brownian, with location-dependent noise

  • Brownian motion with location-dependent noise tends to drift down the noise gradient

In SGD, our “intended” drift is $-E_x[\nabla_\theta L]$, i.e. drift down the gradient of the objective. But the location-dependent noise contributes a “bias”: a second drift term, resulting from drift down the noise-gradient. Combining the equations from the previous two sections, the noise-gradient-drift is

$$-\frac{1}{2} \sum_j \partial_{\theta_j} \left( \frac{\epsilon}{N} \mathrm{Cov}_x[\nabla_\theta L] \right)_{ij} = -\frac{\epsilon}{2N} \sum_j \partial_{\theta_j} \mathrm{Cov}_x\left[ \partial_{\theta_i} L, \partial_{\theta_j} L \right]$$
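Here’s a sketch of how one might actually compute that bias term for a model, using a toy two-parameter linear regression of my own choosing: estimate the per-sample gradient covariance by sampling, then take its divergence by finite differences. At $\theta = (0, 0)$ the resulting bias points toward the true parameters $(1, -1)$, where the residuals (and hence the gradient noise) are smallest.

```python
import numpy as np

rng = np.random.default_rng(0)
eps, N = 0.1, 8

def grads(theta, m=400_000):
    # Per-sample gradients for a toy linear regression: L = (theta . a - y)^2 / 2,
    # with true parameters (1, -1) and Gaussian observation noise.
    a = rng.normal(size=(m, 2))
    y = a @ np.array([1.0, -1.0]) + 0.5 * rng.normal(size=m)
    r = a @ theta - y                      # residuals
    return r[:, None] * a                  # grad L = residual * a, shape (m, 2)

def grad_cov(theta):
    return np.cov(grads(theta).T)          # 2x2 covariance of per-sample gradients

def noise_drift(theta, h=0.05):
    # bias_i = -(eps / 2N) * sum_j d/d(theta_j) Cov[grad L]_ij, via central differences.
    div = np.zeros(2)
    for j in range(2):
        e = np.zeros(2); e[j] = h
        div += (grad_cov(theta + e)[:, j] - grad_cov(theta - e)[:, j]) / (2 * h)
    return -eps / (2 * N) * div

print(noise_drift(np.array([0.0, 0.0])))   # roughly [+0.031, -0.031]
```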

I don’t have much theory or evidence right now for what kinds of things that bias pushes towards (other than “regions of low gradient noise”), but having an explicit formula should help investigate that sort of question. Personally, I suspect that it pushes toward models with a modular structure reflecting the modularity structure of the environment, which is the main reason I’m interested in it.