Why I stopped being into basin broadness

There was a period where everyone was really into basin broadness for measuring neural network generalization. This has mostly stopped being fashionable, but I'm not sure there's enough written up on why it didn't do much, so I thought I should give my take on why I stopped finding it attractive. It's probably a repetition of what others have found, but I might as well repeat it.

Let’s say we have a neural network $f(x, \theta)$ with weights $\theta$. We evaluate it on a dataset $\{(x_i, y_i)\}$ using a loss function $L$, to find an optimum $\theta^*$. Then there was an idea going around that the Hessian matrix (i.e. the second derivative of the loss at $\theta^*$) would tell us something about $f$ (especially about how well it generalizes).

If we number the dataset $(x_1, y_1), \ldots, (x_n, y_n)$, we can stack all the network outputs into $F(\theta) = (f(x_1, \theta), \ldots, f(x_n, \theta))$, which fits into an empirical loss $\mathcal{L}(\theta) = L(F(\theta))$. The Hessian that we talked about before is now just the Hessian of $L \circ F$. Expanding this out with the chain rule is kind of clunky since it involves some convoluted tensors that I don’t know any syntax for, but clearly it consists of two terms (there’s a small sketch of this after the list):

  • The Hessian of $L$ with a pair of the Jacobian of $F$ on each end (this can just barely be written without crazy tensors: $(\nabla_\theta F)^\top (\nabla_F^2 L) (\nabla_\theta F)$)

  • The gradient of $L$ contracted with a crazy second derivative of $F$ (index-wise, $\sum_i (\nabla_F L)_i \, \nabla^2_\theta F_i$).
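
To make the decomposition concrete, here is a minimal sketch in JAX. The tiny tanh model, the squared-error loss, and all variable names are my own illustration for a toy setup, not anything from the basin-broadness literature:

```python
# Toy check of the two-term decomposition of the Hessian of L∘F.
import jax
import jax.numpy as jnp

X = jax.random.normal(jax.random.PRNGKey(0), (5, 3))       # 5 data points, 3 features
y = jax.random.normal(jax.random.PRNGKey(1), (5,))         # targets
theta = jax.random.normal(jax.random.PRNGKey(2), (4,))     # flat weight vector

def f(theta, x):
    # toy "network": scalar output, nonlinear in the weights
    return jnp.tanh(theta[:3] @ x) * theta[3]

def F(theta):
    # stack the outputs over the whole dataset
    return jax.vmap(lambda x: f(theta, x))(X)

def L(outputs):
    # empirical loss as a function of the stacked outputs only
    return jnp.mean((outputs - y) ** 2)

loss = lambda th: L(F(th))
H_full = jax.hessian(loss)(theta)        # Hessian of L∘F, shape (|θ|, |θ|)

J = jax.jacobian(F)(theta)               # ∂F/∂θ, shape (n, |θ|)
H_L = jax.hessian(L)(F(theta))           # ∂²L/∂F², shape (n, n)
g_L = jax.grad(L)(F(theta))              # ∂L/∂F, shape (n,)
H_F = jax.hessian(F)(theta)              # ∂²F/∂θ², shape (n, |θ|, |θ|)

term1 = J.T @ H_L @ J                            # first bullet: Jacobian-sandwiched Hessian of L
term2 = jnp.einsum("i,ijk->jk", g_L, H_F)        # second bullet: gradient of L against ∂²F/∂θ²

print(jnp.allclose(H_full, term1 + term2, atol=1e-4))  # True
```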

Now, the derivatives of $L$ are “obviously boring” because they don’t really refer to the neural network weights, which is confirmed if you think about it in concrete cases: e.g. if $L$ is a squared error $\|\hat{y} - y\|^2$ or a cross-entropy, the derivatives just quantify how far the prediction $\hat{y}$ is from the target $y$. This obviously isn’t relevant for neural network generalization, except in the sense that it tells you which direction you want to generalize in.

Meanwhile, the Jacobian $\nabla_\theta F$ is incredibly strongly related to neural network generalization, because it’s literally a matrix which specifies how the neural network outputs change in response to the weights. In fact, it forms the core of the neural tangent kernel (a standard tool for modelling neural network generalization), because the NTK can be expressed as $(\nabla_\theta F)(\nabla_\theta F)^\top$.
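
The finite-width version of this is easy to compute directly. Here’s a self-contained toy sketch (same made-up model as in the sketch above), where the empirical NTK is just the Gram matrix of per-example gradients:

```python
# Empirical (finite-width) NTK of a toy scalar-output model: the Gram matrix
# of per-example gradients with respect to the weights. Model and names are
# my own illustration.
import jax
import jax.numpy as jnp

def f(theta, x):
    return jnp.tanh(theta[:3] @ x) * theta[3]

X = jax.random.normal(jax.random.PRNGKey(0), (5, 3))
theta = jax.random.normal(jax.random.PRNGKey(1), (4,))

per_example_grads = jax.vmap(lambda x: jax.grad(f)(theta, x))(X)  # rows of ∇_θF, shape (n, |θ|)
ntk = per_example_grads @ per_example_grads.T                     # shape (n, n)
# Entry (i, j) says how much a weight update that moves output i also moves output j.
print(ntk.shape)  # (5, 5)
```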

The “crazy second derivative of $F$” can I guess be understood separately for each data point $x_i$, as then it’s just the Hessian $\nabla^2_\theta f(x_i, \theta)$, i.e. it reflects how changes in the weights interact with each other when influencing $f(x_i, \theta)$. I don’t have any strong opinions on how important this matrix is, though because $\nabla_\theta F$ is so obviously important, I haven’t felt like granting it much attention.

The NTK as the network activations?

Epistemic status: speculative; I really should get around to verifying it. Really the previous part is speculative too, but I think those speculations are more theoretically well-grounded. If I’m wrong about either, please call me a dummy in the comments so I can correct it.

Let’s take the simplest case of a linear network, $f(x, \theta) = \theta^\top x$. In this case, $\nabla_\theta f(x, \theta) = x$, i.e. the Jacobian is literally just the inputs to the network. If you work out a bunch of other toy examples, the takeaway is qualitatively similar (the Jacobian is closely related to the neuron activations), though not exactly the same.
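
The linear case is easy to sanity-check numerically; a throwaway example of my own:

```python
# Sanity check for the linear case f(x, θ) = θᵀx: the gradient with respect
# to the weights is just the input.
import jax
import jax.numpy as jnp

def f_linear(theta, x):
    return theta @ x

theta = jnp.array([0.5, -1.0, 2.0])
x = jnp.array([3.0, 1.0, -2.0])

grad_theta = jax.grad(f_linear)(theta, x)   # differentiates w.r.t. the first argument
print(jnp.allclose(grad_theta, x))          # True: the Jacobian is literally the input
```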

There are of course some exceptions, e.g. $\mathrm{ReLU}(x)$ at $x < 0$ just has a zero Jacobian even though its input may be large. Exceptions this extreme are probably rare, but more commonly you could have some softmax in the network (e.g. in an attention layer) which saturates such that no gradient goes through. In that case, for e.g. interpretability, it seems like you’d often still really want to “count” this, so arguably the activations would be better than the NTK here. (I’ve been working on a modification to the NTK to better handle this case.)
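
To make the saturation point concrete, here’s a toy example of my own (not tied to any particular architecture): once the logits feeding a softmax are far apart, essentially no gradient flows through it, even though the resulting activations are as decisive as they get.

```python
# Softmax saturation: large logit gaps kill the gradient even though the
# activations themselves are far from zero.
import jax
import jax.numpy as jnp

def attention_weight(logits):
    # probability assigned to the first element by a softmax
    return jax.nn.softmax(logits)[0]

mild = jnp.array([1.0, 0.0, -1.0])
saturated = jnp.array([30.0, 0.0, -1.0])

print(jax.grad(attention_weight)(mild))       # noticeably nonzero
print(jax.grad(attention_weight)(saturated))  # ~0 everywhere: no gradient flows
```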

The NTK and the network activations have somewhat different properties, so which one I consider most relevant switches from case to case. However, my choice tends to be driven more by analytical convenience (e.g. the NTK and the network activations live in different vector spaces) than by anything else.