PSA: The conserved quantities associated with symmetries of neural network loss landscapes seem mostly boring.
If you’re like me, then after you heard that neural network loss landscapes have continuous symmetries, you thought: “Noether’s theorem says every continuous symmetry of the action corresponds to a conserved quantity, like how energy and momentum conservation are implied by translation symmetry and angular momentum conservation is implied by rotation symmetry. Similarly, if loss functions of neural networks can have continuous symmetries, these ought to be associated with quantities that stay conserved under gradient descent[1]!”
This is true. But these conserved quantities don’t seem to be insightful the way energy and momentum in physics are. They basically turn out to just be a sort of coordinate basis for the directions along which the loss is flat.
If our network has a symmetry such that there is an abstract coordinate γ along which we can vary the parameters without changing the loss, then the gradient with respect to that coordinate will be zero. So, whatever γ value we started with from random initialisation will be the value we stay at. Thus, the γ value is a “conserved quantity” under gradient descent associated with the symmetry. If the symmetry only holds for a particular solution in some region of the loss landscape rather than being globally baked into the architecture, the γ value will still be conserved under gradient descent so long as we’re inside that region.
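To spell out the reasoning in equations (a minimal sketch, assuming idealised gradient flow, i.e. an infinitesimally small learning rate): let φₛ be a one-parameter family of symmetries with L(φₛ(θ)) = L(θ) for all s, and let v(θ) = d/ds φₛ(θ) at s = 0 be its generator. Differentiating the invariance at s = 0 gives ∇L(θ)⋅v(θ) = 0. If the generator happens to be the gradient of some function Q, i.e. ∇Q = v, then under gradient flow dθ/dt = −∇L we get dQ/dt = ∇Q⋅dθ/dt = −v⋅∇L = 0, so Q is conserved. In the rescaling example below, v = (w₁, −w₂) and Q = ½(w₁² − w₂²).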
For example, let’s look at a simple global symmetry: In a ReLU network, we can scale all the weights going into a neuron by some positive constant a, and scale all the weights going out of the neuron by 1/a, without changing what the network is doing. So, if we have a neuron with one ingoing weight w₁ initialised to w₁ = 2 and one outgoing weight w₂ initialised to w₂ = 2, then the weight gradient in the direction ê₁ − ê₂ of those two weights will be zero. Meaning our network will keep having w₁ = w₂ all throughout training. If we’d started from a different initialisation, like w₁ = 2, w₂ = 1, we’d instead have zero weight gradient along the direction 2ê₁ − ê₂. So whatever hyperbola defined by w₁² − w₂² we start on, we’ll stay on it throughout training, assuming no fancy add-ons like weight decay.[2]
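Here’s a quick numerical sanity check (a toy sketch of my own, not from the post): a one-hidden-neuron ReLU network f(x) = w₂⋅ReLU(w₁⋅x), trained with plain gradient descent, where w₁² − w₂² barely moves over training.

```python
# Toy check: f(x) = w2 * relu(w1 * x), trained by gradient descent on a target it can represent.
# Under idealised gradient flow w1**2 - w2**2 is exactly conserved; with a small finite
# learning rate it should stay very close to its initial value of 2**2 - 1**2 = 3.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 1.5 * np.maximum(x, 0.0)                        # target: 1.5 * relu(x)

w1, w2 = 2.0, 1.0                                   # the second initialisation from the example above
lr = 1e-3
for step in range(5001):
    h = np.maximum(w1 * x, 0.0)                     # hidden activation
    err = w2 * h - y                                # residual of the mean-squared-error loss
    g1 = np.mean(2 * err * w2 * (w1 * x > 0) * x)   # dL/dw1
    g2 = np.mean(2 * err * h)                       # dL/dw2
    w1 -= lr * g1
    w2 -= lr * g2
    if step % 1000 == 0:
        print(step, round(w1 * w2, 4), round(w1**2 - w2**2, 6))
```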
If this doesn’t seem very insightful, I think that’s because it isn’t. It might be useful to keep in mind for bookkeeping purposes if you’re trying to do some big calculation related to learning dynamics, but it doesn’t seem to yield much insight into anything to do with model internals on the conceptual level. One could maybe hold out hope that the conserved quantities/coordinates associated with degrees of freedom in a particular solution are sometimes more interesting, but I doubt it. For the degrees of freedom we talk about here, for example, those invariants seem similar to the ones in the ReLU rescaling example above.
I’d guess this is because in physics, different starting values of conserved quantities often correspond to systems with very different behaviours, so they contain a lot of relevant information. A ball of gas with high energy and high angular momentum behaves very differently than a ball of gas with low energy and low angular momentum. Whereas adjacent neural network parameter configurations connected by some symmetry that get the same loss correspond precisely to models that behave basically the same way.
I’m writing this up so next time someone asks me about investigating this kind of thing, I’ll have something to link them to.
[1] Well, idealised gradient descent where learning rates are infinitesimally small, at least.
[2] See this paper which Micurie helpfully linked me. Also seems like a good resource in general if you find yourself needing to muck around with these invariants for some calculation.
I want to point out that there are many interesting symmetries that are non-global or data-dependent. These “non-generic” symmetries can change throughout training. Let me provide a few examples.
ReLU networks. Consider the computation involved in a single layer of a ReLU network:
y=B⋅ReLU(A⋅x+b),
or, equivalently,
yᵢ = ∑ⱼ Bᵢⱼ ⋅ ReLU(∑ₖ Aⱼₖxₖ + bⱼ).
(Maybe we’re looking at a two-layer network where x are the inputs and y are the outputs, or maybe we’re at some intermediate layer where these variables represent internal activations before and after a given layer.)
Dead neuron j′. If the preactivation (A⋅x)ⱼ′ + bⱼ′ is negative for every input x, then the ReLU will always spit out a zero at that index. This “dead” neuron introduces a new continuous symmetry, where you can set the entries of column j′ of B to arbitrary values without affecting the network’s computation (Bᵢⱼ′ × 0 = 0).
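A quick numerical illustration (my own toy sketch, not from the comment): make one neuron dead on some sample data, then overwrite the corresponding column of B and check that the outputs don’t change.

```python
# Toy check: if neuron j never fires on the data, column j of B is a free direction.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))
b = rng.normal(size=4)
B = rng.normal(size=(2, 4))
x = rng.normal(size=(100, 3))        # a batch of sample inputs

j = 2
b[j] = -100.0                        # push the preactivation at index j far below zero
pre = x @ A.T + b
assert np.all(pre[:, j] < 0)         # neuron j is dead on this data

y1 = np.maximum(pre, 0) @ B.T
B2 = B.copy()
B2[:, j] = rng.normal(size=2)        # arbitrary new column j
y2 = np.maximum(pre, 0) @ B2.T
print(np.max(np.abs(y1 - y2)))       # exactly 0.0
```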
Bypassed neuron j′. Consider the opposite: if the preactivation (A⋅x)ⱼ′ + bⱼ′ is positive for all possible inputs x, then neuron j′ will always activate, and the ReLU’s nonlinearity effectively vanishes at that index. This introduces a new continuous symmetry, where you can insert an arbitrary invertible transformation T on the subspace of bypassed neurons, between the activations and the final transformation. For the sake of clarity, assume all neurons are bypassed; then:
y = B⋅ReLU(A⋅x + b) → B⋅(A⋅x + b) = (BT)⋅((T⁻¹A)⋅x + (T⁻¹b)) = B̃⋅(Ã⋅x + b̃).
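A matching sketch for the bypassed case (again my own, and only checking the identity the equation above states, on data where every neuron fires): the layer is affine on such data, so absorbing an invertible T into the weights as (A, b, B) → (T⁻¹A, T⁻¹b, BT) gives the same outputs.

```python
# Toy check: when every neuron is active on the data, the layer acts as an affine map,
# and (A, b, B) -> (T^-1 A, T^-1 b, B T) implements the same input-output behaviour.
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 3))
b = rng.normal(size=4) + 100.0       # huge biases: preactivations always positive on this data
B = rng.normal(size=(2, 4))
x = rng.normal(size=(100, 3))

pre = x @ A.T + b
assert np.all(pre > 0)               # every neuron is bypassed on this data

T = rng.normal(size=(4, 4))          # a generic (hence invertible) transformation
Tinv = np.linalg.inv(T)
A2, b2, B2 = Tinv @ A, Tinv @ b, B @ T

y1 = np.maximum(pre, 0) @ B.T        # original layer; the ReLU is a no-op here
y2 = (x @ A2.T + b2) @ B2.T          # reparameterised affine form, as in the equation above
print(np.max(np.abs(y1 - y2)))       # tiny (pure floating-point error)
# Caveat: for the transformed weights to still bypass the ReLU themselves, T^-1 would
# also have to keep the preactivations positive on the data.
```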
Hidden polytopes. A ReLU network learns a piecewise linear approximation to a function. For ease, consider the case of learning a 1-dimensional mapping. It might look something like this:
The vertices between polytopes correspond to a set of constraints on the weights. Consider what happens when two neighboring linear pieces line up (left to right in the figure). One vertex becomes redundant (dotted line). You can now move the vertex along the shared polytope without changing the function implemented. This corresponds to a continuous transformation of your weights in some direction of weight space. Importantly, this is only true locally: as soon as the vertex reaches the next edge of the shared polytope, pushing it any further will change the function. Moving the vertex in any direction orthogonal to the polytope will also change the function.
That’s what I meant by:

If the symmetry only holds for a particular solution in some region of the loss landscape rather than being globally baked into the architecture, the γ value will still be conserved under gradient descent so long as we’re inside that region.
...
One could maybe hold out hope that the conserved quantities/coordinates associated with degrees of freedom in a particular solution are sometimes more interesting, but I doubt it. For the degrees of freedom we talk about here, for example, those invariants seem similar to the ones in the ReLU rescaling example above.
Dead neurons are a special case of 3.1.1 (low-dimensional activations) in that paper, bypassed neurons are a special case of 3.2 (synchronised non-linearities). Hidden polytopes are a mix of 3.2.2 (Jacobians spanning a low-dimensional subspace) and 3.1.1, I think. I’m a bit unsure which one because I’m not clear on what weight direction you’re imagining varying when you talk about “moving the vertex”. Since the first derivative of the function you’re approximating doesn’t actually change at this point, there are multiple ways you could do this.
Thank you. As a physicist, I wish I had an easy way to find papers which say “I tried this kind of obvious thing you might be considering and nothing interesting happened.”
Yeah I was somewhat annoyed that early SLT made such a big deal out of them. These are boring, spurious things, and another useful intuition is a rough idea (not always true, but more often than not) that “no information that requires your activation to be a ReLU and fails to work well with the approximation theorem is useful for interp”.
I recently did a deep dive into physics and SLT with PIBBSS colleague Lauren Greenspan, which I’m going to write about at some point this month. My understanding is that there is a plausibly useful type of symmetry that you can try to think about in a Noether-esque way: the symmetry of a model before it is initialized or sees any data.
Namely, in the standard physics point of view, you view a choice of weights as a field (so whatever processes happen are integrated over the prior of weight initializations in a path-integral fashion) and you view input-output examples as experimental data (they are the stuff that goes into the collider, and the behavior on a new datapoint can be thought of as a sort of “output” of the scattering experiment). The point is that the substrate on which physicists see symmetries exists before the symmetry breaking inherent in “performing the experiment”, i.e., training on any inputs or choosing any weights. Here the standard initialization assumption has orthogonal O(d) symmetry at every layer, for d the width. (To clarify: if you have some inputs x_1, .., x_n, then the probability of seeing activations y_1, .., y_n at a given layer at initialization is equal to the probability of seeing activations R(y_1), .., R(y_n) for R a rotation matrix. This means that the “vacuum” prior on tuples y_1, .., y_n, which later gets “symmetry broken” via Bayesian updating or SGD, will be invariant with respect to hitting each layer of activations with a rotation matrix R.)

If the width is big, this is a very big symmetry group, which is useful for simplifying the analysis (this is implicitly used a lot in PDLT), and I think you can also look at some Noether fields here. Of course this point of view is somewhat weak (since it’s so big-brained), but the thing I’m excited about is the possibility of applying it in a fractal fashion, where you make some coarse assumptions about your weights (that they’re “pre-learned”) that globally break symmetry, but have some local approximate symmetries. I don’t know how to see explicit Noether fields here, but it might be possible.
More insightful than what is conserved under the scaling symmetry of ReLU networks is what is not conserved: the gradient. Scaling w₁ by α (and w₂ by 1/α, so the network function is unchanged) scales ∂E/∂w₁ by 1/α and ∂E/∂w₂ by α, which means that we can obtain arbitrarily large gradient norms by simply choosing a small enough α. And in general, bad initializations can induce large imbalances in how quickly the parameters on either side of the neuron learn.
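A small sketch of this effect, reusing the toy network from above (my own illustration, not the commenter’s setup): rescale (w₁, w₂) → (a⋅w₁, w₂/a), which leaves the function unchanged, and watch the gradient norm blow up as a shrinks.

```python
# w2 * relu(w1 * x) is unchanged under (w1, w2) -> (a*w1, w2/a) for a > 0,
# but the gradient of the loss is not: its norm grows roughly like 1/a for small a.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 1.5 * np.maximum(x, 0.0)

def grads(w1, w2):
    h = np.maximum(w1 * x, 0.0)
    err = w2 * h - y
    g1 = np.mean(2 * err * w2 * (w1 * x > 0) * x)   # dL/dw1
    g2 = np.mean(2 * err * h)                       # dL/dw2
    return g1, g2

for a in [1.0, 0.1, 0.01]:
    g1, g2 = grads(2.0 * a, 1.0 / a)
    print(a, np.hypot(g1, g2))       # same function, very different gradient norms
```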
Some time ago I tried training some networks while setting these symmetries to the values that would minimize the total gradient norm, effectively trying to distribute the gradient norm as equally as possible throughout the network. This significantly accelerated learning, and allowed extremely deep (100+ layers) networks to be trained without residual layers. This isn’t that useful for modern networks, because batchnorm/layernorm seems to effectively do the same thing without depending on ReLU as the activation function.
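Here is a rough sketch of the kind of rebalancing I imagine this describes (my guess at the method, with a hypothetical helper name, not the commenter’s actual code): per hidden neuron, pick the rescaling a that equalises the in- and out-gradient norms, which also minimises their combined squared norm; for ReLU (or any positively homogeneous activation) this leaves the network function unchanged.

```python
# For one hidden neuron with in-weights w_in and out-weights w_out, the rescaling
# (w_in, w_out) -> (a * w_in, w_out / a), a > 0, preserves the function and scales
# the corresponding gradients by 1/a and a. The combined squared gradient norm
# ||g_in||^2 / a^2 + a^2 * ||g_out||^2 is minimised at a = sqrt(||g_in|| / ||g_out||),
# which also makes the two rescaled norms equal.
import numpy as np

def rebalance_neuron(w_in, w_out, g_in, g_out, eps=1e-12):
    a = np.sqrt((np.linalg.norm(g_in) + eps) / (np.linalg.norm(g_out) + eps))
    return a * w_in, w_out / a
```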
Thus, the γ value is a “conserved quantity” under gradient descent associated with the symmetry. If the symmetry only holds for a particular solution in some region of the loss landscape rather than being globally baked into the architecture, the γ value will still be conserved under gradient descent so long as we’re inside that region.
Minor detail, but this is false in practice because we are doing gradient descent with a non-zero learning rate, so there will be some diffusion between different hyperbolas in weight space as we take gradient steps of finite size.
See footnote 1.