I want to point out that there are many interesting symmetries that are non-global or data-dependent. These “non-generic” symmetries can change throughout training. Let me provide a few examples.
ReLU networks. Consider the computation involved in a single layer of a ReLU network:
$$y = B \cdot \mathrm{ReLU}(A \cdot x + b),$$
or, equivalently,
$$y_i = \sum_j B_{ij} \times \mathrm{ReLU}\left(\sum_k A_{jk} x_k + b_j\right).$$
(Maybe we’re looking at a two-layer network where $x$ is the input and $y$ is the output, or maybe we’re at some intermediate layer where these variables represent internal activations before and after a given layer.)
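As a concrete reference point, here is that layer written out in numpy (a minimal sketch of my own; the shapes and the names `A`, `b`, `B`, `x` just mirror the formulas above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: 4 inputs, 8 hidden neurons, 3 outputs.
A = rng.normal(size=(8, 4))   # first linear map
b = rng.normal(size=8)        # biases
B = rng.normal(size=(3, 8))   # second linear map

def layer(x, A, b, B):
    """y = B . ReLU(A . x + b)"""
    return B @ np.maximum(A @ x + b, 0.0)

x = rng.normal(size=4)
print(layer(x, A, b, B))
```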
Dead neuron $j'$. If the negative of one of the biases, $-b_{j'}$, is always larger than the associated preactivation $(A \cdot x)_{j'}$ (so $(A \cdot x)_{j'} + b_{j'} < 0$ for every input), then the ReLU will always spit out a zero at that index. This “dead” neuron introduces a new continuous symmetry: you can set the entries of column $j'$ of $B$ to arbitrary values without affecting the network’s computation ($B_{ij'} \times 0 = 0$).
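A quick numerical check of this (my own sketch, with the dead neuron forced by hand via a very negative bias rather than arising from training):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(8, 4))
b = rng.normal(size=8)
B = rng.normal(size=(3, 8))

j_dead = 2
b[j_dead] = -1e6   # bias so negative that no input can rescue it -> neuron j_dead is dead

def layer(x, A, b, B):
    return B @ np.maximum(A @ x + b, 0.0)

# Changing column j_dead of B is a symmetry: that column only ever multiplies a zero activation.
B_perturbed = B.copy()
B_perturbed[:, j_dead] = rng.normal(size=3) * 100.0

for _ in range(5):
    x = rng.normal(size=4)
    assert np.allclose(layer(x, A, b, B), layer(x, A, b, B_perturbed))
```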
Bypassed neuron $j'$. Consider the opposite: if $-b_{j'} < (A \cdot x)_{j'}$ for all possible inputs $x$, then neuron $j'$ will always activate, and the ReLU’s nonlinearity effectively vanishes at that index. This introduces a new continuous symmetry: you can insert an arbitrary invertible transformation $T$ on the subspace of bypassed neurons, between the activations and the final transformation. For the sake of clarity, assume all neurons are bypassed; then

$$y = B \cdot \mathrm{ReLU}(A \cdot x + b) \;\to\; B \cdot (A \cdot x + b) = (BT) \cdot \big((T^{-1}A) \cdot x + (T^{-1}b)\big) = \tilde{B} \cdot (\tilde{A} \cdot x + \tilde{b}).$$
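And a check of the bypassed case (again my own sketch: huge positive biases keep every neuron active, and $T$ is taken close to the identity so the transformed preactivations stay positive too):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(8, 4))
b = np.full(8, 100.0)          # huge positive biases -> every neuron is always active ("bypassed")
B = rng.normal(size=(3, 8))

def layer(x, A, b, B):
    return B @ np.maximum(A @ x + b, 0.0)

# An invertible T close to the identity keeps all preactivations positive,
# so the ReLU stays in its linear regime and the reparameterised network is identical.
T = np.eye(8) + 0.01 * rng.normal(size=(8, 8))
T_inv = np.linalg.inv(T)
A_t, b_t, B_t = T_inv @ A, T_inv @ b, B @ T

for _ in range(5):
    x = rng.normal(size=4)
    assert np.all(T_inv @ (A @ x + b) > 0)          # transformed neurons are still bypassed
    assert np.allclose(layer(x, A, b, B), layer(x, A_t, b_t, B_t))
```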
Hidden polytopes. A ReLU network learns a piecewise linear approximation to a function. For ease, consider the case of learning a 1-dimensional mapping. It might look something like this:
The vertices between polytopes correspond to a set of constraints on the weights. Consider what happens when two neighboring linear pieces line up (left to right in the figure). One vertex becomes redundant (the dotted line). You can now move that vertex along the shared polytope without changing the function implemented. This corresponds to a continuous transformation of your weights in some direction of weight space. Importantly, this is only true locally: as soon as the vertex reaches the next edge of the shared polytope, pushing it any further will change the function. Moving the vertex in any direction orthogonal to the polytope will also change the function.
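Here is the same geometric point in code (a sketch of my own, with an explicit list of graph vertices and `np.interp` standing in for the actual ReLU weights, since that keeps the bookkeeping obvious):

```python
import numpy as np

xs = np.linspace(-3.0, 3.0, 601)

def pwl(x, verts):
    """Piecewise-linear function through the given (x, y) vertices."""
    vx, vy = zip(*sorted(verts))
    return np.interp(x, vx, vy)

# The vertex at x=0 sits exactly on the straight segment from (-1, 0) to (1, 2): it is redundant.
base     = [(-2.0, 1.0), (-1.0, 0.0), (0.0, 1.0), (1.0, 2.0), (2.0, 1.5)]
slid     = [(-2.0, 1.0), (-1.0, 0.0), (0.5, 1.5), (1.0, 2.0), (2.0, 1.5)]  # slid along the segment
past_end = [(-2.0, 1.0), (-1.0, 0.0), (1.5, 2.5), (1.0, 2.0), (2.0, 1.5)]  # pushed past the neighbour
off_seg  = [(-2.0, 1.0), (-1.0, 0.0), (0.0, 1.3), (1.0, 2.0), (2.0, 1.5)]  # moved off the segment

print(np.allclose(pwl(xs, base), pwl(xs, slid)))      # True: function unchanged
print(np.allclose(pwl(xs, base), pwl(xs, past_end)))  # False: pushed past the next vertex
print(np.allclose(pwl(xs, base), pwl(xs, off_seg)))   # False: orthogonal move changes the function
```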
If the symmetry only holds for a particular solution in some region of the loss landscape rather than being globally baked into the architecture, the γ value will still be conserved under gradient descent so long as we’re inside that region.
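One concrete instance of that (my own sketch, reusing the dead-neuron example with a hand-rolled gradient-descent loop; I’m treating the frozen coordinates themselves as the conserved quantities tied to that symmetry, which may or may not be exactly the γ defined above): while neuron $j'$ is dead on every training input, the gradient with respect to column $j'$ of $B$ is exactly zero, so those entries never move; they would only start drifting if training left the region where the neuron is dead.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy regression data and a 4 -> 8 -> 3 ReLU net with one neuron made dead by hand.
X = rng.normal(size=(64, 4))
Y = rng.normal(size=(64, 3))
A = rng.normal(size=(8, 4))
b = rng.normal(size=8)
B = rng.normal(size=(3, 8))
j_dead = 2
b[j_dead] = -1e3                     # dead on every training input

B0 = B.copy()
lr = 1e-2
for _ in range(500):
    pre = X @ A.T + b                # (N, 8) preactivations
    H = np.maximum(pre, 0.0)         # column j_dead is all zeros
    dP = (H @ B.T - Y) / len(X)      # grad of 0.5 * mean squared error wrt the outputs
    dB = dP.T @ H                    # column j_dead of this gradient is exactly zero
    dpre = (dP @ B) * (pre > 0)      # the dead neuron also blocks gradients into A and b
    A -= lr * (dpre.T @ X)
    b -= lr * dpre.sum(axis=0)
    B -= lr * dB

print(np.array_equal(B[:, j_dead], B0[:, j_dead]))  # True: conserved while we stay in the region
print(np.allclose(B, B0))                           # False: the rest of B has moved
```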
...
One could maybe hold out hope that the conserved quantities/coordinates associated with degrees of freedom in a particular solution are sometimes more interesting, but I doubt it. For the degrees of freedom we talk about here, for example, those invariants seem similar to the ones in the ReLU rescaling example above.
Dead neurons are a special case of 3.1.1 (low-dimensional activations) in that paper, and bypassed neurons are a special case of 3.2 (synchronised non-linearities). Hidden polytopes are a mix of 3.2.2 (Jacobians spanning a low-dimensional subspace) and 3.1.1, I think. I’m a bit unsure which one, because I’m not clear on what weight direction you’re imagining varying when you talk about “moving the vertex”. Since the first derivative of the function you’re approximating doesn’t actually change at this point, there are multiple ways you could do this.