Yeah I was somewhat annoyed that early SLT made such a big deal out of them. These are boring, spurious things, and another useful intuition is the rough heuristic (not always true, but true more often than not) that “no information that requires your activation to be a ReLU and fails to work well with the approximation theorem is useful for interp”.
I recently did a deep dive into physics and SLT with PIBBSS colleague Lauren Greenspan, which I’m going to write about at some point this month. My understanding from that is that there’s a plausibly useful type of symmetry that you can try to think about in a Noether-esque way: the symmetry of a model before it has been initialized or seen any data.
Namely, in the standard physics point of view, you view a choice of weights as a field (so whatever processes happen are integrated over the prior of weight initializations in a path-integral fashion), and you view input-output examples as experimental data (so they are the stuff that goes into the collider, and the behavior on a new datapoint can be thought of as a sort of “output” of the scattering experiment). The point is that the substrate on which physicists see symmetries lives before the symmetry breaking inherent in “performing the experiment”, i.e., before training on any inputs or choosing any weights. Here the standard initialization assumption has orthogonal O(d) symmetry at every layer, for d the width. (Edited to clarify: here if you have some inputs x_1, .., x_n then the probability of seeing activations y_1, .., y_n at a given layer of width d at initialization is equal to the probability of seeing activations R(y_1), .., R(y_n) for R a rotation matrix. This means that the “vacuum” prior on tuples y_1, .., y_n, which later gets “symmetry broken” via Bayesian updating or SGD, will be invariant with respect to hitting each layer of activations with a rotation matrix R.)

If the width is big, this is a very big symmetry group, which is useful for simplifying the analysis (this is implicitly used a lot in PDLT), and I think you can also look at some Noether fields here. Of course this point of view is somewhat weak (since it’s so big-brained), but the thing I’m excited about is the possibility of applying it in a fractal fashion, where you make some coarse assumptions about your weights (that they’re “pre-learned”) that globally break symmetry, but have some local approximate symmetries. I don’t know how to see explicit Noether fields here, but it might be possible.
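To make the O(d) statement concrete, here is a minimal sketch (my notation, stated for the preactivations of a single layer under the usual i.i.d. Gaussian initialization, no bias term) of why the prior at initialization is invariant under hitting the whole layer with a rotation:

```latex
% Minimal sketch (my notation): rotation invariance of one layer at initialization.
% Assume the layer's preactivations on input activations y^{(a)} are
% z^{(a)} = W y^{(a)}, with i.i.d. entries W_{ij} ~ N(0, sigma^2 / d).
\[
  z^{(a)} = W\, y^{(a)}, \qquad W_{ij} \sim \mathcal{N}\!\Bigl(0, \tfrac{\sigma^2}{d}\Bigr) \ \text{i.i.d.}
\]
% For any fixed rotation R in O(d) we have R z^{(a)} = (R W) y^{(a)}, and RW has
% the same distribution as W, because each column of W is a spherical Gaussian in
% R^d and the columns are independent. So the joint prior over the tuple of
% preactivations is invariant under applying the same R to every one of them:
\[
  R W \overset{d}{=} W
  \;\Longrightarrow\;
  \bigl(R z^{(1)}, \dots, R z^{(n)}\bigr) \overset{d}{=} \bigl(z^{(1)}, \dots, z^{(n)}\bigr).
\]
```

The same argument runs at each layer separately, which is the sense in which the symmetry is O(d) “at every layer” and gets bigger as the width grows.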