A simple sketch of the role data structure plays in loss landscape degeneracy.
The RLCT[1] is a function of both q(x) and p(x|θ). The role of p(x|θ) is clear enough, with very intuitive examples[2] of local degeneracy arising from the structure of the parameter-function map. However, until recently, the intuitive role of q(x) really eluded me.
I think I now have some intuitive picture of how structure in q(x) influences RLCT (at least particular instances of it). Consider the following example.
Toy Example: G-invariant distribution, G-equivariant submodule
Suppose the true distribution is (1) realizable (p(⋅|θ∗)=q(⋅) for some θ∗), and (2) invariant under the action of some group G, i.e. q(x)=q(gx) for all g∈G and all x. Now, suppose that the model class is that of exponential models, i.e. p(x|θ)∝exp(⟨θ,T(x)⟩). In particular, suppose that T, the fixed feature map, is G-equivariant, i.e. ∃ρ:G→GL(ℝ^d) such that T(gx)=ρ(g)T(x).
Claim: There is a degeneracy of the form p(x|θ∗)=p(x|ρ(g)∗(θ∗)), and in particular if G is a Lie group, the rank upper bound on the RLCT decreases by (1/4)·dim G.
None of this is nontrivial. The first claim is an immediate consequence of the definitions:
p(⋅|θ∗)=q(⋅) and q(x)=q(gx) implies p(x|θ∗)=p(gx|θ∗)∀x
Then, we have the following:
p(gx|θ∗)=exp(⟨θ∗,T(gx)⟩)=exp(⟨θ∗,ρ(g)T(x)⟩)=exp(⟨ρ(g)∗(θ∗),T(x)⟩)=p(x|ρ(g)∗(θ∗)).
… and the latter claim on the RLCT follows from combining this with the rank upper bound result here: p(x|θ∗)=p(x|ρ(g)∗(θ∗)) holding for all g∈G means L(θ) is constant along the orbit {ρ(g)∗(θ∗):g∈G}, which (if the stabilizer of θ∗ is discrete) has dimension dim G, so the rank of the Hessian of L(θ) at θ∗ drops by dim G.
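Here is a minimal numerical sketch of the claim. The particular choices (G = SO(2), feature map T(x) = (x, x⊗x)) are my own illustrative assumptions, not part of the argument above; the check is only of the unnormalized identity, with the normalizer caveat noted in a comment.

```python
# Minimal numerical check of the toy claim: for a G-equivariant feature map T,
# the exponential model satisfies p(gx|θ) = p(x|ρ(g)*θ).
# Illustrative assumptions: G = SO(2), T(x) = (x, flattened outer product of x).
import numpy as np

rng = np.random.default_rng(0)

def T(x):
    # Feature map: x together with its flattened outer product.
    return np.concatenate([x, np.outer(x, x).ravel()])

def rho(R):
    # Representation of SO(2) on feature space: R acts on the linear part,
    # R ⊗ R acts on the quadratic part, so that T(Rx) = rho(R) @ T(x).
    d = R.shape[0]
    return np.block([
        [R,                    np.zeros((d, d * d))],
        [np.zeros((d * d, d)), np.kron(R, R)       ],
    ])

theta = rng.normal(size=2 + 4)   # natural parameter
x = rng.normal(size=2)           # a sample point
phi = 0.7
R = np.array([[np.cos(phi), -np.sin(phi)],
              [np.sin(phi),  np.cos(phi)]])

# Equivariance of T:
assert np.allclose(T(R @ x), rho(R) @ T(x))

# Hence <θ, T(gx)> = <ρ(g)* θ, T(x)>: the unnormalized density at the rotated
# input equals the density at the original input under the pulled-back parameter.
# (For normalized densities one also needs Z(ρ(g)*θ) = Z(θ), which holds here
# because rotations preserve the base Lebesgue measure.)
lhs = theta @ T(R @ x)
rhs = (rho(R).T @ theta) @ T(x)
print("p(gx|θ) matches p(x|ρ(g)*θ):", np.allclose(lhs, rhs))
```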
High-level idea: Emulability of input symmetry
While this model is very toy, I think the high-level idea of which it is a concrete instance is interesting. Abstracting out, the argument for how data structure influences degeneracy routes through two steps:
The true distribution has some structure / symmetry, say, q(x)=q(x+δx)∀x (with δx as a function of x, indicating some infinitesimal change; all of this is meant to be taken heuristically), which gets imparted onto p(⋅|θ∗) by realizability, i.e. p(x|θ∗)=p(x+δx|θ∗)∀x.
Emulability: At θ∗, the model can “emulate” certain classes of perturbations to certain classes of input x by instead perturbing the parameters, i.e. p(x+δx|θ∗)=p(x|θ∗+δθ).[3]
Basically, (1) realizability imparts the input symmetry to p(⋅|θ∗), and (2) emulability essentially pushes this forward to a symmetry in the parameters[4]. I think this is very interesting!
Story: Suppose I am tasked with image segmentation, but my visual cortex is perturbed by δθ, causing me to perceive colors with a slightly different hue. If instead my visual cortex weren’t perturbed but the world’s colors shifted to that hue (i.e. δx), I would hardly notice anything and would make the same predictions: p(x+δx|θ∗)=p(x|θ∗+δθ).
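A minimal sketch of this story in code, under my own assumption of a toy model: a Gaussian over x whose mean is matched after an internal rotation by an angle φ (the “perceived hue”). A small rotation of the world’s colors is then exactly emulated by shifting φ.

```python
# Sketch of emulability in the spirit of the hue-shift story.
# Illustrative model (not from the post): p(x|θ) ∝ exp(-||R_φ x - μ||²/2), θ = (φ, μ).
import numpy as np

def rot(a):
    return np.array([[np.cos(a), -np.sin(a)],
                     [np.sin(a),  np.cos(a)]])

def log_p(x, phi, mu):
    # log p(x|θ) up to a constant: Gaussian in the internally rotated coordinates R_φ x.
    # The normalizer is independent of θ because R_φ is orthogonal, so it is dropped.
    z = rot(phi) @ x - mu
    return -0.5 * z @ z

rng = np.random.default_rng(1)
x = rng.normal(size=2)           # an input ("color")
phi, mu = 0.3, rng.normal(size=2)
alpha = 0.05                     # small "hue shift" of the world: δx = (R_α - I) x

lhs = log_p(rot(alpha) @ x, phi, mu)   # perturb the input:      p(x + δx | θ)
rhs = log_p(x, phi + alpha, mu)        # perturb the parameters: p(x | θ + δθ)
print(np.isclose(lhs, rhs))            # True: the input shift is emulated by φ -> φ + α
```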
Going back to the exponential model, the most unrealistic part of it (even after taking into account that it is a toy instantiation of this high-level idea) is the fact that its symmetry is generic: p(gx|θ)=p(x|ρ(g)∗(θ)) holds for ALL θ, since the G-equivariant T is independent of θ. A more realistic model would look something like p(x|θ)∝exp(⟨θ1,T_θ2(x)⟩) with θ=(θ1,θ2), where the feature map T_θ2 also depends on θ2 and, importantly, whether T_θ2 is G-equivariant depends on the value of θ2.
Then, if p(⋅|θ∗)=p(⋅|θ′∗)=q but the θ2-component of θ∗ makes T_θ2 G-equivariant while that of θ′∗ doesn’t, the rank upper bound on the RLCT at the former is lower than at the latter (thus θ∗ would be weighted much more heavily in the Bayesian posterior).
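To make the θ2-dependence concrete, here is a minimal sketch under my own illustrative parametrization T_θ2(x)=M(θ2)x with M a 2×2 matrix: M commutes with every rotation exactly when it has the form [[a,−b],[b,a]], so only those values of θ2 make the feature map SO(2)-equivariant.

```python
# Toy version of the θ2-dependent feature map: T_{θ2}(x) = M(θ2) x.
# Only θ2 of the form [[a, -b], [b, a]] gives an SO(2)-equivariant T_{θ2}.
# (The parametrization is my own illustrative assumption.)
import numpy as np

def rot(a):
    return np.array([[np.cos(a), -np.sin(a)],
                     [np.sin(a),  np.cos(a)]])

def log_p_unnorm(x, theta1, M):
    # Unnormalized log-density of p(x | θ1, θ2) ∝ exp(<θ1, M x>).
    return theta1 @ (M @ x)

rng = np.random.default_rng(2)
x = rng.normal(size=2)
theta1 = rng.normal(size=2)
g = rot(0.9)

M_equi = np.array([[1.3, -0.4],
                   [0.4,  1.3]])       # a θ2 that makes T_{θ2} equivariant
M_generic = rng.normal(size=(2, 2))    # a generic θ2

for M in (M_equi, M_generic):
    lhs = log_p_unnorm(g @ x, theta1, M)     # <θ1, T(gx)>
    rhs = log_p_unnorm(x, g.T @ theta1, M)   # <ρ(g)* θ1, T(x)>, with ρ(g) = g
    print(np.isclose(lhs, rhs))              # True for M_equi, (generically) False otherwise
```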
This is more realistic, and I think sheds some light on why training imparts models with circuits / algorithms / internal symmetries that reflect structure in the data.
(Thanks to Dan Murfet for various related discussions.)
Very brief SLT context: In SLT, the main quantity of interest is the RLCT (real log canonical threshold), which, broadly speaking, is a measure of the degeneracy of the most degenerate point among the optimal parameters. We care about this because it directly controls the asymptotics of the Bayesian posterior. We also often care about its localized version, where we restrict the parameter space W to an infinitesimal neighborhood (germ) of a particular optimal parameter whose degeneracy we want to measure.
RLCT is a particular invariant of the average log likelihood function L(θ)=∫q(x)logp(x|θ)dx, meaning it is a function of the true distribution q(x) and the parametric model p(x|θ) (the choice of the prior φ(θ) doesn’t matter under reasonable regularity conditions).
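As a one-dimensional toy illustration of why this exponent matters (my own stand-in loss K(θ)=L(θ∗)−L(θ), not anything from the post): the marginal likelihood Z_n=∫exp(−nK(θ))dθ scales like n^(−λ), and the sketch below recovers λ=1/2 for a regular minimum K=θ² versus λ=1/4 for the degenerate minimum K=θ⁴.

```python
# Estimate the exponent λ in Z_n ∝ n^{-λ} for two toy 1-d losses:
# K(θ) = θ²  ->  λ = 1/2  (regular minimum)
# K(θ) = θ⁴  ->  λ = 1/4  (degenerate minimum)
import numpy as np

theta = np.linspace(-1.0, 1.0, 200_001)
dtheta = theta[1] - theta[0]

def log_Z(K, n):
    # Riemann-sum approximation of log ∫ exp(-n K(θ)) dθ over [-1, 1].
    return np.log(np.sum(np.exp(-n * K(theta))) * dtheta)

for name, K in [("theta^2", lambda t: t**2), ("theta^4", lambda t: t**4)]:
    n1, n2 = 1e3, 1e4
    # λ is (minus) the slope of log Z_n against log n.
    lam = -(log_Z(K, n2) - log_Z(K, n1)) / (np.log(n2) - np.log(n1))
    print(f"K(theta) = {name}: estimated lambda ≈ {lam:.3f}")
```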
Given a two-layer feedforward network with ReLU activations, multiplying the first layer by α>0 and dividing the next by α implements the same function. There are many other examples, including non-generic degeneracies which occur only at particular weight values, unlike the constant multiplication degeneracy, which occurs at every θ; more examples in Liam Carroll’s thesis.
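A quick numerical check of this rescaling degeneracy (network sizes and data are arbitrary illustrative choices; the first layer’s bias is rescaled along with its weights):

```python
# Rescaling degeneracy of a two-layer ReLU network: scaling layer 1 by α > 0
# and layer 2 by 1/α leaves the implemented function unchanged.
import numpy as np

rng = np.random.default_rng(3)
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
W2, b2 = rng.normal(size=(3, 8)), rng.normal(size=3)

def f(x, W1, b1, W2, b2):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

x = rng.normal(size=4)
alpha = 2.7  # any α > 0; positive homogeneity of ReLU is what makes this work

out = f(x, W1, b1, W2, b2)
out_rescaled = f(x, alpha * W1, alpha * b1, W2 / alpha, b2)
print(np.allclose(out, out_rescaled))  # True: a one-parameter flat direction per hidden layer
```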
This reminds me of the notion of data-program equivalence (programs-as-data, Gödel numbering, UTM). Perhaps some infinitesimal version of it?
Let the input-side symmetry be trivial (i.e. δx=0), and we recover degeneracies originating from the structure of the parameter-function map alone as a special case.
In my experience, the main issue with this kind of thing is finding really central examples of symmetries in the input that are emulatable. There’s a couple easy ones, like low rank[1] structure, but I never really managed to get a good argument for why generic symmetries in the data would often be emulatable[2] in real life.[3]
You might want to chat with Owen Lewis about this. He’s been thinking about connections between input symmetries and mechanistic structure for a while, and was interested in figuring out some kind of general correspondence between input symmetries and parameter symmetries.
If q(x) only depends on a low-rank subspace of the inputs x, there will usually[4] be degrees of freedom in the weights that connect to those inputs. The same is true of the hidden activations: if they’re low rank, we get a corresponding number of free weights. See e.g. section 3.1.2 here.
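A minimal sketch of this (dimensions and network are my own illustrative assumptions): when all inputs lie in an r-dimensional subspace of ℝ^d, first-layer weight perturbations orthogonal to that subspace leave every activation, and hence the loss, unchanged.

```python
# Low-rank inputs give free weight directions: perturbing the first layer
# orthogonally to the input subspace does not change any output.
import numpy as np

rng = np.random.default_rng(4)
d, r, h, n = 10, 3, 5, 100

# Inputs supported on a random r-dimensional subspace of R^d.
basis = np.linalg.qr(rng.normal(size=(d, r)))[0]   # d x r, orthonormal columns
X = rng.normal(size=(n, r)) @ basis.T              # n x d inputs, rank r

W1 = rng.normal(size=(h, d))
W2 = rng.normal(size=(1, h))

def outputs(W1):
    # Two-layer ReLU network applied to all inputs.
    return np.maximum(X @ W1.T, 0.0) @ W2.T

# Perturb W1 only in directions orthogonal to the input subspace.
P_orth = np.eye(d) - basis @ basis.T               # projector onto the orthogonal complement
dW1 = rng.normal(size=(h, d)) @ P_orth

print(np.allclose(outputs(W1), outputs(W1 + dW1)))  # True: these weight directions are "free"
```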
Good name for this concept by the way, thanks.
For a while I was hoping that almost any kind of input symmetry would tend to correspond to low-rank structure in the hidden representations of p(x|θ∗), if p(.) has the sort of architecture used by modern neural networks. Then, almost any kind of symmetry would be reducible to the low-rank structure case[2], and hence almost any symmetry would be emulatable.
But I never managed to show this, and I no longer think it is true.
There are a couple of necessary conditions for this of course. E.g. the architecture p(.) needs to actually use weight matrices, like neural networks do.
Right, I expect emulability to be a specific condition enabled by a particular class of algorithms that a NN might implement, rather than a generic one that is satisfied by almost all weights of a given NN architecture[1]. Glad to hear that you’ve thought about this before, I’ve also been trying to find a more general setting to formalize this argument beyond the toy exponential model.
Other related thoughts[2]:
Maybe this can help decompose the LLC into finer quantities based on where the degeneracy arises from: e.g. a given critical point’s LLC might come solely from the degeneracy in the parameter-function map, some from one of the multiple groups that the true distribution is invariant under at order r, others from an interaction of several groups, etc. (a sort of Möbius-like inversion).
And perhaps it’s possible to distinguish / measure these LLC components experimentally by measuring how the LLC changes as you perturb the true distribution q(x) by introducing new symmetries or destroying existing ones (susceptibilities-style).
This is more about how I conceptually think they should be (since my motivation is to use their non-genericity to argue why certain algorithms should be favored over others), and there are probably interesting exceptions of symmetries that are generically emulatable due to properties of the NN architecture (eg depth).
Some of these ideas were motivated by a conversation with Fernando Rosas.