A simple sketch of the role data structure plays in loss landscape degeneracy.
The RLCT[1] is a function of both q(x) and p(x|θ). The role of p(x|θ) is clear enough, with very intuitive examples[2] of local degeneracy arising from the structure of the parameter-function map. However, until recently, the intuitive role of q(x) really eluded me.
I think I now have some intuitive picture of how structure in q(x) influences RLCT (at least particular instances of it). Consider the following example.
Toy Example: G-invariant distribution, G-equivariant submodule
Suppose the true distribution is (1) realizable (p(⋅|θ∗)=q(⋅) for some θ∗), and (2) invariant under the action of a group G, i.e. q(gx)=q(x) for all x and all g∈G. Now, suppose that the model class is that of exponential models, i.e. p(x|θ)∝exp(⟨θ,T(x)⟩). In particular, suppose that T, the fixed feature map, is G-equivariant, i.e. ∃ρ:G→GL(ℝ^d) such that T(gx)=ρ(g)T(x).
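To make this concrete, here is a minimal numpy sketch of one possible instantiation (my own choice of specifics, not implied by the setup above): X = Z_n with G = Z_n acting by cyclic shifts, one-hot features T(x) = e_x, and ρ(g) the corresponding permutation matrices. It just checks the three assumptions: equivariance of T, invariance of q, and realizability.

```python
import numpy as np

n = 5                                    # X = Z_n = {0, 1, ..., n-1}

def act(g, x):                           # action of G = Z_n on X: cyclic shift
    return (x + g) % n

def T(x):                                # fixed feature map T: X -> R^n (one-hot)
    e = np.zeros(n)
    e[x] = 1.0
    return e

def rho(g):                              # representation rho: G -> GL(R^n): rho(g) e_x = e_{g·x}
    return np.array([T(act(g, x)) for x in range(n)]).T

def p(theta):                            # exponential model p(x|theta) ∝ exp(<theta, T(x)>)
    logits = np.array([theta @ T(x) for x in range(n)])
    w = np.exp(logits - logits.max())
    return w / w.sum()

# (1) T is G-equivariant: T(g·x) = rho(g) T(x)
assert all(np.allclose(T(act(g, x)), rho(g) @ T(x)) for g in range(n) for x in range(n))

# (2) the true distribution q is G-invariant (uniform is the simplest choice here)
q = np.full(n, 1.0 / n)
assert all(np.allclose(q[[act(g, x) for x in range(n)]], q) for g in range(n))

# (3) q is realizable: p(·|theta*) = q, e.g. with theta* = 0
theta_star = np.zeros(n)
assert np.allclose(p(theta_star), q)
```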
Claim: There is a degeneracy of the form p(x|θ∗)=p(x|ρ(g)∗(θ∗)) for every g∈G, and in particular, if G is a Lie group, the rank upper bound on the RLCT decreases by (1/4)·dim G.
None of this is nontrivial. The first claim is an immediate consequence of the definitions:
p(⋅|θ∗)=q(⋅) and q(gx)=q(x) together imply p(x|θ∗)=p(gx|θ∗) ∀x
Then, we have the following:
p(gx|θ∗) ∝ exp(⟨θ∗,T(gx)⟩) = exp(⟨θ∗,ρ(g)T(x)⟩) = exp(⟨ρ(g)∗(θ∗),T(x)⟩) ∝ p(x|ρ(g)∗(θ∗)), where ρ(g)∗ denotes the adjoint of ρ(g). The normalizing constants on both sides agree whenever the G-action preserves the base measure on X, so the densities are in fact equal.
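A quick self-contained numerical check of this chain in the same Z_n instantiation, reading ρ(g)∗ as the adjoint (here: the transpose of the permutation matrix). Note that it passes for an arbitrary θ, not just θ∗, since T is fixed:

```python
import numpy as np

rng = np.random.default_rng(0)
n, g = 5, 2
P = np.roll(np.eye(n), g, axis=0)       # rho(g) for the shift x -> x+g:  P e_x = e_{x+g}

def p(theta):                           # p(x|theta) ∝ exp(<theta, T(x)>) with one-hot T, i.e. softmax(theta)
    w = np.exp(theta - theta.max())
    return w / w.sum()

theta = rng.normal(size=n)              # any theta works here, since T is independent of theta
lhs = p(theta)[(np.arange(n) + g) % n]  # x -> p(g·x | theta)
rhs = p(P.T @ theta)                    # x -> p(x | rho(g)* theta), with rho(g)* = rho(g)^T
assert np.allclose(lhs, rhs)            # p(gx|θ) = p(x|ρ(g)∗θ)
```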
… and the latter claim about the RLCT is a consequence of p(x|θ∗)=p(x|ρ(g)∗(θ∗)) reducing the rank of the Hessian of L(θ) at θ∗ by dim G (more precisely, by the dimension of the orbit of θ∗ under the dual action, along which L is constant), together with the rank upper bound result here.
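The discrete sketch above has dim G = 0, so here is a separate sketch for the Lie-group statement, again with specifics of my own choosing: G = SO(2) acting on the circle by rotation, and deliberately redundant Fourier features (two copies of (cos, sin)) so that the dual orbit of θ∗ is not just a point. It checks that the population loss is constant along the orbit of θ∗ and that the orbit tangent lies in the null space of the Hessian at θ∗.

```python
import numpy as np

# X = S^1 with the uniform base measure; G = SO(2) acts by rotation x -> x + alpha.
xs = np.linspace(0.0, 2 * np.pi, 4000, endpoint=False)
dx = xs[1] - xs[0]

def T(x):                                    # deliberately redundant features: two copies of (cos, sin)
    return np.stack([np.cos(x), np.sin(x), np.cos(x), np.sin(x)], axis=-1)

def rho(a):                                  # rho(alpha) = blockdiag(R(alpha), R(alpha)), so T(x+a) = rho(a) T(x)
    R = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
    out = np.zeros((4, 4))
    out[:2, :2], out[2:, 2:] = R, R
    return out

def density(theta):                          # p(x|theta) ∝ exp(<theta, T(x)>) on the circle
    w = np.exp(T(xs) @ theta)
    return w / (w.sum() * dx)

q = np.full_like(xs, 1.0 / (2 * np.pi))      # uniform q: invariant under rotations
theta_star = np.array([1.0, 0.0, -1.0, 0.0]) # <theta*, T(x)> = 0, so p(.|theta*) = q (realizability)
assert np.allclose(density(theta_star), q)

def K(theta):                                # KL(q || p_theta); differs from -L(theta) by a constant, same Hessian
    return np.sum(q * np.log(q / density(theta))) * dx

# K vanishes along the whole dual orbit alpha -> rho(alpha)* theta*
for a in [0.1, 0.7, 2.0]:
    assert abs(K(rho(a).T @ theta_star)) < 1e-8

# Hessian of K at theta*: for an exponential family this is Cov_{p_theta*}[T], and p_theta* = q here
feats = T(xs)
mean = (q * dx) @ feats
H = ((feats - mean).T * (q * dx)) @ (feats - mean)

xi = (rho(1e-6).T @ theta_star - theta_star) / 1e-6   # tangent of the dual orbit at theta*
print(np.linalg.matrix_rank(H, tol=1e-8))             # 2 < 4: the Hessian is rank-deficient
print(np.linalg.norm(H @ xi))                         # ≈ 0: the orbit direction is flat at theta*
```

In this particular instantiation the 2-dimensional null space (the orbit direction plus a redundancy direction from the duplicated features) is in fact present at every θ, not only at θ∗, since T is fixed; that is exactly the genericity issue discussed below.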
High-level idea: Emulatability of input symmetry
While this model is very toy, I think the high-level idea that it is a concrete model of is interesting. Abstracting out, the argument for how data structure influences degeneracy routes through two steps:
1. The true distribution has some structure / symmetry, say, q(x)=q(x+δx) ∀x (with δx as a function of x, indicating some infinitesimal change; all of this is meant to be taken heuristically), which gets imparted onto p(⋅|θ∗) by realizability, i.e. p(x|θ∗)=p(x+δx|θ∗) ∀x.
2. Emulatability: At θ∗, the model can “emulate” certain classes of perturbations to certain classes of input x by instead perturbing the parameters, i.e. p(x+δx|θ∗)=p(x|θ∗+δθ).[3]
Basically, (1) realizability imparts the input symmetry to p(⋅|θ∗), and (2) emulatability essentially pushes this forward to a symmetry in the parameters[4]. I think this is very interesting!
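As a minimal sketch of step (2) in isolation, in the simplest hypothetical setting I can think of (a one-dimensional Gaussian location model, not something discussed above): shifting the input by δx is emulated exactly by shifting the parameter by δθ = −δx.

```python
import numpy as np

def p(x, theta, sigma=1.0):
    # Gaussian location model: p(x|theta) = N(x; theta, sigma^2)
    return np.exp(-0.5 * ((x - theta) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x, theta_star = 0.3, 1.2
dx = 1e-3                # a small input perturbation delta-x
dtheta = -dx             # the parameter perturbation that emulates it

# emulatability: p(x + dx | theta*) = p(x | theta* + dtheta), exactly (up to float rounding)
assert np.isclose(p(x + dx, theta_star), p(x, theta_star + dtheta))
```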
Story: Suppose I am tasked with image segmentation, but my visual cortex is perturbed by δθ, causing me to perceive colors with a slightly different hue. If instead my visual cortex were unperturbed but the world’s colors shifted to that hue (i.e. δx), I would hardly notice anything and would make the same predictions: p(x+δx|θ∗)=p(x|θ∗+δθ).
Going back to the exponential model, its most unrealistic aspect (even after taking into account that it is a toy instantiation of this high-level idea) is the fact that its symmetry is generic: p(gx|θ)=p(x|ρ(g)∗(θ)) holds for ALL θ, since the G-equivariant T is independent of θ. A more realistic model would look something like p(x|θ1,θ2)∝exp(⟨θ1,T_θ2(x)⟩), where the feature map T_θ2 also depends on θ2 and, importantly, whether T_θ2 is G-equivariant depends on the value of θ2.
Then, if p(⋅|θ∗)=p(⋅|θ′∗)=q(⋅) but θ∗ (through its θ2-component) makes T_θ2 G-equivariant while θ′∗ doesn’t, then the rank upper bound on the RLCT at the former is lower than at the latter (thus θ∗ would be weighted much more heavily in the Bayesian posterior).
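Here is a small sketch of this, with a construction of my own: on X = Z_n with the shift action, let θ2 be an n×n matrix and T_θ2(x) = θ2·e_x. Then T_θ2 is shift-equivariant exactly when θ2 commutes with the shift matrix (i.e. is circulant), and the degeneracy p(gx|θ1,θ2)=p(x|ρ(g)∗(θ1),θ2) is present only for such θ2:

```python
import numpy as np

rng = np.random.default_rng(0)
n, g = 5, 1
P = np.roll(np.eye(n), g, axis=0)           # rho(g) for the shift x -> x+1:  P e_x = e_{x+1}

def p(theta1, theta2):                      # p(x|theta1,theta2) ∝ exp(<theta1, T_theta2(x)>), T_theta2(x) = theta2 e_x
    logits = theta1 @ theta2                # logits[x] = <theta1, theta2 e_x>
    w = np.exp(logits - logits.max())
    return w / w.sum()

def symmetry_gap(theta1, theta2):           # how badly p(gx|theta) = p(x|rho(g)* theta1, theta2) fails
    lhs = p(theta1, theta2)[(np.arange(n) + g) % n]    # x -> p(g·x | theta1, theta2)
    rhs = p(P.T @ theta1, theta2)                      # x -> p(x | rho(g)* theta1, theta2)
    return np.abs(lhs - rhs).max()

theta1 = rng.normal(size=n)

# a theta2 for which T_theta2 IS equivariant: a circulant matrix (commutes with P)
c = rng.normal(size=n)
theta2_equiv = np.stack([np.roll(c, x) for x in range(n)], axis=1)
assert np.allclose(theta2_equiv @ P, P @ theta2_equiv)

# a theta2 for which T_theta2 is NOT equivariant: a generic matrix
theta2_generic = rng.normal(size=(n, n))

print(symmetry_gap(theta1, theta2_equiv))    # ~0: the degeneracy is present at this theta2
print(symmetry_gap(theta1, theta2_generic))  # generically nonzero: the degeneracy is absent
```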
This is more realistic, and I think it sheds some light on why training endows models with circuits / algorithms / internal symmetries that reflect structure in the data.
(Thanks to Dan Murfet for various related discussions.)
[1] Very brief SLT context: In SLT, the main quantity of interest is the RLCT, which, broadly speaking, is a measure of the degeneracy of the most degenerate point among the optimal parameters. We care about this because it directly controls the asymptotics of the Bayesian posterior. We also often care about its localized version, where we restrict the parameter space W to an infinitesimal neighborhood (germ) of a particular optimal parameter whose degeneracy we want to measure.
The RLCT is a particular invariant of the average log likelihood function L(θ)=∫ q(x) log p(x|θ) dx, meaning it is a function of the true distribution q(x) and the parametric model p(x|θ) (the choice of the prior φ(θ) doesn’t matter under reasonable regularity conditions).
[2] Given a two-layer feedforward network with ReLU activations, multiplying the first layer by α>0 and dividing the next by α implements the same function. There are many other examples, including non-generic degeneracies which occur only at particular weight values, unlike the constant-multiplication degeneracy, which occurs at every θ; more examples in Liam Carroll’s thesis.
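For concreteness, a quick numerical check of this rescaling degeneracy with arbitrary weights (any α > 0 works, since relu(αz) = α·relu(z)):

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)

def f(x, W1, b1, W2, b2):
    # two-layer feedforward ReLU network
    return W2 @ relu(W1 @ x + b1) + b2

W1, b1 = rng.normal(size=(8, 3)), rng.normal(size=8)
W2, b2 = rng.normal(size=(2, 8)), rng.normal(size=2)
x = rng.normal(size=3)

alpha = 3.7   # scale the first layer (weights and biases) up, the second layer down
assert np.allclose(f(x, W1, b1, W2, b2),
                   f(x, alpha * W1, alpha * b1, W2 / alpha, b2))
```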
[3] This reminds me of the notion of data-program equivalence (programs-as-data, Gödel numbering, UTM). Perhaps some infinitesimal version of it?
[4] Let the input-side symmetry be trivial (i.e. δx=0), and we recover degeneracies originating from the structure of the parameter-function map alone as a special case.