Really love the post overall, thank you for putting this together! There’s a whole lot I could say here, but for now I’ll just be annoying and go off on a tangent which I wrote up before realizing just how tangential it is. I think it might still be a useful clarification for some people, so I’ll post it anyway.
“In some sense the main geometric object of SLT is the Fisher information matrix F, which is the negative expected Hessian of the log-likelihood (that is, the Hessian of the population loss).”
I’m sure you’re aware of this and were just presenting a simplified explanation, but for anyone who might take this too much at face value, the Fisher information matrix (FIM) is not the negative expected Hessian of the log-likelihood in general. The FIM is defined as:
$$I(\theta) = \mathbb{E}_{x \sim p(x|\theta)}\left[\nabla \log p(x|\theta) \, \nabla \log p(x|\theta)^\top\right],$$
whereas the Hessian is defined as
$$H(\theta) = \mathbb{E}_{x \sim q(x)}\left[\nabla^2 \log p(x|\theta)\right].$$
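Under the usual regularity conditions, the outer-product form of the FIM equals $-\mathbb{E}_{x \sim p(x|\theta)}[\nabla^2 \log p(x|\theta)]$, so the real difference is whether the expectation is taken over the true distribution $q(x)$ or the current model distribution $p(x|\theta)$. The FIM and the negative Hessian $-H(\theta)$ are guaranteed to coincide only when (1) the data-generating distribution $q(x)$ is realizable by a true parameter $\theta^*$ such that $q(x) = p(x|\theta^*)$, and (2) we’re evaluating the Hessian at $\theta^*$.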
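To make the distinction concrete, here is a minimal sketch using a Bernoulli model parameterized by its mean. The model choice, the function names, and the specific values 0.7 and 0.3 are my own illustrative assumptions, not anything from the post. The point is just that the FIM and the negative expected Hessian agree at the true parameter and disagree away from it.

```python
import math

def score(x, theta):
    """d/dtheta of log p(x|theta) for a Bernoulli with mean theta."""
    return x / theta - (1 - x) / (1 - theta)

def hess_logp(x, theta):
    """d^2/dtheta^2 of log p(x|theta) for the same model."""
    return -x / theta**2 - (1 - x) / (1 - theta)**2

def fisher(theta):
    """FIM: expected squared score under the MODEL distribution p(x|theta)."""
    return theta * score(1, theta)**2 + (1 - theta) * score(0, theta)**2

def expected_hessian(theta, theta_star):
    """H: expected log-likelihood Hessian under the TRUE distribution
    q = Bernoulli(theta_star), evaluated at the parameter theta."""
    return theta_star * hess_logp(1, theta) + (1 - theta_star) * hess_logp(0, theta)

theta_star = 0.7  # assumed true parameter (illustrative)

# At theta = theta_star the two quantities agree (up to floating point).
at_truth = (fisher(theta_star), -expected_hessian(theta_star, theta_star))

# Away from theta_star they generally do not.
off_truth = (fisher(0.3), -expected_hessian(0.3, theta_star))

print("at theta*:  FIM, -H =", at_truth)
print("off theta*: FIM, -H =", off_truth)
```

Note that in an exponential family written in natural parameters the log-likelihood Hessian does not depend on x, so the two would agree everywhere; the mean parameterization is used here precisely so that the gap shows up.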
I mention this only because the FIM is often misunderstood. See for instance this paper detailing various common conflations people make here.
How do I think of the FIM, then? A more geometric way to understand the FIM is as follows. Consider the Hellinger embedding, which maps parameters to square-root probability densities:
$$\theta \mapsto \sqrt{p(x|\theta)}$$
The target space is (a subset of) $L^2$, which has a flat Euclidean inner product $\langle f, g \rangle = \int f(x) g(x) \, dx$. Let $J$ be the Jacobian of this map. Then the FIM is simply:
$$I(\theta) = 4 J^\top J$$
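For anyone who wants the missing step, here is a short derivation sketch of why the FIM equals four times J-transpose-J. It assumes enough regularity to differentiate the square root and that the density is positive where it matters; the shorthand $\partial_i = \partial / \partial \theta_i$ is my own notation.

```latex
\begin{align*}
\partial_i \sqrt{p(x|\theta)}
  &= \frac{\partial_i p(x|\theta)}{2\sqrt{p(x|\theta)}}
   = \tfrac{1}{2} \sqrt{p(x|\theta)} \, \partial_i \log p(x|\theta), \\
(J^\top J)_{ij}
  &= \int \partial_i \sqrt{p(x|\theta)} \; \partial_j \sqrt{p(x|\theta)} \, dx \\
  &= \tfrac{1}{4} \int p(x|\theta) \, \partial_i \log p(x|\theta) \, \partial_j \log p(x|\theta) \, dx
   = \tfrac{1}{4} I(\theta)_{ij}.
\end{align*}
```

The factor of 4 is just the square of the 1/2 picked up when differentiating the square root.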
This is clarifying because it makes precise the story the post is telling about some parameter directions being “sloppier” and others “more sensitive” in how they affect the model. The key fact is that the eigenvalues of the FIM $I(\theta) = 4 J^\top J$ are (four times) the squared singular values of the Jacobian $J$.
Large eigenvalues of the FIM $I(\theta)$ thus correspond to parameter directions that move the distribution significantly in $L^2$; small eigenvalues correspond to directions along which the distribution barely changes. The kernel of $I(\theta)$ consists of directions that are, to first order, completely invisible to the model: the parameters change but the distribution does not.
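To see the kernel show up in practice, here is a minimal sketch with a deliberately redundant toy model of my own invention (not from the post): a Bernoulli whose success probability is sigmoid(a + b), so only the sum a + b matters. The helper names and the finite-difference Jacobian are assumptions made for illustration. It builds the FIM from the square-root embedding, checks it against the score-based definition, and confirms that one eigenvalue is (numerically) zero.

```python
import math

def probs(theta):
    """Hypothetical toy model: Bernoulli with p(x=1) = sigmoid(a + b)."""
    a, b = theta
    s = 1.0 / (1.0 + math.exp(-(a + b)))
    return [1.0 - s, s]  # [p(x=0), p(x=1)]

def hellinger_embedding(theta):
    """Map parameters to the vector of square-root probabilities."""
    return [math.sqrt(p) for p in probs(theta)]

def jacobian(theta, eps=1e-6):
    """Central finite differences; J[k][i] = d sqrt(p(x_k)) / d theta_i."""
    cols = []
    for i in range(len(theta)):
        up = list(theta); up[i] += eps
        dn = list(theta); dn[i] -= eps
        f_up, f_dn = hellinger_embedding(up), hellinger_embedding(dn)
        cols.append([(u - d) / (2 * eps) for u, d in zip(f_up, f_dn)])
    # Transpose so rows index outcomes and columns index parameters.
    return [list(row) for row in zip(*cols)]

def fim_from_embedding(theta):
    """FIM computed as 4 * J^T J."""
    J = jacobian(theta)
    n = len(theta)
    return [[4 * sum(J[k][i] * J[k][j] for k in range(len(J)))
             for j in range(n)] for i in range(n)]

def fim_from_scores(theta):
    """FIM computed as the expected outer product of the score under the model."""
    p = probs(theta)
    s = p[1]
    # d/da and d/db of log p(x|theta) are both equal to x - s.
    scores = [[-s, -s], [1 - s, 1 - s]]
    return [[sum(p[x] * scores[x][i] * scores[x][j] for x in range(2))
             for j in range(2)] for i in range(2)]

def sym2x2_eigenvalues(M):
    """Eigenvalues (ascending) of a symmetric 2x2 matrix, in closed form."""
    tr = M[0][0] + M[1][1]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    disc = math.sqrt(max(tr * tr / 4 - det, 0.0))
    return (tr / 2 - disc, tr / 2 + disc)

theta = [0.4, -1.1]  # arbitrary illustrative parameter values
F_embed = fim_from_embedding(theta)
F_score = fim_from_scores(theta)
eigs = sym2x2_eigenvalues(F_embed)

print("FIM via 4 J^T J:", F_embed)
print("FIM via scores: ", F_score)
print("eigenvalues:    ", eigs)
```

The near-zero eigenvalue belongs to the direction (1, -1): moving along it changes a and b but leaves a + b, and hence the distribution, unchanged. In this toy case the invisibility is exact, not just first-order.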