Let V(ϵ) be the volume of a behavioral region at cutoff ϵ. Your behavioral LLC at finite noise scale is λ(ϵ) := d log V / d log ϵ, which is invariant under rescaling V by a constant. This information about the overall scale of V seems important. What’s the reason for throwing it out in SLT?
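To make the invariance concrete, here is a minimal sketch that estimates λ(ϵ) by a finite difference in log space. The power-law toy volume V(ϵ) = c·ϵ^λ and the names `local_exponent` and `make_volume` are illustrative assumptions, not anything taken from SLT tooling.

```python
import math

def local_exponent(volume, eps, h=1e-4):
    """Central finite-difference estimate of d log V / d log eps at eps."""
    lo = math.log(volume(eps * math.exp(-h)))
    hi = math.log(volume(eps * math.exp(h)))
    return (hi - lo) / (2 * h)

def make_volume(c, lam):
    """Hypothetical toy volume V(eps) = c * eps**lam (an assumption for illustration)."""
    return lambda eps: c * eps ** lam

small = make_volume(1.0, 3.0)   # prefactor 1
large = make_volume(1e6, 3.0)   # same exponent, prefactor a million times larger

exp_small = local_exponent(small, 0.01)
exp_large = local_exponent(large, 0.01)
print(exp_small, exp_large)     # both estimates recover the exponent; the prefactor drops out
```

The prefactor c only shifts log V by a constant, so it vanishes from the derivative. That is exactly the information the question is asking about.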
Because it’s actually not very important in the limit. The dimensionality of V is what matters. A 3-dimensional sphere in the loss landscape always takes up more of the prior than a 2-dimensional circle, no matter how large the area of the circle is and how small the volume of the sphere is.
In real life, parameters are finite precision floats, and so this tends to work out to an exponential rather than infinite size advantage. So constant prefactors can matter in principle. But they have to be really really big.
I am not sure I agree :)
It is unimportant in the limit of infinite data, but away from that limit the prefactor is only suppressed by a factor of 1/log(n), where n is the number of data points. That seems weak enough to be beatable in practice in some circumstances.
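A rough sketch of what "beatable" could mean, under assumptions that are mine rather than the thread's: two regions with the same minimum loss, each with posterior mass of the leading-order form C·n^(-λ), and log log n corrections ignored. The numbers and the names `log_mass` and `crossover_n` are hypothetical.

```python
import math

def log_mass(log_c, lam, n):
    """Assumed leading-order log posterior mass of a region: log C - lam * log n."""
    return log_c - lam * math.log(n)

def crossover_n(log_c_low, lam_low, log_c_high, lam_high):
    """Sample size at which the lower-lambda region overtakes the higher-lambda one.

    Solves log_c_low - lam_low * log n = log_c_high - lam_high * log n for n.
    Requires lam_low < lam_high.
    """
    return math.exp((log_c_high - log_c_low) / (lam_high - lam_low))

# Hypothetical numbers: region B has a slightly larger exponent (worse in the limit)
# but a prefactor e^10 times larger than region A.
LOG_C_A, LAM_A = 0.0, 150.0
LOG_C_B, LAM_B = 10.0, 150.5

n_star = crossover_n(LOG_C_A, LAM_A, LOG_C_B, LAM_B)
print(f"crossover at n = {n_star:.3g}")
```

With these made-up values the crossover sits at n = e^20, a few hundred million samples. Below that, the region with the larger prefactor holds more mass even though it loses asymptotically.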
The spectra of things like Hessians tend to be singular, yes, but also sort of power-law. This makes the dimensionality a bit fuzzy and (imo) makes it possible for absolute volume scale of basins to compete with dimensionality.
Essentially: it’s not clear that a 301-dimensional sphere really is “bigger” than a 300-dimensional sphere, if the 300-dimensional sphere has a much larger radius. (Obviously it’s true in a strict sense, but hopefully you know what I’m gesturing at here.)
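One way to put numbers on the sphere comparison, as a sketch under stated assumptions: treat the 300-dimensional ball as embedded in 301 dimensions and thickened by a finite resolution δ along the remaining direction, then ask what radius it needs to match the volume of a unit 301-dimensional ball. The value δ = 1e-7 is an assumed stand-in for float precision, and `log_ball_volume` and `matching_radius` are hypothetical helper names.

```python
import math

def log_ball_volume(dim, radius):
    """Log-volume of a dim-dimensional ball: pi^(dim/2) * radius^dim / Gamma(dim/2 + 1)."""
    return (dim / 2) * math.log(math.pi) + dim * math.log(radius) - math.lgamma(dim / 2 + 1)

def matching_radius(low_dim, high_radius, thickness):
    """Radius at which a low_dim ball, thickened by `thickness` along the one
    remaining direction, has the same volume as a (low_dim + 1)-ball of
    radius `high_radius`."""
    target = log_ball_volume(low_dim + 1, high_radius) - math.log(thickness)
    unit = log_ball_volume(low_dim, 1.0)
    return math.exp((target - unit) / low_dim)

DELTA = 1e-7  # assumed finite resolution of the parameters (illustrative only)
R = matching_radius(300, 1.0, DELTA)
print(f"matching radius: {R:.4f}")
```

Under these assumptions the 300-dimensional ball needs a radius only about 5% larger than the 301-dimensional one to hold the same volume, because the radius enters as a 300th power while the resolution enters only once.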