‘Local volume’ should also give a kind of upper bound on the LLC defined at finite noise though, right? Since, as I understand it, what you’re referring to as the volume of a behavioral region here is the same thing we define via the behavioral LLC at finite noise scale in this paper? And that’s always going to be greater than or equal to the LLC taken at the same point at the same finite noise scale.
Let V(ϵ) be the volume of a behavioral region at cutoff ϵ. Your behavioral LLC at finite noise scale is λ(ϵ) := d log V / d log ϵ, which is invariant under rescaling V by a constant. This information about the overall scale of V seems important. What’s the reason for throwing it out in SLT?
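To make the invariance concrete, here’s a minimal numerical sketch (a toy quadratic model of my own, not anything from the paper): near a minimum with d quadratic directions, V(ϵ) = c·ϵ^(d/2), and a finite-difference estimate of λ(ϵ) comes out the same for any prefactor c.

```python
import numpy as np

# Toy model: near a minimum with d quadratic directions, the volume of the
# behavioral region {L < eps} scales like V(eps) = c * eps**(d/2), where the
# constant c carries all of the overall-scale information.
def V(eps, c, d):
    return c * eps ** (d / 2)

def llc_estimate(eps, c, d, h=1e-4):
    # lambda(eps) := d log V / d log eps, via a central finite difference
    # in log-log coordinates.
    le = np.log(eps)
    return (np.log(V(np.exp(le + h), c, d))
            - np.log(V(np.exp(le - h), c, d))) / (2 * h)

for c in [1e-12, 1.0, 1e12]:
    print(c, llc_estimate(eps=0.01, c=c, d=300))  # ~150.0 regardless of c
```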
Because it’s actually not very important in the limit. The dimensionality of V is what matters. A 3-dimensional sphere in the loss landscape always takes up more of the prior than a 2-dimensional circle, no matter how large the area of the circle is and how small the volume of the sphere is.
In real life, parameters are finite-precision floats, and so this tends to work out to an exponential rather than infinite size advantage. So constant prefactors can matter in principle. But they have to be really, really big.
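Here’s roughly the counting argument, with boxes instead of balls so the arithmetic is exact (the spacing and side length are toy numbers of my own, not from the paper):

```python
import math

# Discretize each parameter at spacing h, roughly float32 resolution near 1.0,
# and count representable points inside a d-dimensional box of side s.
h = 2 ** -23   # ~1.2e-7
s = 0.01       # side length of the behavioral region (arbitrary toy choice)

points_per_axis = s / h   # ~8.4e4
print(f"each extra dimension multiplies the grid-point count by ~{points_per_axis:.1e}")
# A d-dim region can out-count a (d+1)-dim region of the same side only if
# its volume prefactor makes up a factor of ~1e5 per missing dimension:
# "really, really big", but finite rather than infinite.
```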
I am not sure I agree :)
It is unimportant in the limit (of infinite data), but away from that limit it is only suppressed by a factor of 1/log(data), which seems small enough to be beatable in practice in some circumstances.
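To spell out where that factor comes from (a heuristic version of the standard Watanabe-style free-energy expansion, with c as my stand-in for the volume prefactor): if V(ϵ) ≈ c·ϵ^λ near the point, then Z_n = ∫ e^(−nL(w)) dw ≈ e^(−nL₀) · c · Γ(λ+1) · n^(−λ), so the free energy expands as F_n ≈ nL₀ + λ log n − log c − log Γ(λ+1). The overall scale of V enters only through the constant −log c, which is down by a factor of 1/log n relative to the λ log n term: negligible as n → ∞, but a large enough c can win at any finite n.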
The spectra of things like Hessians tend to be singular, yes, but also sort of power-law. This makes the dimensionality a bit fuzzy and (imo) makes it possible for the absolute volume scale of basins to compete with dimensionality.
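A quick illustration of that fuzziness (a toy power-law spectrum of my own choosing, not measured from any real network): the “number of non-flat directions” depends smoothly on where you put the threshold, so there is no single clean dimensionality to plug into the dimension-counting argument.

```python
import numpy as np

# Toy power-law Hessian spectrum: eigenvalue h_i ~ i^(-alpha).
alpha, d = 2.0, 10_000
h = np.arange(1, d + 1, dtype=float) ** -alpha

# "Effective dimension" = number of eigenvalues above a cutoff t.
for t in [1e-2, 1e-4, 1e-6, 1e-8]:
    print(f"cutoff {t:.0e}: effective dim = {np.sum(h > t)}")
# The count grows smoothly (~t^(-1/alpha)) as the cutoff drops: there is
# no plateau separating "real" directions from "flat" ones.
```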
Essentially: it’s not clear that a 301-dimensional sphere really is “bigger” than a 300-dimensional sphere, if the 300-dimensional sphere has a much larger radius. (Obviously it’s true in a strict sense, but hopefully you know what I’m gesturing at here.)
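Putting toy numbers on that gesture (my own back-of-envelope, assuming a float32-ish grid as above): counting representable points, a 300-dimensional ball with even a 5% larger radius beats a 301-dimensional ball.

```python
import math

def log_grid_points(dim, radius, h=2 ** -23):
    """Log of the approximate number of grid points at spacing h inside a
    dim-dimensional ball: log(ball volume) - dim * log(h)."""
    log_vol = (dim / 2) * math.log(math.pi) + dim * math.log(radius) \
              - math.lgamma(dim / 2 + 1)
    return log_vol - dim * math.log(h)

n301 = log_grid_points(301, 0.01)    # 301-ball, radius 0.01
n300 = log_grid_points(300, 0.0105)  # 300-ball, radius only 5% larger
print(n300 - n301)  # > 0: the lower-dimensional ball holds more points
```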
I think this is correct, but we’re working on paper rebuttals/revisions, so I’ll take a closer look very soon! I think we’re working along parallel lines.
In particular, I have been thinking of “measure volumes at varying cutoffs” as being more or less equivalent to “measure LLC at varying ϵ”.
We choose expected KL divergence as a cost function because it gives a behavioral loss, just like your behavioral LLC, yes.
I can give more precise statements once I look at my notes.