I don’t think this is something that requires explanation, though. If you take an arbitrary geometric object in maths, a good definition of its singular points will be “points where the tangent space has higher dimension than expected”. If this is the minimum set of a loss function and the tangent space has higher dimension than expected, that intuitively means that locally there are more directions you can move along without changing the loss function, probably suggesting that there are more directions you can move along without changing the function being implemented at all. So the function being implemented is simple, and the rest of the argument works as I outline it in the post.
I think I understand what you and Jesse are getting at, though: there’s a particular behavior that only becomes visible in the smooth or analytic setting, which is that more singular minima of the loss function become more dominant in the Boltzmann integral as $n \to \infty$, as opposed to maintaining just the same dominance factor of $e^{-O(d)}$. You don’t see this in the discrete case because there’s a finite nonzero gap in loss between first-best and second-best fits, so the second-best fits are exponentially punished in the limit and become irrelevant. In the singular case, by contrast, any first-best fit has some second-best “space” surrounding it, whose volume is more concentrated towards the singular point.
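As a toy check of the claim that more singular minima gain weight as n grows, one can compare two one-dimensional Boltzmann integrals numerically. Everything specific here is an assumption chosen for illustration and not taken from the post: the losses x^2 (a nondegenerate minimum) and x^4 (a more degenerate one), the integration range [-1, 1], and the function names.

```python
import math

def boltzmann_integral(loss, n, steps=20_000):
    """Midpoint-rule estimate of the integral of exp(-n * loss(x)) over [-1, 1]."""
    h = 2.0 / steps
    return h * sum(math.exp(-n * loss(-1.0 + (k + 0.5) * h)) for k in range(steps))

def decay_exponent(loss, n_lo=1e3, n_hi=1e5):
    """Log-log slope of the integral against n (about -lam if Z_n ~ n**(-lam))."""
    z_lo = boltzmann_integral(loss, n_lo)
    z_hi = boltzmann_integral(loss, n_hi)
    return math.log(z_hi / z_lo) / math.log(n_hi / n_lo)

def regular(x):
    return x ** 2   # toy nondegenerate minimum (assumption)

def singular(x):
    return x ** 4   # toy degenerate minimum (assumption)

print("decay exponent, x^2:", decay_exponent(regular))
print("decay exponent, x^4:", decay_exponent(singular))

# Weight of the degenerate minimum relative to the regular one.
for n in (1e3, 1e5):
    ratio = boltzmann_integral(singular, n) / boltzmann_integral(regular, n)
    print("n =", n, "ratio =", ratio)
```

In this toy setting the x^2 integral decays like n^(-1/2) and the x^4 integral like n^(-1/4), so the ratio between them keeps growing polynomially in n instead of settling at a fixed factor. In a discrete setting with a finite loss gap, the runner-up would be suppressed exponentially instead.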
While I understand that, I’m not too sure what predictions you would make about the behavior of neural networks on the basis of this observation. For instance, if this smooth behavior is really essential to the generalization of NNs, wouldn’t we predict that generalization would become worse as people switch to lower-precision floating-point numbers? I don’t think that prediction would have held up very well if someone had made it 5 years ago.
If this is the minimum set of a loss function and the tangent space has higher dimension than expected, that intuitively means that locally there are more directions you can move along without changing the loss function
I think it is pretty obvious in the case of valleys without self-intersections, but that’s just the broad basin case. As for the self-intersection case, well, if it’s obvious to you that singularities will be surrounded by narrow bands of larger dimensionality—including in cases where that “dimensionality” is fractional—then you have a better intuition for the geometry of singularities than me and, I suspect, most other readers, so it might be helpful to make that aspect explicit.
Say that you have a loss function $L : \mathbb{R}^n \to \mathbb{R}$. The minimum loss set is probably not exactly $\nabla L = 0$, but it has something to do with that, so let’s pretend that it’s exactly that for now.
This is a collection of $n$ equations that are generically independent and so should define a subset of dimension zero, i.e. a collection of points in $\mathbb{R}^n$. However, there might be points at which the vanishing of the partial derivatives doesn’t define independent equations, so we get something of positive dimension.
In these cases, what happens is that the gradient $\nabla L$ itself has vanishing derivatives in some directions. In other words, the Hessian matrix $\nabla^2 L$ fails to be of full rank. Say that this matrix has rank $r$ at a specific singular point $p \in \mathbb{R}^n$ and consider the set $L < L_{\min} + \varepsilon$. Diagonalizing $\nabla^2 L$ will generically bring $L$ into a form where it’s a linear combination of $r$ quadratic terms plus higher-order cubic terms, and locally the volume contribution to this set around $p$ will be something of order $\varepsilon^{r/2} \varepsilon^{(n-r)/3} = \varepsilon^{r/6 + n/3}$. The worse the singularity, the smaller the rank $r$ and the greater the volume contribution of the singularity to the set $L < L_{\min} + \varepsilon$.
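The exponent r/6 + n/3 can be sanity-checked numerically for n = 2. The sketch below assumes a toy loss |x|^p + |y|^q with p and q equal to 2 (quadratic direction) or 3 (cubic direction). The absolute values are there so that the origin is a genuine minimum even for odd powers. This model and all function names are illustrative assumptions, not something from the thread.

```python
import math

def sublevel_area(eps, p, q, steps=20_000):
    """Area of {(x, y): |x|**p + |y|**q < eps}, by the midpoint rule in y."""
    y_max = eps ** (1.0 / q)
    h = 2.0 * y_max / steps
    total = 0.0
    for k in range(steps):
        y = -y_max + (k + 0.5) * h
        # For fixed y, the slice in x is |x| < (eps - |y|**q) ** (1/p).
        slack = max(0.0, eps - abs(y) ** q)
        total += 2.0 * slack ** (1.0 / p) * h
    return total

def scaling_exponent(p, q, eps_hi=1e-2, eps_lo=1e-4):
    """Log-log slope of the sublevel-set area against eps."""
    a_hi = sublevel_area(eps_hi, p, q)
    a_lo = sublevel_area(eps_lo, p, q)
    return math.log(a_hi / a_lo) / math.log(eps_hi / eps_lo)

def predicted_exponent(r, n=2):
    """The exponent r/6 + n/3 for Hessian rank r in n dimensions."""
    return r / 6 + n / 3

# (p, q, rank r): rank counts the quadratic directions.
for p, q, r in [(2, 2, 2), (2, 3, 1), (3, 3, 0)]:
    print(f"rank {r}: measured {scaling_exponent(p, q):.4f}, "
          f"predicted {predicted_exponent(r):.4f}")
```

A smaller rank gives a smaller exponent, so for small eps the area of the sublevel set is larger, which is the sense in which the worse singularity contributes more volume.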
The worst singularities dominate the behavior at small $\varepsilon$ because you can move “much further” along directions where $L$ scales in a cubic fashion than along directions where it scales in a quadratic fashion, so those directions are the only ones that “count” when you compare singularities. The tangent space intuition doesn’t apply directly here, but something like it still holds, in the sense that the worse a singularity, the more directions there are in which you can move away from it without changing the value of the loss very much.
Is this intuitive now? I’m not sure what more to do to make the result intuitive.
Hmm, what you’re describing is still in what I was referring to as “the broad basin regime”. Sorry if I was unclear—I was thinking of any case where there is no self-intersection of the minimum loss manifold as being a “broad basin”. I think the main innovation of SLT occurs elsewhere.
Look at the image in the tweet I linked. At the point where the curves intersect, it’s not just that the Hessian fails to be of full rank; it’s not even well-defined. The image illustrates how volume clusters around a single point where the singularity is, not merely around the minimal-loss manifold with the greatest dimensionality. That is what is novel about singular learning theory.
Can you give an example of L which has the mode of singularity you’re talking about? I don’t think I’m quite following what you’re talking about here.
In SLT L is assumed analytic, so I don’t understand how the Hessian can fail to be well-defined anywhere. It’s possible that the Hessian vanishes at some point, suggesting that the singularity there is even worse than quadratic, e.g. $L(x, y) = x^2 y^2$ at the origin or something like that. But even in this regime essentially the same logic is going to apply: the worse the singularity, the further away you can move from it without changing the value of $L$ very much, and accordingly the singularity contributes more to the volume of the set $L(x) < L_{\min} + \varepsilon$ as $\varepsilon \to 0$.
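For the x^2 y^2 example the volume of the thin band can be worked out directly. The sketch below restricts to the box [-1, 1]^2 (an assumption, needed to make the area finite) and compares it with the plain valley x^2, whose minimum set is a single line with no self-intersection. The function names are hypothetical.

```python
import math

def band_area_crossing(eps, steps=200_000):
    """Area of {(x, y) in [-1, 1]^2 : x**2 * y**2 < eps}, midpoint rule in x."""
    s = math.sqrt(eps)
    h = 1.0 / steps
    # For fixed x > 0 the slice in y is |y| < min(1, sqrt(eps) / x);
    # the factor 4 covers both signs of x and of y.
    return 4.0 * h * sum(min(1.0, s / ((k + 0.5) * h)) for k in range(steps))

def band_area_crossing_exact(eps):
    """Closed form of the same area for 0 < eps < 1."""
    return 4.0 * math.sqrt(eps) * (1.0 + 0.5 * math.log(1.0 / eps))

def band_area_valley(eps):
    """Area of {(x, y) in [-1, 1]^2 : x**2 < eps}: a strip of width 2*sqrt(eps)."""
    return 4.0 * math.sqrt(eps)

for eps in (1e-2, 1e-4, 1e-6):
    ratio = band_area_crossing(eps) / band_area_valley(eps)
    print(f"eps={eps:g}: crossing/valley area ratio = {ratio:.3f}")
```

Both areas scale like the square root of eps, but the crossing picks up an extra factor of 1 + (1/2) log(1/eps), which grows without bound as eps shrinks. The extra area comes from the band around each axis, whose width at position x is 2 sqrt(eps) / |x|, so it widens as it approaches the crossing point at the origin.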
In SLT L is assumed analytic, so I don’t understand how the Hessian can fail to be well-defined
Yeah sorry, that was probably needlessly confusing. I was just referencing the image in Jesse’s tweet for ease of illustration (you’re right that it’s not analytic; I’m not sure what’s going on there). The Hessian could also just be 0 at a self-intersection point, like in the example you gave. That’s the sort of case I had in mind. I was confused by your earlier comment because it sounded like you were just describing a valley of dimension r, but as you say there could be isolated points like that also.
I still maintain that this behavior—of volume clustering near singularities when considering a narrow band about the loss minimum—is the main distinguishing feature of SLT and so could use a mention in the OP.