Ege Erdil comments on My impression of singular learning theory

Ege Erdil 18 Jun 2023 17:46 UTC
4 points
0
Say that you have a loss function $L : R^{n} \to R$. The minimum loss set is probably not exactly $\nabla L = 0$, but it has something to do with that, so let’s pretend that it’s exactly that for now.

This is a collection of $n$ equations that are generically independent and so should define a subset of dimension zero, i.e. a collection of points in $R^{n}$. However, there might be points at which the vanishing of the partial derivatives doesn’t give independent equations, so we get something of positive dimension.

In these cases, what happens is that the gradient $\nabla L$ itself has vanishing derivatives in some directions. In other words, the Hessian matrix $\nabla^{2} L$ fails to be of full rank. Say that this matrix has rank $r$ at a specific singular point $p \in R^{n}$ and consider the set $L < L_{min} + ε$. Diagonalizing $\nabla^{2} L$ will generically bring $L$ into a form where it’s a linear combination of $r$ quadratic terms plus higher-order (cubic) terms in the remaining $n - r$ directions. You can move about $ε^{1 / 2}$ along each quadratic direction and about $ε^{1 / 3}$ along each cubic direction before leaving the set, so locally the volume contribution to this set around $p$ will be something of order $ε^{r / 2} ε^{(n - r) / 3} = ε^{r / 6 + n / 3}$. The worse the singularity, the smaller the rank $r$ and the greater the volume contribution of the singularity to the set $L < L_{min} + ε$.
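
As a rough numerical sanity check of this scaling (a sketch added for illustration, not part of the original argument), the snippet below counts grid points in the set $L < ε$ for two toy losses in $n = 2$ dimensions and fits the exponent of $ε$ on a log-log scale. The toy losses, the box $[-0.5, 0.5]^{2}$, the grid size, and all function names are assumptions chosen here; $|y|^{3}$ is used instead of $y^{3}$ so that the origin is actually a minimum with $L_{min} = 0$.

```python
import math


def sublevel_area(loss, eps, half_width=0.5, n_grid=400):
    """Approximate the area of {loss < eps} inside the box
    [-half_width, half_width]^2 by counting grid-cell midpoints."""
    h = 2.0 * half_width / n_grid
    count = 0
    for i in range(n_grid):
        x = -half_width + (i + 0.5) * h
        for j in range(n_grid):
            y = -half_width + (j + 0.5) * h
            if loss(x, y) < eps:
                count += 1
    return count * h * h


def fitted_exponent(loss, eps_values):
    """Least-squares slope of log(area) against log(eps)."""
    xs = [math.log(eps) for eps in eps_values]
    ys = [math.log(sublevel_area(loss, eps)) for eps in eps_values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def predicted_exponent(n, r):
    """Exponent r/2 + (n - r)/3 = r/6 + n/3 from the comment above."""
    return r / 2 + (n - r) / 3


# Hypothetical toy losses, both with minimum value 0 at the origin.
def regular(x, y):
    return x ** 2 + y ** 2        # Hessian has full rank, r = 2


def degenerate(x, y):
    return x ** 2 + abs(y) ** 3   # Hessian has rank r = 1


eps_values = [1e-1, 1e-2, 1e-3]
regular_fit = fitted_exponent(regular, eps_values)
degenerate_fit = fitted_exponent(degenerate, eps_values)

print("regular:    fitted", regular_fit, "predicted", predicted_exponent(2, 2))
print("degenerate: fitted", degenerate_fit, "predicted", predicted_exponent(2, 1))
```

The fitted exponent for the rank-1 loss should come out smaller than for the full-rank one, which is the sense in which the degenerate point contributes more volume at small $ε$. The grid count is crude, so the fitted values only approximate the predicted ones.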

The worst singularities dominate the behavior at small $ε$ because you can move “much further” along directions where $L$ scales in a cubic fashion than along directions where it scales in a quadratic fashion, so those dimensions are the only ones that “count” when you compare singularities. The tangent space intuition doesn’t apply directly here, but something like it still holds: the worse a singularity, the more directions you can move in away from it without changing the value of the loss very much.

Is this intuitive now? I’m not sure what more to do to make the result intuitive.
- interstice 18 Jun 2023 18:05 UTC
  2 points
  0
  Hmm, what you’re describing is still in what I was referring to as “the broad basin regime”. Sorry if I was unclear—I was thinking of any case where there is no self-intersection of the minimum loss manifold as being a “broad basin”. I think the main innovation of SLT occurs elsewhere.
  
  Look at the image in the tweet I linked. At the point where the curves intersect, it’s not just that the Hessian fails to be of full rank, it’s not even well-defined. The image illustrates how volume clusters around a single point where the singularity is, not merely around the minimal-loss manifold with the greatest dimensionality. That is what is novel about singular learning theory.
  - Ege Erdil 19 Jun 2023 10:21 UTC
    2 points
    0
    Can you give an example of an $L$ with the kind of singularity you’re talking about? I don’t think I’m quite following you here.
    
    In SLT $L$ is assumed analytic, so I don’t understand how the Hessian can fail to be well-defined anywhere. It’s possible that the Hessian vanishes at some point, suggesting that the singularity there is even worse than quadratic, e.g. $L(x, y) = x^{2} y^{2}$ at the origin or something like that. But even in this regime essentially the same logic is going to apply: the worse the singularity, the further away you can move from it without changing the value of $L$ very much, and accordingly the singularity contributes more to the volume of the set $L(x) < L_{min} + ε$ as $ε \to 0$.
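
    As a worked instance of this (a sketch under the added assumption that the parameters are restricted to the box $[-1, 1]^{2}$, which is not part of the original comment), take $L(x, y) = x^{2} y^{2}$ with $L_{min} = 0$ and $0 < ε < 1$, and write $δ = \sqrt{ε}$, so that $L < ε$ means $|xy| < δ$. Counting the four quadrants separately:

    $$\mathrm{Vol} \{ L < ε \} = 4 \left( δ + \int_{δ}^{1} \frac{δ}{x} \, dx \right) = 4 δ \left( 1 + \ln \frac{1}{δ} \right) = 4 \sqrt{ε} \left( 1 + \frac{1}{2} \ln \frac{1}{ε} \right)$$

    For comparison, the plain valley $L(x, y) = x^{2}$ on the same box gives exactly $4 \sqrt{ε}$. The point where the two zero lines cross therefore contributes an extra factor of order $\ln (1 / ε)$, which grows without bound as $ε \to 0$.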
    - interstice 19 Jun 2023 21:15 UTC
      4 points
      2
      
      “In SLT $L$ is assumed analytic, so I don’t understand how the Hessian can fail to be well-defined”
      
      Yeah, sorry, that was probably needlessly confusing. I was just referencing the image in Jesse’s tweet for ease of illustration (you’re right that it’s not analytic, I’m not sure what’s going on there). The Hessian could also just be 0 at a self-intersection point, like in the example you gave. That’s the sort of case I had in mind. I was confused by your earlier comment because it sounded like you were just describing a valley of dimension $r$, but as you say there could be isolated points like that also.
      
      I still maintain that this behavior—of volume clustering near singularities when considering a narrow band about the loss minimum—is the main distinguishing feature of SLT and so could use a mention in the OP.