tgb comments on My impression of singular learning theory

tgb 20 Jun 2023 11:37 UTC
4 points
0
Everything I wrote in steps 1-4 was done in a discrete setting (otherwise $| A^{- 1} (f_{0}) |$ is not finite and the whole thing falls apart). I was intending $θ$ to be pairs of floating point numbers and $A$ to be floats to floats.

However, using that, I think I see what you’re trying to say: $θ_{1} θ_{2}$ will equal zero in some cases where $θ_{1}$ and $θ_{2}$ are both non-zero but very small, because their product underflows to zero at the limits of floating point precision. Therefore the pre-image $A^{- 1} (f_{0})$ is actually larger than I claimed, and specifically contains a small neighborhood of $(0, 0)$ .
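The underflow itself is easy to check directly. Here is a minimal sketch, assuming IEEE 754 double precision (Python's `float`); the value `1e-200` is only an illustrative choice, not something from the original discussion:

```python
import sys

# Illustrative values (an assumption): non-zero and perfectly ordinary doubles.
theta_1 = 1e-200
theta_2 = 1e-200

# The exact product, 1e-400, is below the smallest positive subnormal
# double (about 4.9e-324), so the floating point result rounds to zero.
product = theta_1 * theta_2

both_nonzero = theta_1 != 0.0 and theta_2 != 0.0
product_is_zero = product == 0.0

print(both_nonzero, product_is_zero)
print(sys.float_info.min)  # smallest positive normal double, for scale
```

So a pair like this sits in the pre-image even though neither coordinate is zero.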

That doesn’t invalidate my calculation that shows that $(0, 0)$ is equally likely as $(0, 1)$ though: they still have the same loss and $A$ -complexity (since they have the same macrostate). On the other hand, you’re saying that there are points in parameter space that are very close to $(0, 0)$ that are also in this same pre-image and also equally likely. Therefore even if $(0, 0)$ is just as likely as $(0, 1)$ , being near to $(0, 0)$ is more likely than being near to $(0, 1)$ . I think it’s fair to say that that is at least qualitatively the same as what SLT gives in the continuous version of this.
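To make the counting concrete, here is a toy sketch rather than a faithful model of floating point: it assumes a uniform parameter grid with spacing `STEP` and treats any product smaller in magnitude than `FLUSH` as zero. Both constants, the window size, and the helper name are hypothetical choices for illustration. Real floats are not uniformly spaced, so only the qualitative comparison carries over.

```python
# Toy discretization (all constants are illustrative assumptions).
STEP = 1 / 16      # uniform grid spacing for each parameter
FLUSH = 0.01       # products smaller than this in magnitude count as zero
HALF_WIDTH = 4     # window extends HALF_WIDTH grid steps in each direction


def count_zero_products(center_1, center_2):
    """Count grid points in a square window around (center_1, center_2)
    whose product theta_1 * theta_2 is flushed to zero."""
    count = 0
    for i in range(-HALF_WIDTH, HALF_WIDTH + 1):
        for j in range(-HALF_WIDTH, HALF_WIDTH + 1):
            theta_1 = center_1 + i * STEP
            theta_2 = center_2 + j * STEP
            if abs(theta_1 * theta_2) < FLUSH:
                count += 1
    return count


near_origin = count_zero_products(0.0, 0.0)  # window around (0, 0)
near_other = count_zero_products(0.0, 1.0)   # window around (0, 1)
print(near_origin, near_other)
```

Around $(0, 1)$ only the grid points with $θ_{1} = 0$ exactly are counted, while around $(0, 0)$ pairs of small non-zero values are counted too, so the first count comes out larger.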

However, I do think this result “happened” due to factors that weren’t discussed in your original post, which makes it sound like it is “due to” $A$ -complexity. $A$ -complexity is a function of the macrostate, which is the same at all of these points and so does not distinguish between $(0, 0)$ and $(0, 1)$ at all. In other words, your post tells me which $f$ is likely while SLT tells me which $θ$ is likely—these are not the same thing. But you clearly have additional ideas not stated in the post that also help you figure out which $θ$ is likely. Until that is clarified, I think you have a mental theory of this which is very different from what you wrote.
- Ege Erdil 20 Jun 2023 12:39 UTC
  4 points
  0
  Sure, I agree that I didn’t put this information into the post. However, why do you need to know which $θ$ is more likely to know anything about e.g. how neural networks generalize?
  
  I understand that SLT has some additional content beyond what is in the post, and I’ve tried to explain how you could make that fit in this framework. I just don’t understand why that additional content is relevant, which is why I left it out.
  
  As an additional note, I wasn’t really talking about floating point precision being the important variable here. I’m just saying that if you want $A$ -complexity to match the notion of real log canonical threshold, you have to discretize SLT in a way that might not be obvious at first glance, and in a way where some conclusions end up being scale-dependent. This is why if you’re interested in studying this question of the relative contribution of singular points to the partition function, SLT is a better setting to be doing it in. At the risk of repeating myself, I just don’t know why you would try to do that.
  - tgb 20 Jun 2023 13:51 UTC
    5 points
    1
    In my view, it’s a significant philosophical difference between SLT and your post that your post talks only about choosing macrostates while SLT talks about choosing microstates. I’m much less qualified to know (let alone explain) the benefits of SLT, though I can speculate. If we stop training after a finite number of steps, then I think it’s helpful to know where it’s converging to. In my example, if you think it’s converging to $(0, 1)$ , then stopping close to that will get you a function that doesn’t generalize too well. If you know it’s converging to $(0, 0)$ , then stopping close to that will get you a much better function, possibly one that is exactly as good because of discretization, as you pointed out.
    Now this logic is basically exactly what you’re saying in these comments! But I think if someone read your post without prior knowledge of SLT, they wouldn’t figure out that it’s more likely to converge to a point near $(0, 0)$ than near $(0, 1)$ . If they read an SLT post instead, they would figure that out. In that sense, SLT is more useful.
    
    I am not confident that that is the intended benefit of SLT according to its proponents, though. And I wouldn’t be surprised if you could write a simpler explanation of this in your framework than SLT gives; I just think that this post wasn’t it.