Does footnote 15 imply that somehow the parameter vector works its way to a certain part of the parameter space, where it gets stuck because the loss function gradients can’t steer it out?
Yes, this is correct, because (as I explain briefly in the “search problem” section) the loss function factors as a composition of the parameter-function map and the function-loss map, so by the chain rule you’ll always get zero gradient in degenerate directions. (And if you’re near a degeneracy, you’ll get near-zero gradients, with size proportional to the degree of degeneracy.) So SGD will find it hard to get unstuck from degeneracies.
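To spell out the chain-rule step (a minimal sketch in my own notation, with $f_w$ the function implemented by parameters $w$, $\ell$ the function-loss map, and $J_w = \partial f_w / \partial w$ the Jacobian of the parameter-function map):

$$L(w) = \ell(f_w), \qquad \nabla_w L(w) = J_w^{\top}\,\nabla_f \ell(f_w).$$

A degenerate direction $v$ is one the parameter-function map doesn’t see, i.e. $J_w v = 0$, so the directional derivative $\langle \nabla_w L(w), v \rangle = \langle \nabla_f \ell(f_w), J_w v \rangle = 0$ no matter what the loss does downstream; if $J_w v$ is merely small, the gradient component along $v$ is correspondingly small.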
This is actually how I think degeneracies affect SGD, but it’s worth mentioning that it isn’t the only possible mechanism. For instance, degeneracies also affect Bayesian learning (this is precisely the classical SLT story!), where there are no gradients at all. The reason is that (by the same chain-rule logic) the loss function is “flatter” around degeneracies, creating an implicit prior by dedicating more of parameter space to more degenerate solutions. Both mechanisms push in the same direction, creating an inductive bias favoring more degenerate solutions.
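To make the “dedicating more of parameter space” point a bit more quantitative (this is just the standard SLT volume-scaling statement, quoted up to constants and regularity conditions): if $\lambda$ is the local learning coefficient at $w_0$ and $m$ its multiplicity, then

$$\mathrm{Vol}\{w : L(w) - L(w_0) \le \epsilon\} \;\asymp\; c\,\epsilon^{\lambda}\,(-\log \epsilon)^{m-1} \quad \text{as } \epsilon \to 0,$$

so more degenerate solutions (smaller $\lambda$) occupy asymptotically more parameter volume at any given loss level, and the Bayesian posterior concentrates on them even though no gradients are involved.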
Also, does this interpretation mean that simplicity is related to (or even more accurately described as) robustness? That would make intuitive sense to me. In that case, different measures of simplicity could be reframed as measures of robustness to different types of perturbation.
This is basically correct, and you can make this precise. The intuition is that robustness to parameter perturbation corresponds to simplicity because many parameter configurations implement the same effective computation—the solution doesn’t depend on fine-tuning.
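Here’s a toy numerical sketch of that intuition (my own example, not from the paper): a small network with a “dead” hidden unit, whose input weights are degenerate directions, so the implemented function, and hence the loss, is exactly flat along them.

```python
# Toy example: a two-layer tanh network f(x) = sum_k a_k * tanh(w_k . x)
# with one hidden unit whose output weight is zero. Perturbing that unit's
# input weights does not change the implemented function at all, i.e. those
# parameter directions are degenerate and require no fine-tuning.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))            # a batch of inputs

def net(X, W, a):
    return np.tanh(X @ W.T) @ a          # hidden activations times output weights

W = rng.normal(size=(4, 3))              # input weights of 4 hidden units
a = np.array([0.7, -1.2, 0.4, 0.0])      # output weights; unit 3 is "dead"

y_before = net(X, W, a)

W_perturbed = W.copy()
W_perturbed[3] += rng.normal(size=3)     # perturb only the dead unit's input weights
y_after = net(X, W_perturbed, a)

print(np.max(np.abs(y_after - y_before)))  # 0.0 -- the function is unchanged
```

The bigger the set of parameter perturbations that leave the effective computation alone, the less fine-tuning the solution needs, which is the sense in which robustness and simplicity line up.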
Singular learning theory measures “how degenerate” your loss function is using a quantity called the local learning coefficient, which you can view as a kind of “robustness to perturbation” measure. Section 3.1 of my paper explains some of the intuition here. In Bayesian learning, this is basically the unambiguously “correct” measure of (model) complexity, because it literally determines the generalization error / free energy to leading order (Main Formula II, Watanabe 2009).
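For reference, the asymptotics I’m pointing at (quoting the standard statements from Watanabe’s theory from memory, subject to his regularity conditions): the Bayes free energy and expected Bayes generalization error satisfy

$$F_n = nL_n(w_0) + \lambda \log n - (m-1)\log\log n + O_p(1), \qquad \mathbb{E}[G_n] = \frac{\lambda}{n} + o\!\left(\tfrac{1}{n}\right),$$

where $\lambda$ is the (local) learning coefficient and $m$ its multiplicity; in the regular (non-degenerate) case $\lambda$ reduces to $d/2$ with $d$ the parameter count.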
Thanks for your reply. It’s a fascinating topic and I’ve got lots of follow-up questions but I’ll read the paper and book first to get a better idea of which questions have already been addressed.
(edit 2 days later): Whoa. There’s a lot of material in the book, in your paper, and in the papers from your research group. I didn’t realize that one could say so much about flatness! It’s very likely I have misunderstood, but are you guys talking about why a model seems to end up on a particular part of a (high-dimensional) ridge/plateau of the loss function? The relationship between parameter perturbations and data perturbations is interesting. Do you think robustness to parameter perturbations is acting as a proxy for robustness to data perturbations, which is what we really want? Also, on a more technical note, is Hironaka’s theorem of use when the loss function is effectively piecewise quadratic? Are you concerned that collapsing down to a simple/robust program/function appears to be a one-way process (i.e. it doesn’t look like you could undo it)?
There are too many questions here and there’s no obligation to answer them. I will continue reading around the topic when I have time. Perhaps one day I can write things up for sharing.
Program synthesis is an interesting direction to take these ideas. I hope it pays off. It’s pretty hard to judge. I guess animals need to be robust to parts of their nervous system malfunctioning and people need to be robust to parts of their belief system falling through. Compartmentalisation of the programs/concepts would help with this.
Great questions, thank you!