Not quite. SLT is for a specific subcase of Bayesian learning only, not SGD. Maybe more importantly for this point, it also doesn’t really show why neural network priors are good, just that neural network priors strongly favour some solutions over others.
Some SLT-adjacent stuff is pretty strongly suggestive of a proper answer, but I don’t think there’s a proper full proof of what we want in generality written up publicly yet.
SLT studies the limit as the number of data points goes to infinity. this is the opposite of overparametrized! also this seems at least on the face of it like a bizarre setting for studying generalization, which is about guessing correctly after seeing only a small amount of data
i think the subset of weight space corresponding to a function is generally not well thought of as a small local region around any weight vector, especially not in the overparametrized case.
edit added later: however it’s plausible that with the mean field prior scaling you get a contribution to [the prior on a function] from ( of) a macroscopic ball around a certain weight vector which is of [the prior on that function] but however in a weaker sense a decent chunk of the entire prior on that function anyway.[1] so in that sense there might be an interesting semi-local thing going on. sorry i’ll need to think more about this
imo it’s good to scrap a bunch of the story given. the part i’d keep is “in cases where NN bayesianism has good generalization properties[2], simpler functions[3] generally have more prior weight than more complicated functions” (but this is roughly an obvious logical truth that has basically nothing to do with SLT?), and then the question is “why do simpler functions have more prior?”, ie “why do simpler functions have some combination of implementations having smaller weight norm and taking more volume”, and i think one is better off approaching that question basically from scratch. (also this is all about understanding NN bayes. SGD is a meaningfully different thing.)
which probably isn’t always. eg it’s probably pretty false for the prior scaling that gives NNGP in the wide limit. a good story would be able to “see” this difference between differently scaled gaussian priors
SLT is not about a limit as the number of data points goes to infinity. Or at least, it is about such a limit insofar as talking about the mean of a random variable is about “studying the limit as the number of data points goes to infinity” which is not how one would normally talk about such things. In particular, I think when people use this phrasing they are (either deliberately or not) making a comparison to “infinite width limits” and I do not think this is a correct analogy.
So there are two things here: one is the relation between an empirical loss $L_n$ and the population loss $L$, which is the expectation of $L_n$ with respect to the dataset. The mean of a random variable (in this case a function) is an idealised quantity never encountered in practice, for sure. However, to describe a theory that is organised around means of random variables as being “about” infinite limits seems at odds with how most people would think about statistics.
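To pin down the two objects being contrasted, here is one common way to write them. The notation is an assumption on my part (the symbols did not survive in the text above), and taking the loss to be a negative log-likelihood is the usual convention in this literature rather than something stated here:

$$
L_n(w) = -\frac{1}{n}\sum_{i=1}^{n} \log p(y_i \mid x_i, w),
\qquad
L(w) = \mathbb{E}_{D_n}\!\left[ L_n(w) \right],
$$

where $D_n = \{(x_i, y_i)\}_{i=1}^{n}$ is the dataset of size $n$ and $w$ is the weight vector. $L$ is defined at every finite $n$ as a mean; no limit is taken in writing it down.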
Most of the nontrivial mathematical content of SLT is exactly about accounting for the difference between the population loss $L$ and the actual empirical loss $L_n$ you encounter in the real world (and as you say, generalisation has this form: the conceptual content of the theory is a surprising fact, that geometry of the mean object governs the generalisation behaviour at finite $n$, and these are not some exotic effects that are only visible at enormous $n$, as many of the examples in Watanabe’s textbooks will show you).
The other is the role of $n$ in the asymptotic expansions that characterise some of the central theorems in SLT. Here it is true that one would expect these analyses to be more correct as $n$ becomes larger, and at any value of $n$ one cannot a priori rule out that “lower order terms” in fact contribute more than higher order terms. But this is not a phenomenon or situation unique to SLT, and indeed has the same shape as applications of Laplace approximations everywhere (and is a situation also commonly encountered in mathematical physics). At the end of the day such asymptotic expansions are commonly used across applications of mathematics to real world phenomena, they are highly successful, and theory alone cannot tell you when they are valid: you have to actually do experiments.
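For reference, the kind of expansion at issue is Watanabe's asymptotic formula for the Bayesian free energy. The notation below is assumed, not taken from the comment: $\varphi$ is the prior, $w_0$ an optimal parameter, $\lambda$ the learning coefficient (RLCT) and $m$ its multiplicity:

$$
F_n = -\log \int \varphi(w)\, e^{-n L_n(w)}\, dw
    = n L_n(w_0) + \lambda \log n - (m-1)\log\log n + O_p(1).
$$

In a regular model with $d$ parameters one has $\lambda = d/2$ and $m = 1$, which recovers the familiar Laplace/BIC form. The $O_p(1)$ remainder is what “lower order terms” refers to; the formula itself says nothing about how large that remainder is at any particular finite $n$.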
But it is strange to rule out in principle the application of such techniques to study phenomena at finite $n$, as though this was a theory whose domain of applicability is restricted to enormous $n$. There are separate questions one can ask about effective theories etc, and finite $n$ phenomena that are not accounted for by the asymptotic expansions; I don’t mean to dismiss any of that as unimportant (and indeed we think about that kind of thing and continue to work on it). However, I want to push back against some oversimplified characterisation of SLT as a “theory about the infinite $n$ limit”.
But it is strange to rule out in principle the application of such techniques to study phenomena at finite $n$
I agree it would be strange to strongly rule out the application of this at finite $n$ in principle. I think I’ve made a fine simple defeasible argument against this for the overparametrized case, and I think the claim will turn out to be true with more careful investigation, but I haven’t really carried out a rigorous version of this investigation and certainly haven’t spelled the reasoning out, in my comment above or elsewhere publicly. I agree the argument I gave above is not definitive.
I think you agree that a central crux is whether the two terms the SLT expansion takes to be leading order in the posterior are in fact larger than the “lower order terms”, in the overparametrized case?[1] I would weakly guess that in most of the reasonable cases of NN bayesianism, correct generalization happens at $n$ meaningfully smaller than when the terms SLT considers “lower order” in the comparison of posteriors become actually smaller than these “leading terms”[2]. I would guess this strongly about the strongly overparametrized case (the infinite width limit), and reasonably strongly about “mildly overparametrized” cases. Do you agree that if these guesses turn out to be right, then that would be a strong argument against thinking in SLT terms for understanding NN bayes generalization? Ditto for specifically the overparametrized case. If you agree this would be a strong argument, would you like to register opposite guesses to some or all of these?[3]
(Actually, imo the nicest version of NN bayesianism is where learning is just conditioning the NN prior on outputs on train data inputs being closer than some fixed precision, say when the labels are , to the given outputs. Idk how to think about this in SLT terms, given that in the realizable case there is then just a full-dimensional region of solutions, with number-of-mistakes-loss being just const locally around all but the points on the boundary? If SLT aspires to be the theory of NN bayesian learning, it should help us think about this case. But idk what “the terms SLT considers high order” should even mean here, ie I’m not sure what the pro-SLT side of the statements above should even be for this case.[4])
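For concreteness, the conditioning described here can be sketched as rejection sampling from the prior. Everything specific in the sketch is an assumption for illustration and not from the comment: the tiny one-hidden-layer tanh network, the i.i.d. standard Gaussian prior, the two training points, the tolerance of 0.5, and all function names.

```python
import math
import random


def mlp(params, x):
    """One-hidden-layer tanh network R -> R; params = (w1, b1, w2) as lists."""
    w1, b1, w2 = params
    return sum(a * math.tanh(w * x + b) for w, b, a in zip(w1, b1, w2))


def sample_prior(rng, width, std):
    """Draw all weights i.i.d. from N(0, std^2) (a hypothetical choice of prior)."""
    return tuple([rng.gauss(0.0, std) for _ in range(width)] for _ in range(3))


def condition_on_data(train, tol, n_draws, width=4, std=1.0, seed=0):
    """Rejection sampling: keep prior draws whose outputs are within tol of every label."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_draws):
        params = sample_prior(rng, width, std)
        if all(abs(mlp(params, x) - y) < tol for x, y in train):
            kept.append(params)
    return kept


# Toy training set (assumed): two labelled inputs.
train = [(-1.0, -1.0), (1.0, 1.0)]
posterior = condition_on_data(train, tol=0.5, n_draws=20000)

# The posterior predictive at a held-out input is an average over the kept draws.
if posterior:
    mean_pred = sum(mlp(p, 0.5) for p in posterior) / len(posterior)
    print(len(posterior), mean_pred)
```

The accepted set is an open, full-dimensional region of weight space and the accept/reject loss is constant on its interior, which is the feature of this setup the paragraph above is pointing at.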
and theory alone cannot tell you when they are valid: you have to actually do experiments
not a very important disagreement but: I disagree that you have to do experiments to figure out if these expansions are valid.[5] There are at least two other things you can do: you can think heuristically about whether the lower-order terms are actually smaller in cases of interest, and you can work out some cases theoretically.
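As a toy instance of working a case out theoretically (my own illustration, not either commenter's argument): for the one-parameter regular model $y_i \sim N(w, 1)$ with prior $w \sim N(0, s^2)$, the free energy has a closed form, so the remainder beyond the two leading terms $n L_n(\hat w) + \tfrac{1}{2}\log n$ can be computed exactly. This is not a neural network, $\lambda = 1/2$ only because the model is regular with one parameter, and all function names are hypothetical.

```python
import math
import random


def nll_at_mle(ys):
    """n * L_n(w_hat): negative log-likelihood of y_i ~ N(w, 1) at w_hat = mean(ys)."""
    n = len(ys)
    ybar = sum(ys) / n
    return 0.5 * sum((y - ybar) ** 2 for y in ys) + 0.5 * n * math.log(2 * math.pi)


def free_energy_exact(ys, prior_std):
    """Exact F_n = -log Z_n under the prior w ~ N(0, prior_std**2) (closed form)."""
    n = len(ys)
    ybar = sum(ys) / n
    v = prior_std ** 2 + 1.0 / n
    return nll_at_mle(ys) + 0.5 * math.log(n * v) + ybar ** 2 / (2 * v)


def free_energy_quadrature(ys, prior_std, steps=20001):
    """Same F_n by brute-force trapezoid integration over w, as a sanity check."""
    n = len(ys)
    ybar = sum(ys) / n
    half_width = 12.0 / math.sqrt(n)  # the likelihood is negligible outside this window
    h = 2 * half_width / (steps - 1)
    total = 0.0
    for i in range(steps):
        w = ybar - half_width + i * h
        prior = math.exp(-0.5 * (w / prior_std) ** 2) / (prior_std * math.sqrt(2 * math.pi))
        weight = 0.5 if i in (0, steps - 1) else 1.0
        total += weight * prior * math.exp(-0.5 * n * (w - ybar) ** 2)
    return nll_at_mle(ys) - math.log(total * h)


def remainder(ys, prior_std):
    """Exact F_n minus the two leading terms n*L_n(w_hat) + (1/2)*log(n)."""
    n = len(ys)
    return free_energy_exact(ys, prior_std) - (nll_at_mle(ys) + 0.5 * math.log(n))


rng = random.Random(0)
for prior_std in (1.0, 100.0):
    for n in (10, 100, 10_000):
        ys = [rng.gauss(0.5, 1.0) for _ in range(n)]
        # Columns: prior scale, n, the (1/2) log n term, the exact remainder.
        print(prior_std, n, round(0.5 * math.log(n), 3), round(remainder(ys, prior_std), 3))
```

In this model the remainder is $\tfrac{1}{2}\log(s^2 + 1/n) + \bar y^2 / (2(s^2 + 1/n))$, which is at least $\log s$. So with $s = 100$ the “lower order” part exceeds the $\tfrac{1}{2}\log n$ term for every $n < s^2 = 10^4$, while with $s = 1$ it is already small at $n = 10$. Whether something analogous happens in the overparametrized NN case is exactly what is in dispute above; the toy only shows that the question can be settled by calculation in some cases.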
more precisely/correctly, it’s whether the difference between posteriors of different functions with perfect agreement with the data so far mostly comes from these terms + these are useful auxiliary variables to track to make sense of generalization
If you were to agree these would be strong arguments but were not willing to make these guesses, then I’d feel like you’re responding to the claim not-P with sth like “it can remain defensible to think P”, and I guess I’d agree that can remain defensible, but we should figure out what’s true? :)
Btw I think that Dmitry Vaintrob and I can probably prove a generalization result for this case, for certain scalings of the prior “stronger”/“smaller” than the scaling which gives NNGP, at least modulo restricting the support to weight vectors satisfying some “robustness condition” which I’m still unsure if I should think of as being contrived or not (there could also be some less contrived version; work remains on this). Anyway, that argument doesn’t think about local geometry at all. This isn’t published yet, it isn’t fully worked out, it could turn out to not work, and you don’t have to believe me, but hopefully this partly explains where I’m coming from.
Also, I wouldn’t be surprised if the experiments one does are not actually measuring what one thinks they are measuring, eg I wouldn’t be surprised if one’s “RLCT estimator” is not actually close to the RLCT.
sorry i’m aware this is very much not clear but making it clear would be a bunch of work and i’m not going to do it atm
btw the correct meaning of simplicity in this setting is not Kolmogorov complexity, but instead circuit size
by this I mean that simpler functions are already preferred earlier