You might reconstruct your sacred Jeffreys prior with a more refined notion of model identity, which incorporates derivatives (jets on the geometric/statistical side and more of the algorithm behind the model on the logical side).
Except nobody wants to hear about it at parties.
You seem to do OK…
If they only would take the time to explain things simply you would understand.
This is an interesting one. I field this comment quite often from undergraduates, and it’s hard to carve out enough quiet space in a conversation to explain what they’re doing wrong. In a way the proliferation of math on YouTube might be exacerbating this hard step from tourist to troubadour.
As a supervisor of numerous MSc and PhD students in mathematics, I’ve seen that when someone finishes a math degree and considers a job, the tradeoffs are usually between meaning, income, freedom, evil, etc., with some of the obvious choices being high/low along (relatively?) obvious axes. It’s extremely striking to see young talented people with math or physics (or CS) backgrounds going into technical AI alignment roles in big labs, apparently maximising along many (or all) of these axes!

Especially in light of recent events, I suspect that this phenomenon, which appears too good to be true, actually is.
Please develop this question as a documentary special, for lapsed-Starcraft player homeschooling dads everywhere.
Thanks for setting this up!
I don’t understand the strong link between Kolmogorov complexity and generalisation you’re suggesting here. I think by “generalisation” you must mean something more than “low test error”. Do you mean something like “out of distribution” generalisation (whatever that means)?
Well, neural networks do obey Occam’s razor, at least according to the formalisation of that statement that is contained in the post (namely, neural networks when formulated in the context of Bayesian learning obey the free energy formula, a generalisation of the BIC which is often thought of as a formalisation of Occam’s razor). I think that expression of Jesse’s is also correct, in context.

However, I accept your broader point, which I take to be: readers of these posts may naturally draw the conclusion that SLT currently says something profound about (ii) from my other post, and the use of terms like “generalisation” in broad terms in the more expository parts (as opposed to the technical parts) arguably doesn’t make enough effort to prevent them from drawing these inferences.

I have noticed people at the Berkeley meeting and elsewhere believing (ii) was somehow resolved by SLT, or just in a vague sense thinking SLT says something more than it does. While there are hard tradeoffs to make in writing expository work, I think your criticism of this aspect of the messaging around SLT on LW is fair, and to the extent it misleads people it is doing a disservice to the ongoing scientific work on this important subject. I’m often critical of the folklore-driven nature of the ML literature and what I view as its low scientific standards, and especially in the context of technical AI safety I think we need to aim higher, in both our technical and more public-facing work. So I’m grateful for the chance to have this conversation (and to anybody reading this who sees other areas where they think we’re falling short, read this as an invitation to let me know, either privately or in posts like this).

I’ll discuss the generalisation topic further with the authors of those posts. I don’t want to pre-empt their point of view, but it seems likely we may go back and add some context on (i) vs (ii) in those posts or in comments, or we may just refer people to this post for additional context. Does that sound reasonable?

At least right now, the value proposition I see of SLT lies not in explaining the “generalisation puzzle” but in understanding phase transitions and emergent structure; that might end up circling back to say something about generalisation, eventually.
However, I do think that there is another angle of attack on this problem that (to me) seems to get us much closer to a solution (namely, to investigate the properties of the parameter-function map)
Seems reasonable to me!
Re: the articles you link to. I think the second one, by Carroll, is quite careful to say things like “we can now understand why singular models have the capacity to generalise well”, which seems to me uncontroversial given the definitions of the terms involved and the surrounding discussion. I agree that Jesse’s post has a title, “Neural networks generalize because of this one weird trick”, which is clickbaity, since SLT does not in fact yet explain why neural networks appear to generalise well on many natural datasets. However the actual article is more nuanced, saying things like “SLT seems like a promising route to develop a better understanding of generalization and the limiting dynamics of training”. Jesse gives a long list of obstacles to walking this route. I can’t find anything in the post itself to object to. Maybe you think its optimism is misplaced, and fair enough.

So I don’t really understand which claims about inductive bias or generalisation behaviour in these posts you think are invalid?
I think that what would probably be the most important thing to understand about neural networks is their inductive bias and generalisation behaviour, on a fine-grained level, and I don’t think SLT can tell you very much about that. I assume that our disagreement must be about one of those two claims?
That seems probable. Maybe it’s useful for me to lay out a more or less complete picture of what I think SLT does say about generalisation in deep learning in its current form, so that we’re on the same page. When people refer to the “generalisation puzzle” in deep learning I think they mean two related but distinct things: (i) the general question of how it is possible for overparametrised models to have good generalisation error, despite classical interpretations of Occam’s razor like the BIC, and (ii) the specific question of why neural networks, among all possible overparametrised models, actually have good generalisation error in practice (saying this is possible is much weaker than actually explaining why it happens).

In my mind SLT comes close to resolving (i), modulo a bunch of questions which include: whether the asymptotic limit taking the dataset size to infinity is appropriate in practice, the relationship between Bayesian generalisation error and test error in the ML sense (which comes down largely to Bayesian posterior vs SGD), and whether hypotheses like relatively finite variance are appropriate in the settings we care about. If all those points were treated in a mathematically satisfactory way, I would feel that the general question is completely resolved by SLT. Informally, knowing SLT just dispels the mystery of (i) sufficiently that I don’t feel personally motivated to resolve all these points, although I hope people work on them.

One technical note on this: there are some brief notes in SLT6 arguing that “test error” as a model selection principle in ML, presuming some relation between the Bayesian posterior and SGD, is similar to selecting models based on what Watanabe calls the Gibbs generalisation error, which is computed from both the RLCT and the singular fluctuation. Since I don’t think it’s crucial to our discussion I’ll just elide the difference between Gibbs generalisation error in the Bayesian framework and test error in ML, but we can return to that if it actually contains important disagreement.

Anyway I’m guessing you’re probably willing to grant (i), based on SLT or your own views, and would agree the real bone of contention lies with (ii).

Any theoretical resolution to (ii) has to involve some nontrivial ingredient that actually talks about neural networks, as opposed to general singular statistical models. The only specific results about neural networks and generalisation in SLT are the old results about RLCTs of tanh networks, more recent bounds on shallow ReLU networks, and Aoyagi’s upcoming results on RLCTs of deep linear networks (particularly that the RLCT is bounded above even when you take the depth to infinity). As I currently understand them, these results are far from resolving (ii). In its current form SLT doesn’t supply any deep reason for why neural networks in particular are often observed to generalise well when you train them on a range of what we consider “natural” datasets. We don’t understand what distinguishes neural networks from generic singular models, nor what we mean by “natural”. These seem like hard problems, and at present it looks like one has to tackle them in some form to really answer (ii).

Maybe that has significant overlap with the critique of SLT you’re making? Nonetheless I think SLT reduces the problem in a way that seems nontrivial.
If we boil the “ML in-practice model selection” story to “choose the model with the best test error given fixed training steps”, allow some hand-waving in the connection between training steps and number of samples, Gibbs generalisation error and test error etc., and use Watanabe’s theorems (see Appendix B.1 of the quantifying degeneracy paper for a local formulation) to write the Gibbs generalisation error as
$$G_g(n) = L_0 + \frac{1}{n}(\lambda + \nu)$$
where $\lambda$ is the learning coefficient, $\nu$ is the singular fluctuation, and $L_0$ is roughly the loss (the quantity that we can estimate from samples is actually slightly different, I’ll elide this), then (ii), which asks why neural networks on natural datasets have low generalisation error, is at least reduced to the question of why neural networks on natural datasets have low $L_0$, $\lambda$, $\nu$.

I don’t know much about this question, and agree it is important and outstanding.

Again, I think this reduction is not trivial, since the link between $\lambda$, $\nu$ and generalisation error is nontrivial. Maybe at the end of the day this is the main thing we in fact disagree on :)
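As an aside, here is a minimal numerical sketch of the tradeoff encoded in the formula above. This is my own illustration with entirely made-up numbers (not from the posts under discussion): two hypothetical critical points, one simpler but lossier, one more accurate but more complex.

```python
# Hypothetical illustration of G_g(n) = L_0 + (lambda + nu)/n for two
# critical points: A has higher loss but lower complexity, B the reverse.
# All numbers are invented purely to show the shape of the tradeoff.
def gibbs_gen_error(L0, lam, nu, n):
    return L0 + (lam + nu) / n

for n in [100, 1_000, 10_000]:
    g_a = gibbs_gen_error(L0=0.10, lam=2.0, nu=0.5, n=n)  # simpler, lossier
    g_b = gibbs_gen_error(L0=0.08, lam=8.0, nu=1.5, n=n)  # more accurate, more complex
    print(f"n={n:>6}: G_g(A)={g_a:.4f}  G_g(B)={g_b:.4f}")
```

At small $n$ the simpler point A is preferred; once $n$ is large enough the lower-loss point B dominates, which is the usual accuracy/complexity crossover.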
The easiest way to explain why this is the case will probably be to provide an example. Suppose we have a Bayesian learning machine with 15 parameters, whose parameter-function map is given by
$$f(x) = \theta_1 + \theta_2\theta_3 x + \theta_4\theta_5\theta_6 x^2 + \theta_7\theta_8\theta_9\theta_{10} x^3 + \theta_{11}\theta_{12}\theta_{13}\theta_{14}\theta_{15} x^4,$$
and whose loss function is the KL divergence. This learning machine will learn degree-4 polynomials. Moreover, it is overparameterised, and its loss function is analytic in its parameters, etc., so SLT applies to it.
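For concreteness, here is the toy model transcribed as code (a sketch of mine, not code from the original comment):

```python
import numpy as np

# The 15-parameter parameter-function map from the example above;
# t[0] plays the role of theta_1, ..., t[14] the role of theta_15.
def f(x, t):
    return (t[0]
            + t[1] * t[2] * x
            + t[3] * t[4] * t[5] * x**2
            + t[6] * t[7] * t[8] * t[9] * x**3
            + t[10] * t[11] * t[12] * t[13] * t[14] * x**4)

# Overparameterisation in action: rescaling factors within a product
# leaves the encoded polynomial unchanged.
rng = np.random.default_rng(0)
theta = rng.standard_normal(15)
theta_rescaled = theta.copy()
theta_rescaled[1] *= 2.0
theta_rescaled[2] /= 2.0   # the product theta_2 * theta_3 is unchanged
xs = np.linspace(-1.0, 1.0, 5)
assert np.allclose(f(xs, theta), f(xs, theta_rescaled))
```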
In your example there are many values of the parameters that encode the zero function (e.g. $\theta_1 = \theta_2 = \theta_4 = \theta_7 = \theta_{11} = 0$ and all other parameters free), in addition to there being many parameters that encode the function $x^4$ (e.g. $\theta_1 = \theta_2 = \theta_4 = \theta_7 = 0$, variables $\theta_3, \theta_5, \theta_6, \theta_8, \theta_9, \theta_{10}$ free, and $\theta_{11}\theta_{12}\theta_{13}\theta_{14}\theta_{15} = 1$). Without thinking about it more I’m not sure which one actually has the lower local learning coefficient (RLCT) and therefore counts as “more simple” from an SLT perspective.

However, if I understand correctly it’s not this specific example that you care about. We can agree that there is some way of coming up with a simple model which (a) can represent both the functions $x \mapsto 0$ and $x \mapsto x^2$ and (b) has parameters $w^*_0$ and $w^*_{\mathrm{square}}$ respectively representing these functions, with local learning coefficients $\lambda(w^*_0) > \lambda(w^*_{\mathrm{square}})$. That is, according to the local learning coefficient as a measure of model complexity, the neighbourhood of the parameter $w^*_0$ is more complex than that of $w^*_{\mathrm{square}}$. I believe your observation is that this contradicts an a priori notion of complexity that you hold about these functions. Is that a fair characterisation of the argument you want to make?

Assuming it is, my response is as follows. I’m guessing you think $x \mapsto 0$ is simpler than $x \mapsto x^2$ because the former function can be encoded by a shorter code on a UTM than the latter. But this isn’t the kind of complexity that SLT talks about: the local learning coefficient $\lambda(w^*)$ that appears in the main theorems represents the complexity of representing a given probability distribution $p(x|w^*)$ using parameters from the model, and is not some intrinsic model-free complexity of the distribution itself.

One way of saying it is that Kolmogorov complexity is the entropy cost of specifying a machine on the description tape of a UTM (a kind of absolute measure), whereas the local learning coefficient is the entropy cost per sample of incrementally refining an almost true parameter in the neural network parameter space (a kind of relative measure). I believe they’re related but not the same notion, as the latter refers fundamentally to a search process that is missing in the former.

We can certainly imagine a learning machine set up in such a way that it is prohibitively expensive to refine an almost true parameter near a solution that looks like $x \mapsto 0$ and very cheap to refine an almost true parameter near a solution like $x \mapsto x^2$, despite our natural inclination to think of the former as simpler. It’s about the nature of the refinement / search process, not directly about the intrinsic complexity of the functions.

So we agree that Kolmogorov complexity and the local learning coefficient are potentially measuring different things. I want to dig deeper into where our disagreement lies, but I think I’ll just post this as-is and make sure I’m not confused about your views up to this point.
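As a concrete check of the two configurations mentioned at the top of this comment, here is a small self-contained sketch of mine (again, an illustration rather than anything from the original exchange):

```python
import numpy as np

# Redefining the toy map from earlier so this block stands alone.
def f(x, t):
    return (t[0] + t[1]*t[2]*x + t[3]*t[4]*t[5]*x**2
            + t[6]*t[7]*t[8]*t[9]*x**3
            + t[10]*t[11]*t[12]*t[13]*t[14]*x**4)

rng = np.random.default_rng(0)
xs = np.linspace(-2.0, 2.0, 7)

# Zero function: theta_1 = theta_2 = theta_4 = theta_7 = theta_11 = 0
# (0-indexed entries 0, 1, 3, 6, 10), all other parameters free.
theta_zero = rng.standard_normal(15)
theta_zero[[0, 1, 3, 6, 10]] = 0.0
assert np.allclose(f(xs, theta_zero), 0.0)        # encodes x |-> 0

# x^4: theta_1 = theta_2 = theta_4 = theta_7 = 0 and the product
# theta_11 * ... * theta_15 = 1 (satisfied here by setting each factor to 1).
theta_quartic = rng.standard_normal(15)
theta_quartic[[0, 1, 3, 6]] = 0.0
theta_quartic[10:] = 1.0
assert np.allclose(f(xs, theta_quartic), xs**4)   # encodes x |-> x^4
```

Both configurations leave many coordinates free, which is exactly the kind of degeneracy the discussion above is about.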
First of all, SLT is largely based on examining the behaviour of learning machines in the limit of infinite data
I have often said that SLT is not yet a theory of deep learning, this question of whether the infinite data limit is really the right one being among the main question marks I currently see (I think I probably also see the gap between Bayesian learning and SGD as bigger than you do).

I’ve discussed this a bit with my colleague Liam Hodgkinson, whose recent papers https://arxiv.org/abs/2307.07785 and https://arxiv.org/abs/2311.07013 might be more up your alley than SLT.

My view is that the validity of asymptotics is an empirical question, not something that is settled at the blackboard. So far we have been pleasantly surprised at how well the free energy formula works at relatively low n (in e.g. https://arxiv.org/abs/2310.06301). It remains an open question whether this asymptotic continues to provide useful insight into larger models with the kind of dataset size we’re using in LLMs, for example.
I think that the significance of SLT is somewhat over-hyped at the moment
Haha, on LW that is either already true or at current growth rates will soon be true, but it is clearly also the case that SLT remains basically unknown in the broader deep learning theory community.
I claim that this is fairly uninteresting, because classical statistical learning theory already gives us a fully adequate account of generalisation in this setting which applies to all learning machines, including neural networks
I’m a bit familiar with the PAC-Bayes literature and I think this might be an exaggeration. The linked post merely says that the traditional PAC-Bayes setup must be relaxed, and sketches some ways of doing so. Could you please cite the precise theorem you have in mind?
Very loosely speaking, regions with a low RLCT have a larger “volume” than regions with high RLCT, and the impact of this fact eventually dominates other relevant factors.
I’m going to make a few comments as I read through this, but first I’d like to thank you for taking the time to write this down, since it gives me an opportunity to think through your arguments in a way I wouldn’t have done otherwise.

Regarding the point about volume. It is true that the RLCT can be written as (Theorem 7.1 of Watanabe’s book “Algebraic Geometry and Statistical Learning Theory”)
$$\lambda = \lim_{t \to 0} \frac{\log(V(at)/V(t))}{\log a}$$
where $V(t) = \int_{K(w) < t} \varphi(w)\, dw$ is the volume (according to the measure associated to the prior) of the set of parameters $w$ with KL divergence $K(w)$ between the model and truth less than $t$. For small $t$ we have $V(t) \approx c t^\lambda (-\log t)^{m-1}$ where $m$ is the multiplicity. Thus near critical points $w^*$ with lower RLCT, small changes in the cutoff $t$ near $t \approx 0$ tend to change the volume of the set of almost true parameters more than near critical points with higher RLCTs.

My impression is that you tend to see this as a statement about flatness, holding over macroscopic regions of parameter space, and so you read the asymptotic formula for the free energy (where $W_\alpha$ is a region of parameter space containing a critical point $w^*_\alpha$)
$$F_n(W_\alpha) \approx n L_n(w^*_\alpha) + \lambda_\alpha \log n - (m-1)\log\log n + O_P(1)$$
as having a $\log n$ term that does little more than prefer critical points $w^*_\alpha$ that tend to dominate large regions of parameter space according to the prior. If that were true, I would agree this would be underwhelming (or at least, precisely as “whelming” as the BIC, and therefore not adding much beyond the classical story).

However this isn’t what the free energy formula says. Indeed the volume $\int_{W_\alpha} \varphi(w)\, dw$ is a term that contributes only to the constant order term (this is sketched in Chen et al). I claim it’s better to think of the learning coefficient $\lambda$ as being a measure of how many bits it takes to specify an almost true parameter with $K(w) < \frac{1}{n+1}$ once you know a parameter with $K(w) < \frac{1}{n}$, which is a “microscopic” rather than a “macroscopic” statement. That is, lower $\lambda$ means that a fixed decrease $\Delta K$ is “cheaper” in terms of entropy generated.

So the free energy formula isn’t saying “critical points $w^*_\alpha$ dominating large regions tend to dominate the posterior at large $n$” but rather “critical points $w^*_\alpha$ which require fewer bits / less entropy to achieve a fixed $\Delta K$ dominate the posterior for large $n$”. The former statement is both false and uninteresting; the second statement is true and interesting (or I think so anyway).
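To see the volume-scaling formula in action, here is a small Monte Carlo sketch of my own (a toy, not from the dialogue), using the potential $K(w) = (w_1 w_2)^2$ with a uniform prior on $[-1,1]^2$; for this $K$ the learning coefficient is $\lambda = 1/2$ with multiplicity $m = 2$:

```python
import numpy as np

# Monte Carlo estimate of V(t) = prior volume of {K(w) < t} for the toy
# potential K(w) = (w1 * w2)^2 with uniform prior on [-1, 1]^2. Here
# lambda = 1/2 with multiplicity m = 2, so log(V(at)/V(t)) / log(a)
# should approach 1/2 as t -> 0, slowly, because of the (-log t)^(m-1)
# correction factor.
rng = np.random.default_rng(0)
w = rng.uniform(-1.0, 1.0, size=(5_000_000, 2))
K = (w[:, 0] * w[:, 1]) ** 2

def V(t):
    return np.mean(K < t)   # fraction of prior mass with K(w) < t

a = 4.0
for t in [1e-2, 1e-4, 1e-6]:
    lam_est = np.log(V(a * t) / V(t)) / np.log(a)
    print(f"t={t:.0e}: volume-scaling estimate of lambda = {lam_est:.3f}")
```

The estimates creep up towards 1/2 from below as $t$ shrinks; the $\log\log n$ term in the free energy formula comes from exactly this multiplicity correction.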
Good question. What counts as a “-” is spelled out in the paper, but it’s only outlined here heuristically. The “5”-like thing it seems to go near on the way down is not actually a critical point.
The changes in the matrix W and the bias b happen at the same time; it’s not a lagging indicator.
SLT predicts when this will happen!
Maybe. This is potentially part of the explanation for “data double descent” although I haven’t thought about it beyond the 5min I spent writing that page and the 30min I spent talking about it with you at the June conference. I’d be very interested to see someone explore this more systematically (e.g. in the setting of Anthropic’s “other” TMS paper https://www.anthropic.com/index/superposition-memorization-and-double-descent which contains data double descent in a setting where the theory of our recent TMS paper might allow you to do something).
There is quite a large literature on “stage-wise development” in neuroscience and psychology, going back to people like Piaget but quite extensively developed in both theoretical and experimental directions. One concrete place to start on the agenda you’re outlining here might be to systematically survey that literature from an SLT-informed perspective.
we can copy the relevant parts of the human brain which does the things our analysis of our models said they would do wrong, either empirically (informed by theory of course), or purely theoretically if we just need a little bit of inspiration for what the relevant formats need to look like.
I struggle to follow you guys in this part of the dialogue, could you unpack this a bit for me please?