My name is pronounced “YOO-ar SKULL-se”.
I’m a DPhil Scholar at the Future of Humanity Institute in Oxford.
You’re right, I put the parameters the wrong way around. I have fixed it now, thanks!
I could have changed it to Why Neural Networks can obey Occam’s Razor, but I think this obscures the main point.
I think even this would be somewhat inaccurate (in my opinion). If a given parametric Bayesian learning machine does obey (some version of) Occam’s razor, then this must be because of some facts related to its prior, and because of some facts related to its parameter-function map. SLT does not say very much about either of these two things. What the post is about is primarily the relationship between the RLCT and posterior probability, and how this relationship can be used to reason about training dynamics. To connect this to Occam’s razor (or inductive bias more broadly), further assumptions and claims would be required.
At the time of writing, basically nobody knew anything about SLT
Yes, thank you so much for taking the time to write those posts! They were very helpful for me to learn the basics of SLT.
As we discussed at Berkeley, I do like the polynomial example you give and this whole discussion has made me think more carefully about various aspects of the story, so thanks for that.
I’m very glad to hear that! :)
My inclination is that the polynomial example is actually quite pathological and that there is a reasonable correlation between the RLCT and Kolmogorov complexity in practice
Yes, I also believe that! The polynomial example is definitely pathological, and I do think that a low RLCT almost certainly is correlated with simplicity in the case of neural networks. My point is more that the mathematics of SLT does not explain generalisation, and that additional assumptions definitely will be needed to derive specific claims about the inductive bias of neural networks.
Well neural networks do obey Occam’s razor, at least according to the formalisation of that statement that is contained in the post (namely, neural networks when formulated in the context of Bayesian learning obey the free energy formula, a generalisation of the BIC which is often thought of as a formalisation of Occam’s razor).
Would that not imply that my polynomial example also obeys Occam’s razor?
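For context, the free energy formula referred to above is Watanabe's asymptotic expansion of the Bayesian free energy, which generalises the BIC (stated here from memory as a standard result, not as a claim from either post):

```latex
% Watanabe's free energy expansion (singular case), with RLCT \lambda:
F_n = n L_n(w_0) + \lambda \log n + O(\log \log n)
% versus the BIC (regular case), where d is the parameter count:
\mathrm{BIC} = n L_n(\hat{w}) + \frac{d}{2} \log n
% In regular models \lambda = d/2; in singular models \lambda \le d/2.
```

The sense in which this is a formalisation of Occam's razor is that the \(\lambda \log n\) term penalises model complexity, with the RLCT playing the role that the parameter count plays in the BIC.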
However, I accept your broader point, which I take to be: readers of these posts may naturally draw the conclusion that SLT currently says something profound about (ii) from my other post, and the use of terms like “generalisation” in broad terms in the more expository parts (as opposed to the technical parts) arguably doesn’t make enough effort to prevent them from drawing these inferences.
Yes, I think this probably is the case. I also think the vast majority of readers won’t go deep enough into the mathematical details to get a fine-grained understanding of what the maths is actually saying.
I’m often critical of the folklore-driven nature of the ML literature and what I view as its low scientific standards, and especially in the context of technical AI safety I think we need to aim higher, in both our technical and more public-facing work.
Yes, I very much agree with this too.
Does that sound reasonable?
Yes, absolutely!
At least right now, the value proposition I see of SLT lies not in explaining the “generalisation puzzle” but in understanding phase transitions and emergent structure; that might end up circling back to say something about generalisation, eventually.
I also think that SLT probably will be useful for understanding phase shifts and training dynamics (as I also noted in my post above), so we have no disagreements there either.
I think I recall reading that, but I’m not completely sure.
Note that the activation function affects the parameter-function map, and so the influence of the activation function is subsumed by the general question of what the parameter-function map looks like.
I’m not sure, but I think this example is pathological.
Yes, it’s artificial and cherry-picked to make a certain rhetorical point as simply as possible.
This is the more relevant and interesting kind of symmetry, and it’s easier to see what this kind of symmetry has to do with functional simplicity: simpler functions have more local degeneracies.
This is probably true for neural networks in particular, but mathematically speaking, it completely depends on how you parameterise the functions. You can create a parameterisation in which this is not true.
You can make the same critique of Kolmogorov complexity.
Yes, I have been using “Kolmogorov complexity” in a somewhat loose way here.
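For reference, the precise sense in which Kolmogorov complexity is also "parameterisation-dependent" is the invariance theorem: complexities measured relative to two universal machines agree only up to a machine-dependent additive constant.

```latex
% Invariance theorem: for any two universal machines U and V,
% there is a constant c_{U,V} (independent of x) such that
\forall x : \quad K_U(x) \le K_V(x) + c_{U,V}
```

So the choice of reference machine shifts all complexities by at most a constant, analogously to how the choice of parameterisation affects the RLCT.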
Wild conjecture: [...]
Is this not satisfied trivially due to the fact that the RLCT has a certain maximum and minimum value within each model class? (If we stick to the assumption that the parameter space is compact, etc.)
Will do, thank you for the reference!
Yes, I completely agree. The theorems that have been proven by Watanabe are of course true and non-trivial facts of mathematics; I do not mean to dispute this. What I do criticise is the magnitude of the significance of these results for the problem of understanding the behaviour of deep learning systems.
Thank you for this—I agree with what you are saying here. In the post, I went with a somewhat loose equivocation between “good priors” and “a prior towards low Kolmogorov complexity”, but this does skim past a lot of nuance. I do also very much not want to say that the DNN prior is exactly towards low Kolmogorov complexity (this would be uncomputable), but only that it is mostly correlated with Kolmogorov complexity for typical problems.
Yes, I mostly just mean “low test error”. I’m assuming that real-world problems follow a distribution that is similar to the Solomonoff prior (i.e., that data generating functions are more likely to have low Kolmogorov complexity than high Kolmogorov complexity) -- this is where the link is coming from. This is an assumption about the real world, and not something that can be established mathematically.
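For reference, the Solomonoff prior mentioned here weights each string by the total length-discounted probability of all programs that produce it (a standard definition, stated from memory):

```latex
% Solomonoff prior: the weight of a string x is the sum over all
% (prefix-free) programs p that output x on a universal machine U,
% with each program weighted by 2 to the minus its length:
m(x) = \sum_{p \,:\, U(p) = x} 2^{-\ell(p)}
```

Under this prior, strings with short descriptions (low Kolmogorov complexity) receive exponentially more weight than strings that can only be produced by long programs.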
I think that it gives us an adequate account of generalisation in the limit of infinite data (or, more specifically, in the case where we have enough data to wash out the influence of the inductive bias); this is what my original remark was about. I don’t think classical statistical learning theory gives us an adequate account of generalisation in the setting where the training data is small enough for our inductive bias to still matter, and it only has very limited things to say about out-of-distribution generalisation.
The assumption that small neural networks are a good match for the actual data generating process of the world is equivalent to the assumption that neural networks have an inductive bias that gives large weight to the actual data generating process of the world, if we also append the claim that neural networks have an inductive bias that gives large weight to functions which can be described by small neural networks (and this latter claim is not too difficult to justify, I think).
I think the second one by Carroll is quite careful to say things like “we can now understand why singular models have the capacity to generalise well” which seems to me uncontroversial, given the definitions of the terms involved and the surrounding discussion.
The title of the post is Why Neural Networks obey Occam’s Razor! It also cites Zhang et al., 2017, and immediately after this says that SLT can help explain why neural networks have the capacity to generalise well. This gives the impression that the post is intended to give a solution to problem (ii) in your other comment, rather than a solution to problem (i).
Jesse’s post includes the following expression:
I think this also suggests an equivocation between the RLCT measure and practical generalisation behaviour. Moreover, neither post contains any discussion of the difference between (i) and (ii).
Anyway I’m guessing you’re probably willing to grant (i), based on SLT or your own views, and would agree the real bone of contention lies with (ii).
Yes, absolutely. However, I also don’t think that (i) is very mysterious, if we view things from a Bayesian perspective. Indeed, it seems natural to say that an ideal Bayesian reasoner should assign non-zero prior probability to all computable models, or something along those lines, and in that case, notions like “overparameterised” no longer seem very significant.
Maybe that has significant overlap with the critique of SLT you’re making?
Yes, this is basically exactly what my criticism of SLT is—I could not have described it better myself!
Again, I think this reduction is not trivial, since the link between the RLCT and the generalisation error is nontrivial.
I agree that this reduction is relevant and non-trivial. I don’t have any objections to this per se. However, I do think that there is another angle of attack on this problem that (to me) seems to get us much closer to a solution (namely, to investigate the properties of the parameter-function map).
A few things:
1. Neural networks do typically learn functions with low Kolmogorov complexity (otherwise they would not be able to generalise well).
2. It is a type error to describe a function as having low RLCT. A given function may have a high RLCT or a low RLCT, depending on the architecture of the learning machine.
3. The critique is against the supposition that we can use SLT to explain why neural networks generalise well in the small-data regime. The example provides a learning machine which would not generalise well, but which does fit all assumptions made by SLT. Hence, the SLT theorems which appear to prove that learning machines will generalise well when they are subject to the assumptions of SLT must in fact be showing something else.
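To illustrate point 2 with a textbook one-dimensional example (a standard fact about the RLCT, not specific to any architecture):

```latex
% If the loss near its minimum behaves like K(w) = w^{2k},
% then the RLCT is
\lambda = \frac{1}{2k}
% e.g. K(w) = w^2 gives \lambda = 1/2, while K(w) = w^4 gives
% \lambda = 1/4 -- so the same function (the one realised at w = 0)
% gets a different RLCT under a different parameterisation.
```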
My point is precisely that SLT does not give us a predictive account of how neural networks behave, in terms of generalisation and inductive bias, because it abstracts away from factors which are necessary to understand generalisation and inductive bias.
To say that neural networks are empirical risk minimisers is just to say that they find functions with globally optimal training loss (and, if they find functions with a loss close to the global optimum, then they are approximate empirical risk minimisers, etc).
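As a minimal illustration of this definition (a toy sketch with a hypothetical finite hypothesis class, not anyone's actual training setup), empirical risk minimisation just means selecting the hypothesis with the lowest training loss:

```python
# Toy sketch of empirical risk minimisation over a finite hypothesis
# class (hypothetical example, not a real training setup).

def empirical_risk(h, data):
    """Average 0-1 loss of hypothesis h on the training data."""
    return sum(h(x) != y for x, y in data) / len(data)

def erm(hypotheses, data):
    """Return the hypothesis with the lowest training loss."""
    return min(hypotheses, key=lambda h: empirical_risk(h, data))

# Toy hypothesis class: threshold classifiers on the real line.
hypotheses = [lambda x, t=t: int(x > t) for t in [0.0, 0.5, 1.0]]
data = [(0.2, 0), (0.7, 1), (0.9, 1)]

best = erm(hypotheses, data)
print(empirical_risk(best, data))  # the threshold t = 0.5 fits perfectly -> 0.0
```

An approximate empirical risk minimiser would instead be any learner guaranteed to return a hypothesis whose training loss is within some tolerance of this minimum.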
I think SLT effectively assumes that neural networks are (close to being) empirical risk minimisers, via the assumption that they are trained by Bayesian induction.
The bounds are not exactly vacuous—in fact, they are (in a sense) tight. However, they concern a somewhat adversarial setting, where the data distribution may be selected arbitrarily (including by making it maximally opposed to the inductive bias of the learning algorithm). This means that the bounds end up being much larger than what you would typically observe in practice, if you give typical problems to a learning algorithm whose inductive bias is attuned to the structure of “typical” problems.
If you move from this adversarial setting to a more probabilistic setting, where you assume a fixed distribution over learning problems (e.g. over data distributions or target functions), then you may be able to prove tighter probabilistic bounds. However, I do not have any references to places where this actually has been done (and as far as I know, it has not been done before).
I already posted this in response to Daniel Murfet, but I will copy it over here:
For example, the agnostic PAC-learning theorem says that if a learning machine A (for binary classification) is an empirical risk minimiser with VC dimension d, then for any distribution D over X × {0,1}, if A is given access to at least O((d + log(1/δ))/ε²) data points sampled from D, then it will with probability at least 1−δ learn a function whose (true) generalisation error (under D) is at most ε worse than the best function which A is able to express (in terms of its true generalisation error under D). If we assume that D corresponds to a function which A can express, then the generalisation error of A will with probability at least 1−δ be at most ε.
This means that, in the limit of infinite data, A will with probability arbitrarily close to 1 learn a function whose error is arbitrarily close to the optimal value (among all functions which A is able to express). Thus, any empirical risk minimiser with a finite VC dimension will generalise well in the limit of infinite data.
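As a rough numerical illustration of how this bound scales (constants omitted, so this shows only the shape of the bound, not a tight sample complexity):

```python
import math

def pac_sample_bound(d, epsilon, delta):
    """Rough agnostic PAC sample-complexity bound, up to constants:
    n ~ (d + log(1/delta)) / epsilon**2."""
    return math.ceil((d + math.log(1 / delta)) / epsilon ** 2)

# e.g. VC dimension 100, accuracy epsilon = 0.05, confidence delta = 0.01:
# the bound is on the order of tens of thousands of samples.
print(pac_sample_bound(100, 0.05, 0.01))
```

Note that the bound grows linearly in the VC dimension and quadratically in 1/ε, but only logarithmically in 1/δ, which is why the limiting behaviour in the infinite-data regime is so benign.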
For a bit more detail, see this post.
Does this not essentially amount to just assuming that the inductive bias of neural networks in fact matches the prior that we (as humans) have about the world?
This is basically a justification of something like your point 1, but AFAICT it’s closer to a proof in the SLT setting than in your setting.
I think it could probably be turned into a proof in either setting, at least if we are allowed to help ourselves to assumptions like “the ground truth function is generated by a small neural net” and “learning is done in a Bayesian way”, etc.
In your example there are many values of the parameters that encode the zero function
Ah, yes, I should have made the training data be (1,1), rather than (0,0). I’ve fixed the example now!
Is that a fair characterisation of the argument you want to make?
Yes, that is exactly right!
Assuming it is, my response is as follows. I’m guessing you think f₁ is simpler than f₂ because the former function can be encoded by a shorter code on a UTM than the latter.
The notion of complexity that I have in mind is even more pre-theoretic than that; it’s something like “f₂ looks like an intuitively less plausible guess than f₁”. However, if we want to keep things strictly mathematical, then we can substitute this for the definition in terms of UTM codes.
But this isn’t the kind of complexity that SLT talks about
I’m well aware of that—that is what my example attempts to show! My point is that the kind of complexity which SLT talks about does not allow us to make inferences about inductive bias or generalisation behaviour, contra what is claimed e.g. here and here.
So we agree that Kolmogorov complexity and the local learning coefficient are potentially measuring different things. I want to dig deeper into where our disagreement lies, but I think I’ll just post this as-is and make sure I’m not confused about your views up to this point.
As far as I can tell, we don’t disagree about any object-level technical claims. Insofar as we do disagree about something, it may be about more methodological meta-questions. I think that what would probably be the most important thing to understand about neural networks is their inductive bias and generalisation behaviour, on a fine-grained level, and I don’t think SLT can tell you very much about that. I assume that our disagreement must be about one of those two claims?
I suppose this depends on what you mean by “most”. DNNs and CNNs have noticeable and meaningful differences in their (macroscopic) generalisation behaviour, and these differences are due to their parameter-function map. This is also true of LSTMs vs transformers, and so on. I think it’s fairly likely that these kinds of differences could have a large impact on the probability that a given type of model will learn to exhibit goal-directed behaviour in a given training setup, for example.
Do you mean the loss landscape in the limit of infinite data, or the loss landscape for a “small” amount of data? In the former case, the loss landscape determines the parameter-function map over the data distribution. In the latter case, my guess would be that the statement probably is false (though I’m not sure).