My name is pronounced “YOO-ar SKULL-se”.
I’m a DPhil Scholar at the Future of Humanity Institute in Oxford.
My name is pronounced “YOO-ar SKULL-se”.
I’m a DPhil Scholar at the Future of Humanity Institute in Oxford.
I believe Chris has now updated the Towards Data Science blog post to be more clear, based on the conversation you had, but I’ll make some clarifications here as well, for the benefit of others:
The key claim, that , is indeed not (meant to be) dependent on any test data per se. The test data comes into the picture because we need a way to granuralise the space of possible functions if we want to compare these two quantities empirically. If we take “the space of functions” to be all the functions that a given neural network can express on the entire vector space in which it is defined, then there would be an uncountably infinite number of such functions, and any given function would never show up more than once in any kind of experiment we could do. We therefore need a way to lump the functions together into sensible buckets, and we decided to do that by looking at what output the function gives on a set of images not used in training. Stated differently, we look at the partial function that the network expresses on a particular subset of the input vector space, consisting in a bunch of points sampled from the underlying data distribution. So, basically, your optimistic guess is correct—the test data is only used as a way to lump functions together into a finite number of sensible buckets (and the test labels are not used).
We do have a lot of empirical evidence that simple functions indeed have bigger volumes in parameter-space (exponentially bigger volumes, in fact). See especially the 2nd and 3rd paper linked in the OP.
It’s true that we don’t have a rigorous mathematical proof for “why” simple functions should have larger volumes in parameter-space, but I can give an intuitive argument for why I think it would make sense to expect this to be the case.
First of all, we can look at the claim “backwards”; suppose a function has a large volume in parameter-space. This means that a lot of the parameters in a sense are redundant, so it should be possible to compress the representation of this function. Conversely, if a function has a small volume in parameter-space, then all the parameters of the network are necessary to pinpoint that function, and so writing out the entire structure of the network might be one of the shortest ways to represent that function. Therefore, we should expect functions with large volumes in parameter-space to be simple, and functions with small volumes to be complex.
There is a mathematical theorem, called the Levin bound, that roughly formalises this intuition. This bound says essentially that if we have a space of functions F, and a space of parameters P, and a parameter-function map m : P → F, where m itself has low complexity relative to the most complex functions in F (and F is overparameterised by P), then m will be exponentially biased towards simple functions (ie, if an element f of F has low complexity, then there is an exponentially larger number of elements of P that map to f via m).
The Levin bound doesn’t apply directly to neural networks, because it assumes that P is finite and discrete, but it gives some extra backing to the intuition above.
Also, check out this.
I agree with everything you have said.
A few things:
1. Neural networks do typically learn functions with low Kolmogorov complexity (otherwise they would not be able to generalise well).
2. It is a type error to describe a function as having low RLCT. A given function may have a high RLCT or a low RLCT, depending on the architecture of the learning machine.
3. The critique is against the supposition that we can use SLT to explain why neural networks generalise well in the small-data regime. The example provides a learning machine which would not generalise well, but which does fit all assumptions made my SLT. Hence, the SLT theorems which appear to prove that learning machines will generalise well when they are subject to the assumptions of SLT must in fact be showing something else.
My point is precisely that SLT does not give us a predictive account of how neural networks behave, in terms of generalisation and inductive bias, because it abstacts away from factors which are necessary to understand generalisation and inductive bias.
In your example there are many values of the parameters that encode the zero function
Ah, yes, I should have made the training data be (1,1), rather than (0,0). I’ve fixed the example now!
Is that a fair characterisation of the argument you want to make?
Yes, that is exactly right!
Assuming it is, my response is as follows. I’m guessing you think is simpler than because the former function can be encoded by a shorter code on a UTM than the latter.
The notion of complexity that I have in mind is even more pre-theoretic than that; it’s something like ” looks like an intuitively less plausible guess than ”. However, if we want to keep things strictly mathematical, then we can substitute this for the definition in terms of UTM codes.
But this isn’t the kind of complexity that SLT talks about
I’m well aware of that—that is what my example attempts to show! My point is that the kind of complexity which SLT talks about does not allow us to make inferences about inductive bias or generalisation behaviour, contra what is claimed e.g. here and here.
So we agree that Kolmogorov complexity and the local learning coefficient are potentially measuring different things. I want to dig deeper into where our disagreement lies, but I think I’ll just post this as-is and make sure I’m not confused about your views up to this point.
As far as I can tell, we don’t disagree about any object-level technical claims. Insofar as we do disagree about something, it may be more methodolocical meta-questions. I think that what would probably be the most important thing to understand about neural networks is their inductive bias and generalisation behaviour, on a fine-grained level, and I don’t think SLT can tell you very much about that. I assume that our disagreement must be about one of those two claims?
Like Rohin, I’m not impressed with the information theoretic side of this work.
Specifically, I’m wary of the focus on measuring complexity for functions between finite sets, such as binary functions.
Mostly, we care about NN generalization on problems where the input space is continuous, generally R^n. The authors argue that the finite-set results are relevant to these problems, because one can always discretize R^n to get a finite set. I don’t think this captures the kinds of function complexity we care about for NNs.
We’re not saying that discrete complexity measures fully capture what we care about for NNs! We do however think that they are sufficiently relevant to be informative for the bigger picture, even if just as a proxy for what we actually care about.
Most complexity measures give roughly similar values for the (relative) complexity of most objects, so our assumption is that if something is the case for a bunch of different tractable complexity measures, then this is also likely to be the case for whatever the “ideal” complexity measure would be in the relevant case. In particular, if regardless of whether K represents Boolean complexity, or LZ complexity, etc, then this is also likely to be true for the “ideal” complexity measure for neural networks.
Also: since we’re estimating various probabilities by sampling, we basically need to discretise the function space. If you have any concrete suggestions for how to get around this then we’re all ears!
As for the rest of your comment—what you’re saying here seems true to me, but I’m not sure I see how any of this is a counterpoint to anything we’re saying?
Ah, I certainly agree with this.
I do not wish to claim that all functions with low Kolmogorov complexity have large volumes in the parameter-space of a sufficiently large neural network. In fact, I can point to several concrete counterexamples to this claim. To give one example, the identity function certainly has a low Kolmogorov complexity, but it’s very difficult for a (fully connected feed-forward) neural network to learn this function (if the input and output is represented in binary form by a bit string). If you try to learn this function by training on only odd numbers then the network will not robustly generalise to even numbers (or vice versa). Similarly, if you train using only number in a certain range then the network will not robustly generalise outside this range. This is because a pattern such as “the n’th input neuron is equal to the n’th output neuron” lacks a simple representation in a neural network (and hence this function has a small parameter-space volume, even though it has low Kolmogorov complexity). The same goes for the function that recognises palindromes, and etc.
So, I agree that there are certain functions with low Kolmogorov complexity that a neural network normally cannot “see” properly. I also think one could frame a lot of the research on developing new neural network architectures as being about making neural networks able to “see” more kinds of functions. For example, NALUs give neural networks the ability to “see” arithmetic relations more easily. I hence certainly think it’s a very relevant question which complexity measure best describes the bias in neural networks (and I think this actually matters for practical problems). Note that the identity function is very smooth.
This is a bit of a tangent, but the Levin bound is actually about Kolmogorov complexity. It’s a fairly simple theorem; the proof is constructive, and basically shows that a given function f which corresponds to many parameters in the parameter-space cannot be too complex, by constructing a simple program which computes f. Very roughly, if the parameter-space is finite and discrete, then we could construct a Huffman code for the function space (where the distribution over the function-space is the distribution that corresponds to the uniform distribution over the parameter-space). We can then make a computer program that computes f by concatenating the Huffman code of f and the parameter-function map m (which gives an upper bound on the Kolmogorov complexity of functions with large volumes). Of course, this theorem does not per se actually apply to neural networks, since it assumes that the parameter-space is finite and discrete, so in this context it’s essentially just an intuition pump.
Anyway I’m guessing you’re probably willing to grant (i), based on SLT or your own views, and would agree the real bone of contention lies with (ii).
Yes, absolutely. However, I also don’t think that (i) is very mysterious, if we view things from a Bayesian perspective. Indeed, it seems natural to say that an ideal Bayesian reasoner should assign non-zero prior probability to all computable models, or something along those lines, and in that case, notions like “overparameterised” no longer seem very significant.
Maybe that has significant overlap with the critique of SLT you’re making?
Yes, this is basically exactly what my criticism of SLT is—I could not have described it better myself!
Again, I think this reduction is not trivial since the link between , and generalisation error is nontrivial.
I agree that this reduction is relevant and non-trivial. I don’t have any objections to this per se. However, I do think that there is another angle of attack on this problem that (to me) seems to get us much closer to a solution (namely, to investigate the properties of the parameter-function map).
including stuff Joar has worked on
That is right! See this paper.
This is obviously true; any AI complete problem can be trivially reduced to the problem of writing an AI program that solves the problem. That isn’t really a problem for the proposal here. The point isn’t that we could avoid making AGI by doing this, the point is that we can do this in order to get AI systems that we can trust without having to solve interpretability.
This is probably true, but the extent to which it is true is unclear. Moreover, if the inner workings of intelligence are fundamentally uninterpretable, then strong interpretability must also fail. I already commented on this in the last two paragraphs of the top-level post.
Yes, I agree with this.
I should note that SGD definitely does make a contribution to the inductive bias, but this contribution does seem to be quite small compared to the contribution that comes from the implicit prior embedded in the parameter-function map. For example, if you look at Figure 6 in the Towards Data Science post, you can see that different versions of SGD give a slightly different inductive bias. It’s also well-known that the inductive bias of neural networks is affected by things like how long you train for, and what step size and batch size you use, etc. However, these effects seem to be quite small compared to the effect that comes from the parameter-function map.
I should also note that I think that the fact that Gaussian processes even work at all already in itself gives us a fairly good reason to expect them to capture most of what makes NNs work in practice. For any given function approximator, if that function approximator is highly expressive then the “null hypothesis” should be that it basically won’t generalise at all. The fact that NNs and GPs both work, and the fact that there is a fairly strong correspondence between them, means that it would be kind of weird if they worked for completely different reasons.