Killing 90% of the human population would not be enough to cause extinction. That would put us at a population of 800 million, higher than the population in 1700.

# interstice

It could be considered an essence, but physical rather than metaphysical.

This feels related to metaphilosophy. In the sense that (to me) it seems that one of the core difficulties of metaphilosophy is that in coming up with a ‘model’ agent you need to create an agent that is not only capable of thinking about its own structure, but capable of being *confused* about what that structure is (and presumably, of becoming un-confused). Bayesian etc. approaches can model agents being confused about object-level things, but it’s hard to even imagine what a model of an agent confused about ontology would look like.

Another example of this sort of thing: least-rattling feedback in driven systems.

Perhaps this is a physicist vs mathematician type of thinking though

Good guess ;)

This is not the same as saying that an extremely wide trained-by-random-sampling neural network would not learn features—there is a possibility that the first time you reach 100% training accuracy corresponds to effectively randomly initialised initial layers + trained last layer, but in expectation all the layers should be distinct from an entirely random initialisation.

I see—so you’re saying that even though the distribution of *output* functions learned by an infinitely-wide randomly-sampled net is unchanged if you freeze everything but the last layer, the distribution of *intermediate* functions might change. If true, this would mean that feature learning and inductive bias are ‘uncoupled’ for infinite-width randomly-sampled nets. I think this is false, however—that is, I think it’s provable that the distribution of intermediate functions does *not* change in the infinite-width limit when you condition on the training data, even when conditioning over all layers. I can’t find a reference offhand though, I’ll report back if I find anything resolving this one way or another.

The claim I am making is that the reason why feature learning is good is not because it improves inductive bias—it is because it allows the network to be compressed. That is probably at the core of our disagreement.

Yes, I think so. Let’s go over the ‘thin network’ example—we want to learn some function which can be represented by a thin network. But let’s say a randomly-initialized thin network’s intermediate functions won’t be able to fit the function—that is (with high probability over the random initialization) we won’t be able to fit the function just by changing the parameters of the last layer. It seems there are a few ways we can alter the network to make fitting possible:

(A) Expand the network’s width until (with high probability) it’s possible to fit the function by only altering the last layer

(B) Keeping the width the same, re-sample the parameters in all layers until we find a setting that can fit the function

(C) Keeping the width the same, train the network with SGD

By hypothesis, all three methods will let us fit the target function. You seem to be saying [I think, correct me if I’m wrong] that all three methods should have the same inductive bias as well. I just don’t see any reason this should be the case—on the face of it, I would guess that all three have different inductive biases (though A and B might be similar). They’re clearly different in some respects -- (C) can do transfer learning but (A) cannot (B is unclear).
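Methods (A) and (C) can be sketched numerically. Here is a minimal numpy toy (all widths, the target function, and hyperparameters are made up for illustration; (B), rejection-sampling full parameter settings, is omitted because it is intractable even at toy scale):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D regression problem (target and sizes are made up).
X = np.linspace(-1, 1, 40)[:, None]
y = np.sin(3 * X[:, 0])

def relu(z):
    return np.maximum(z, 0.0)

# (A) Very wide net: keep the random first layer frozen, solve for the
# last layer by least squares.
W = rng.normal(size=(1, 2000))
b = rng.normal(size=2000)
H = relu(X @ W + b)                      # 40 x 2000 random-feature matrix
a, *_ = np.linalg.lstsq(H, y, rcond=None)
mse_A = float(np.mean((H @ a - y) ** 2))

# (C) Narrow net, all layers trained by full-batch gradient descent.
W2 = rng.normal(size=(1, 20)); b2 = rng.normal(size=20)
v = rng.normal(size=20) / np.sqrt(20)
lr = 0.01
for _ in range(20000):
    h = relu(X @ W2 + b2)
    err = h @ v - y
    grad_h = np.outer(err, v) * (h > 0)  # backprop through the ReLU
    v -= lr * (h.T @ err) / len(y)
    W2 -= lr * (X.T @ grad_h) / len(y)
    b2 -= lr * grad_h.mean(axis=0)
mse_C = float(np.mean((relu(X @ W2 + b2) @ v - y) ** 2))
```

Both routes drive training error low on this toy problem; the point of contention is whether their behaviour off the training set (and under transfer) coincides, which a toy like this cannot settle.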

What do we know about SGD-trained nets that suggests this?

My intuition here is that SGD-trained nets can learn functions non-linearly while NTK/GP can only do so linearly. So in the car detector example, SGD is able to develop a neuron detecting cars through some as-yet unclear ‘feature learning’ mechanism. The NTK/GP can do so as well, sort of, since they’re universal function approximators. However, the way they do this is by taking a giant linear combination of random functions which is able to function identically to a car detector on the data points given. It seems like this might be more fragile/generalize worse than the neurons produced by SGD. Though that is admittedly somewhat conjectural at this stage, since we don’t really have a great understanding of how feature learning in SGD works.

I’ve read the new feature learning paper! We’re big fans of his work, although again I don’t think it contradicts anything I’ve just said.

ETA: Let me elaborate upon what I see as the significance of the ‘feature learning in infinite nets’ paper. We know that NNGP/NTK models can’t learn features, but SGD can: I think this provides strong evidence that they are learning using different mechanisms, and likely have substantially different inductive biases. The question is whether randomly sampled *finite* nets can learn features as well. Since they are equivalent to NNGP/NTK at infinite width, any feature learning they do *can only come from finiteness*. In contrast, in the case of SGD, it’s possible to do feature learning *even in the infinite-width limit*. This suggests that even if randomly-sampled finite nets can do feature learning, the mechanism by which they do so is different from SGD, and hence their inductive bias is likely to be different as well.

First, thank you for your comments and observations—it’s always interesting to read pushback.

And thanks for engaging with my random blog comments! TBC, I think you guys are definitely on the right track in trying to relate SGD to function simplicity, and the empirical work you’ve done fleshing out that picture is great. I just think it could be even *better* if it was based around a better SGD scaling limit ;)

Therefore, if an optimiser samples functions proportional to their volume, you won’t get any difference in performance if you learn features (optimise the whole network) or do not learn features (randomly initialise and freeze all but the last layer and then train just the last).

Right, this is an even better argument that NNGPs/random-sampled nets don’t learn features.

Given therefore that the posteriors are the same, it implies that feature learning is not aiding inductive bias—rather, feature learning is important for expressivity reasons

I think this only applies to NNGP/random-sampled nets, not SGD-trained nets. To apply to SGD-trained nets, you’d need to show that the new features learned by SGD have the same distribution as the features found in an infinitely-wide random net, but I don’t think this is the case. As an illustration, some SGD-trained nets can develop expressive neurons like ‘car detector’, enabling them to fit the data with a relatively small number of such neurons. If you used an NNGP to learn the same thing, you wouldn’t get a single ‘car detector’ neuron, but rather some huge linear combination of high-frequency features that can approximate the cars seen in the dataset. I think this would probably generalize worse than the network with an actual ‘car detector’ (this isn’t empirical evidence of course, but I think what we know about SGD-trained nets and the NNGP strongly suggests a picture like this).

Furthermore (and on a slightly different note), it is known that infinitesimal GD converges to the Boltzmann distribution for any DNN (very similar to random sampling)

Interesting, haven’t seen this before. Just skimming the paper, it sounds like the very small learning rate + added white noise might result in different limiting behavior from usual SGD. Generally it seems that there are a lot of different possible limits one can take; empirically SGD-trained nets do seem to have ‘feature learning’, so I’m skeptical of limits that don’t have that (I assume they don’t have it, for theoretical reasons, anyway. Would be interesting to actually examine the features found in networks trained like this, and to see if they can do transfer learning at all). re: ‘colored noise’, not sure to what extent this matters. I think a more likely source of discrepancy is the *lack* of white noise in normal training (I guess this counts as ‘colored noise’ in a sense) and the larger learning rate.

if anyone can point out why this line of argument is not correct, or can steelman a case for SGD inductive bias appearing at larger scales, I would be very interested to hear it.

Not to be a broken record, but I strongly recommend checking out Greg Yang’s work. He clearly shows that there exist infinite-width limits of SGD that can do feature/transfer learning.

I think we basically agree on the state of the empirical evidence—the question is just whether NTK/GP/random-sampling methods will continue to match the performance of SGD-trained nets on more complex problems, or if they’ll break down, ultimately being a first-order approximation to some more complex dynamics. I think the latter is more likely, mostly based on the lack of feature learning in NTK/GP/random limits.

re: the architecture being the source of inductive bias—I certainly think this is true in the sense that architecture choice will have a bigger effect on generalization than hyperparameters, or the choice of which local optimizer to use. But I do think that using a local optimizer at all, as opposed to randomly sampling parameters, is likely to have a large effect.

Yeah, I didn’t mean to imply that you guys said ‘simple --> large volume’ anywhere. I just think it’s a point worth emphasizing, especially around here where I think people will imagine “Solomonoff Induction-like” when they hear about a “bias towards simple functions”

Also, very briefly on your comment on feature learning—the GP limit is used to calculate the volume of functions locally to the initialization. The fact that kernel methods do not learn features should not be relevant given this interpretation

But in the infinite-width setting, Bayesian inference in general is given by a GP limit, right? Initialization doesn’t matter. This means that the arguments for lack of feature learning still go through. It’s technically possible that there could be feature learning in *finite*-width randomly-sampled networks, but it seems strange that finiteness would help here (and any such learning would be experimentally inaccessible). This is a major reason that I’m skeptical of the “SGD as a random sampler” picture.

If your goal is to play as well as the best go bot and/or write a program that plays equally well from scratch, it seems like it’s probably impossible. A lot of the go bot’s ‘knowledge’ could well be things like “here’s a linear combination of 20000 features of the board predictive of winning”. There’s no reason for the coefficients of that linear combination to be compressible in any way; it’s just a mathematical fact that these particular coefficients happen to be the best at predicting winning. If you accepted “here the model is taking a giant linear combination of features” as “understanding” it might be more doable.

Is that the empirical evidence attempts to demonstrate simple --> large volume but is inconclusive, or is it that the empirical evidence does not even attempt to demonstrate simple --> large volume?

They don’t really try to show simple --> large volume. What they show is that there is substantial ‘clustering’, so *some* simple functions have large volume. I like nostalgebraist’s remarks on their clustering measures.

so it seems a little unfair to say that the evidence is that the performance is similar, since that would suggest that they were just comparing max performance by SGD to max performance by NNGP.

Fair point, they do compare the distributions as well. I don’t think it’s too surprising that they’re similar since they compare them on the test points of the distribution which they were trained to fit.

It sounds like you do think there is some chance that neural network generalization is due to an architectural bias towards simplicity

I do, although I’m not sure if I would say ‘architectural bias’ since I think SGD might play an important role. Unfortunately I don’t really have too much substantial to say about this—Mingard is the only researcher I’m aware of explicitly trying to link networks to simplicity priors. I think the most promising way to make progress here is likely to be analyzing neural nets in some non-kernel limit like Greg Yang’s work or this paper.

They would exist in a *sufficiently* big random NN, but their density would be extremely low, I think. Like, if you train a normal neural net with 15000 neurons and then there’s a car detector, the density of car detectors is now 1/15000. Whereas I think the density at initialization is probably more like 1/2^50 or something like that (numbers completely made up), so they’d have a negligible effect on the NTK’s learning ability (‘slight tweaks’ can’t happen in the NTK regime since no intermediate functions change, by definition).

A difference with the pruning case is that the number of possible prunings increases exponentially with the number of neurons while the number of neurons is linear. My take on the LTH is that pruning is basically just a weird way of doing optimization, so it’s not that surprising you can get good performance.
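The density arithmetic here can be spelled out in a few lines, using the (explicitly made-up) numbers from the comment:

```python
# Hypothetical per-neuron chance that a randomly initialised neuron is
# already a 'car detector' (number completely made up, as in the comment).
p_detector = 2.0 ** -50

# Expected number of car detectors in a width-15000 random layer: negligible.
expected_at_init = 15_000 * p_detector

# Density after training, per the comment: one detector per 15000 neurons.
density_trained = 1 / 15_000

# Pruning is different in kind: candidate subnetworks grow exponentially in
# width, so even 2^-50-rare structures can appear in *some* pruning once the
# layer is only a few hundred neurons wide.
n_prunings_width_200 = 2 ** 200
```

The contrast is linear (neuron count) versus exponential (pruning count): a search over prunings has vastly more lottery tickets to choose from than a scan over individual random neurons.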

Yeah, that summary sounds right.

I’d say (b) -- it seems quite unlikely to me that the NTK/GP are universally data-efficient, while neural nets might be (although that’s mostly speculation on my part). I think the lack of feature learning is a stronger argument that NTK/GP don’t characterize neural nets well.

Yeah, exactly—the problem is that there are some small-volume functions which are actually simple. The argument for small-volume --> complex doesn’t go through since there could be other ways of specifying the function.

Other senses of simplicity include various circuit complexities and Levin complexity. There’s no argument that parameter-space volume corresponds to either of them AFAIK (you might think parameter-space volume would correspond to “neural net complexity”, the number of neurons in a minimal-size neural net needed to compute the function, but I don’t think this is true either—every parameter is Gaussian so it’s unlikely for most to be zero).

For reasons elaborated upon in this post and its comments, I’m kinda skeptical of these results. Basically the claims made are

(A) That the parameter->function map is “biased towards simple functions”. It’s important to distinguish simple --> large volume and large volume --> simple. Simple --> large volume is the property that Solomonoff induction has and what makes it universal, but large volume --> simple is what is proven in these papers (plus some empirical evidence of unclear import)

(B) SGD being equivalent to random selection. The evidence is empirical performance of Gaussian processes being similar to neural nets on simple tasks. But this may break down on more difficult problems (link is about the NTK, not GP, but they tend to perform similarly, indeed NTK usually performs better than GP)
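The distinction in (A) can be made concrete with a toy Monte Carlo estimate of a parameter->function map’s volume distribution. Everything here is hypothetical (a made-up 3-4-1 threshold net on 3-bit inputs); the point is only that observing a skewed, clustered volume distribution establishes large volume --> simple-ish for the top functions, not that *every* simple function gets large volume:

```python
import numpy as np

rng = np.random.default_rng(0)

# All 8 points of the 3-bit Boolean cube.
inputs = np.array([[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)],
                  dtype=float)

counts = {}
n_samples = 20_000
for _ in range(n_samples):
    # One random parameter draw of a tiny 3-4-1 threshold net.
    W = rng.normal(size=(3, 4)); b = rng.normal(size=4)
    v = rng.normal(size=4); b2 = rng.normal()
    h = np.maximum(inputs @ W + b, 0.0)
    f = tuple((h @ v + b2 > 0).astype(int))  # induced Boolean function
    counts[f] = counts.get(f, 0) + 1

volumes = sorted(counts.values(), reverse=True)
# Share of parameter space occupied by the ten highest-volume functions
# (out of 2^8 = 256 possible functions).
top_share = sum(volumes[:10]) / n_samples
```

Typically a handful of functions (the constants and near-linear ones) absorb a large share of the draws, while most of the 256 functions, including some intuitively simple ones like parity-like patterns, get little or no volume.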

Overall I think it’s likely we’ll need to actually analyze SGD in a non-kernel limit to get a satisfactory understanding of “what’s really going on” with neural nets.

There’s an important distinction^{[1]} to be made between these two claims:

A) Every function with large volume in parameter-space is simple

B) Every simple function has a large volume in parameter space

For a method of inference to qualify as a ‘simplicity prior’, you want both claims to hold. This is what lets us derive bounds like ‘Solomonoff induction matches the performance of any computable predictor’, since all of the simple, computable predictors have relatively large volume in the Solomonoff measure, so they’ll be picked out after boundedly many mistakes. In particular, you want there to be an implication like: if a function has complexity K, it will have parameter-volume at least roughly 2^-K.
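For reference, the Solomonoff-style version of that implication (the standard textbook bound, not anything specific to this thread) can be written as:

```latex
P(f) \;\ge\; 2^{-K(f)}
\quad\Longrightarrow\quad
L_{\mathrm{SI}}(x_{1:n}) \;\le\; L_{f}(x_{1:n}) + K(f)\ln 2
\quad \text{for every computable predictor } f \text{ and all } n,
```

where $L$ denotes cumulative log-loss: because every computable $f$ receives prior mass at least $2^{-K(f)}$, the universal predictor’s total regret against $f$ is bounded by a constant depending only on $f$’s complexity.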

Now, the Mingard results, at least the ones that have mathematical proof, rely on the Levin bound. This only shows (A), which is the direction that is much easier to prove—it automatically holds for any mapping from parameter-space to functions with bounded complexity. They also have some empirical results that show there is substantial ‘clustering’, that is, there are *some* simple functions that have large volumes. But this still doesn’t show that all of them do, and indeed is compatible with the learnable function class being extremely limited. For instance, this could easily be the case even if NTK/GP was only able to learn linear functions. In reality the NTK/GP is capable of approximating arbitrary functions on finite-dimensional inputs but, as I argued in another comment, this is not the right notion of ‘universality’ for classification problems. I strongly suspect^{[2]} that the NTK/GP can be shown to not be ‘universally data-efficient’ as I outlined there, but as far as I’m aware no one’s looked into the issue formally yet. Empirically, I think the results we have so far suggest that the NTK/GP is a decent first-order approximation for simple tasks that tends to perform worse on the more difficult problems that require non-trivial feature learning/efficiency.

I actually posted basically the same thing underneath another one of your comments a few weeks ago, but maybe you didn’t see it because it was only posted on LW, not the alignment forum ↩︎

Basically, because in the NTK/GP limit the functions for all the neurons in a given layer are sampled from a single computable distribution, so I think you can show that the embedding is ‘effectively finite’ in some sense (although note it *is* a universal approximator for fixed input dimension) ↩︎


Have you read much philosophy? If so, what are your favorite books/articles?

Any thoughts on the Neural Tangent Kernel/Gaussian Process line of research? Or attempts to understand neural network training at a theoretical level more generally?

By universal approximation, these features will be sufficient for any downstream learning task

Right, but trying to fit an unknown function with linear combinations of those features might be *extremely data-inefficient*, such that it is basically unusable for difficult tasks. Of course you could do better if you’re not restricted to linear combinations—for instance, if the map is injective you could invert back to the original space and apply whatever algorithm you wanted. But at that point you’re not really using the Fourier features at all. In particular, the NTK always learns a linear combination of its features, so it’s the efficiency of linear combinations that’s relevant here.

I agree that there is no learning taking place and that such a method may be inefficient. However, that goes beyond my original objection.

You originally said that the NTK doesn’t learn features because its feature class already has a good representation at initialization. What I was trying to convey (rather unclearly, admittedly) in response is:

A) There exist learning algorithms that have universal-approximating embeddings at initialization yet learn features. If we have an example of P and !Q, P-->Q cannot hold in general, so I don’t think it’s right to say that the NTK’s lack of feature learning is due to its universal-approximating property.

B) Although the NTK’s representation may be capable of approximating arbitrary functions, it will probably be very *slow* at learning some of them, perhaps so slow that using it is infeasible. So I would dispute that it already has ‘good’ representations. While it’s universal in one sense, there might be some other sense of ‘universal efficiency’ in which it’s lacking, and where feature-learning algorithms can outperform it.

This is not a trivial question

I agree that in practice there’s likely to be some relationship between universal approximation and efficiency, I just think it’s worth distinguishing them conceptually. Thanks for the paper link BTW, it looks interesting.
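The “universal but possibly data-inefficient” worry above can be sketched numerically: fit a discontinuous target by ridge regression on a *fixed* random Fourier basis of growing size, with no feature learning anywhere. Every detail here (target, bandwidth, ridge strength, sample sizes) is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

def target(x):
    # Step function: hard for a fixed smooth random basis near the jump.
    return np.sign(x)

X_train = rng.uniform(-1, 1, 200)
X_test = rng.uniform(-1, 1, 500)

def feats(x, omegas, phases):
    # Fixed random Fourier features; never updated during 'learning'.
    return np.cos(np.outer(x, omegas) + phases)

test_errs = []
for n_feats in (10, 100, 1000):
    omegas = rng.normal(scale=10.0, size=n_feats)
    phases = rng.uniform(0, 2 * np.pi, n_feats)
    Phi = feats(X_train, omegas, phases)
    # Ridge-regularised linear fit in the fixed basis.
    w = np.linalg.solve(Phi.T @ Phi + 0.1 * np.eye(n_feats),
                        Phi.T @ target(X_train))
    pred = feats(X_test, omegas, phases) @ w
    test_errs.append(float(np.mean((pred - target(X_test)) ** 2)))
```

The error does fall as the basis grows (universality), but only by throwing many more random features and samples at the problem than a method that could adapt its features to the jump; that gap is the "universal efficiency" notion being gestured at.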

I actually agree with you there, there was always discussion of GCR along with extinction risks (though I think Eliezer in particular was more focused on extinction risks). However, they’re still distinct categories: even the deadliest of pandemics is unlikely to cause extinction.