Thanks, I think I see your point! Still, it seems important to be clear that the core claim—that neural networks are biased toward learning functions with better generalization—is ultimately a claim about the learned function. If I understand correctly, you’re offering one explanation for why this might happen: training induces an implicit search over “effective programs,” and that search has a simplicity bias.
I’m wondering whether your argument can be framed via an analogy to Solomonoff induction. In standard Solomonoff induction, among the functions consistent with the training data, a function’s weight is proportional to the probability that it is computed by a universal Turing machine fed with random bits (for prefix-free machines, this yields a bias toward shorter prefixes). Are you thinking of something analogous for neural networks: among functions consistent with the data, a function’s weight is proportional to the probability that it is computed by a neural-network interpreter when fed with random weights?
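To make the analogy concrete (rough notation of mine, just a sketch): write $U$ for a prefix-free universal machine, $F_\theta$ for the function a fixed architecture computes at weights $\theta$, and $p_0$ for the initialization distribution. The two priors I have in mind are

$$P_{\mathrm{Sol}}(f) \;\propto\; \sum_{p \,:\, U(p) = f} 2^{-|p|}, \qquad P_{\mathrm{NN}}(f) \;\propto\; \Pr_{\theta \sim p_0}\big[F_\theta = f\big],$$

and the question is whether conditioning $P_{\mathrm{NN}}$ on fitting the training data behaves like the corresponding conditioning in Solomonoff induction.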
If I understand correctly, I completely agree, with one small nit: “generalization” isn’t really a property of a function in isolation—it’s a property of a learning setup. A function just is what it is; it doesn’t generalize or fail to generalize.
But if I understand what you mean correctly, you’re saying something like: “neural networks are biased toward learning functions that perform well on real-world tasks.” I think this is a very important point and worth emphasizing. It requires two things to work together: (1) the learner has a simplicity bias, and (2) real-world tasks preferentially admit simple solutions. Neither alone is enough.
And I think you’re pointing to the fact that (2) still needs explanation, which is true. It’s ultimately a property of the real world, not of neural networks, so you need some model-independent notion of “simplicity,” like Kolmogorov complexity, to make the claim even coherent. I would 100% agree. There’s more I could say here, but it rapidly gets speculative and is probably a longer discussion.
This is a really sharp question, thank you. I think there’s certainly a Bayesian story of neural networks you could tell here which would work exactly like this (in fact I believe there’s some in-progress SLT work in this direction). I think there’s a good chance this is like 80% of the story.
But there are some ways in which we know SGD must be different. For instance, Bayes should give pseudorandom functions some nontrivial prior weight (they exist in parameter space; they can be hardcoded into the parameters), but SGD will basically never find them (see my discussion on Mixa’s comment). So I think ultimately you need to talk about learning dynamics. This is also a place where there’s a lot more I could say (I think there’s decent evidence about the ways SGD differs in the programs it prefers), but it’s a longer discussion.
Interesting! Are you familiar with the paper “Deep neural networks have an inbuilt Occam’s razor”, itself building on the work of “Input–output maps are strongly biased towards simple outputs” by Dingle et al.? I feel like it may be getting at something closely related.
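In case it helps to see the flavor of that result, here is a toy version of the parameter-function-map sampling experiment (my own quick sketch in numpy, not the papers’ code; the architecture, sample count, and compressibility proxy are arbitrary choices): sample random weights for a small MLP over all Boolean inputs, tally which Boolean function each draw implements, and compare each function’s empirical frequency with a crude compressibility measure of its truth table.

```python
import numpy as np
from collections import Counter
import zlib

rng = np.random.default_rng(0)
n_inputs = 5                      # 2^5 = 32 input points, 2^32 possible Boolean functions
X = np.array([[(i >> b) & 1 for b in range(n_inputs)]
              for i in range(2 ** n_inputs)], dtype=float)

def random_function(width=32):
    """Sample random Gaussian weights for a one-hidden-layer tanh MLP and
    return the Boolean function it implements as a 0/1 string over all inputs."""
    W1 = rng.normal(0.0, 1.0, size=(n_inputs, width))
    b1 = rng.normal(0.0, 1.0, size=width)
    W2 = rng.normal(0.0, 1.0, size=width)
    h = np.tanh(X @ W1 + b1)
    out = (h @ W2 > 0).astype(int)
    return "".join(map(str, out))

def complexity_proxy(bits: str) -> int:
    """Crude stand-in for descriptional complexity: zlib-compressed length of the truth table."""
    return len(zlib.compress(bits.encode()))

counts = Counter(random_function() for _ in range(100_000))

# Functions hit more often by random weights should tend to have a lower complexity proxy.
for f, c in counts.most_common(5):
    print(f"freq={c / 100_000:.4f}  complexity~{complexity_proxy(f)}  f={f}")
rare = min(counts, key=counts.get)
print(f"rarest sampled: freq={counts[rare] / 100_000:.5f}  complexity~{complexity_proxy(rare)}")
```

In runs like this you’d typically expect a few very simple functions (e.g., the constants) to soak up a disproportionate share of the samples, which is the qualitative “simplicity bias of the parameter-function map” those papers report.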
Hi Artemy. Welcome to LessWrong!
Agree completely with what Zach is saying here.
We need two facts:
(1) the world has a specific inductive bias
(2) neural networks have the same specific inductive bias
Indeed, no-free-lunch arguments seem to require any good learner to have a good inductive bias. In a sense, learning is ‘mostly’ about having the right inductive bias.
We call this specific inductive bias a simplicity bias. Informally it agrees with our intuitive notion of low complexity.
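One rough way to make the two facts, and why you need both, precise (my sketch; constants are not careful): say the learner is simplicity-biased if its prior over computable hypotheses satisfies $P(f) \ge 2^{-K(f) - O(1)}$, and say the world is simple if the true data-generating $f^*$ has small $K(f^*)$. If you predict with the Bayes mixture $\xi(x_{1:n}) = \sum_f P(f)\,\mathbf{1}[f \text{ generates } x_{1:n}]$ on data actually produced by $f^*$, then for every $n$

$$\sum_{t=1}^{n} -\log_2 \xi\big(x_t \mid x_{<t}\big) \;=\; -\log_2 \xi\big(x_{1:n}\big) \;\le\; -\log_2 P(f^*) \;\le\; K(f^*) + O(1),$$

so the total surprise, in bits, ever incurred is small only when both conditions hold: a prior that rewards simplicity, and a world that is in fact simple.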
Rk. Conceptually it is a little tricky since simplicity is in the eye of the beholder—by changing the background language we can make anything with high algorithmic complexity have low complexity. People have been working on this problem for a while but at the moment it seems radically tricky.
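(For reference, the formal version of “in the eye of the beholder”: the invariance theorem only pins complexity down up to a machine-dependent constant,

$$\big|\,K_U(x) - K_V(x)\,\big| \;\le\; c_{U,V} \quad \text{for all } x,$$

where $c_{U,V}$ depends on the universal machines $U$ and $V$ but not on $x$; and for any fixed $x$ one can choose a universal $V$ that hardcodes $x$, so that $K_V(x) = O(1)$ even when $K_U(x)$ is enormous.)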
IIRC Aram Ebtekar has a proposed solution that John Wentworth likes; I haven’t understood it myself yet. I think what one wants to say is that the [algorithmic] mutual information between the observer and the observed is low, where the observer implicitly encodes the universal Turing machine used. In other words: the world is such that observers within it observe it to have low complexity with regard to their implicit reference machine.
Regardless, the fact that the real world satisfies a simplicity bias is, to my mind, difficult to explain without anthropics. I am afraid we may end up having to resort to some form of UDASSA, but others may have other theological commitments.
That’s the bird’s-eye view of simplicity bias. If you ignore the above issue and accept some formally-tricky-to-define but informally “reasonable” notion of simplicity, then the question becomes: why do neural networks have a bias towards simplicity? Well, they have a bias towards degeneracy, and simplicity and degeneracy are intimately connected; see e.g.:
https://www.lesswrong.com/posts/tDkYdyJSqe3DddtK4/alexander-gietelink-oldenziel-s-shortform?commentId=zH42TS7KDZo9JimTF
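(Rough statement of the connection, for readers who haven’t seen SLT, eliding lots of regularity conditions: Watanabe’s asymptotic expansion of the Bayesian free energy around a solution $w^*$ is

$$F_n \;\approx\; n\,L(w^*) \;+\; \lambda \log n,$$

where $\lambda$ is the local learning coefficient, which is smaller the more degenerate the loss landscape is at $w^*$. Lower $\lambda$ means lower free energy and hence exponentially more posterior mass, which is the sense in which degeneracy acts as an effective simplicity/Occam term.)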
“Some form of UDASSA” seems to be right. Why not simply take “difficult to explain otherwise” as evidence (defeasible, of course, like with evidence of physical theories)?
Alex, thanks for the welcome (happy to be here!) and the summary.
I’m generally familiar with this line of thought. My main comment is that the Solomonoff perspective feels somewhat opposite to the usual NFL/“inductive bias” story (where the claim is that a good image model needs a good prior over image statistics, etc.). Yes, a simplicity bias is a kind of inductive bias, but it’s supposed to be universal (domain independent). And if generic architectures really give us this prior “for free”, as suggested by results like Dingle et al., then it seems the hard part isn’t the prior, but rather being able to sample from it conditioned on low training error (i.e., the training process).
That said, this line of reasoning, if taken literally, seems difficult to reconcile with some observed facts: things like architecture choice and data augmentation do seem to matter for generalization. To me, these suggest that you need some inductive bias beyond algorithmic simplicity alone. (Possibly another way to think about it: the smaller the dataset, the more the additive constant in Kolmogorov complexity starts to matter.)
Yeah, I’ve seen those! They do express similar ideas, and I think Chris does serious work. I completely agree with the claim “the parameter-function map is strongly biased towards simple functions,” basically for SLT reasons (though one has to be careful here to avoid saying something tautological; it’s not as simple as “SLT says learning is biased towards lower-LLC solutions, therefore it has a simplicity bias”).
There’s a good blog post discussing this work that I read a few years ago. Keeping in mind that this was about three years ago and my recollection might be unfair, my opinion on that specific paper was something along the lines of “the ideas are great, the empirical evidence seems okay, and I don’t feel great about the (interpretation of the) theory.” In particular I remember reading this comment thread and agreeing with the perspective of interstice, for whatever that’s worth. But I could change my mind.