Thanks for this note, I really enjoyed reading it.
One terminological point: I’m a bit confused by your use of “program” / “effective program”, especially where you contrast “programs” with “neural networks”. In algorithmic information theory, a program is any string that computes a function on a UTM. Under that definition, a particular neural network (topology + weights, plus maybe a short ‘interpreter’) already is a program. The fact that weights are often modeled as continuous, while UTM inputs are discrete, doesn’t seem like a key difference, since weights are always discretized in practice.
So, I suspect your discussion of “finding short programs” is more about “finding simple functions”, i.e., low Kolmogorov-complexity functions with short compressed descriptions. That reformulation also makes more sense in light of generalization, since generalization is a property of the computed function, not the specific implementation (a lookup-table and a Fourier-style implementation of modular arithmetic generalize the same way, if they compute the same function).
In Solomonoff induction, the distinction between “finding short programs” and “finding simple functions” doesn’t really matter, since they end up being essentially the same. However, it seems this distinction is important in the setting of machine learning, where we are literally searching over a space of programs (parameterized neural networks) that all have the same description length (number of parameters * precision).
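To make that last point concrete: under this reading, a fixed forward-pass "interpreter" plus a flat weight vector is literally a (program, data) pair, and every network in the class has the same nominal description length of roughly (number of parameters) × (bits per parameter), e.g. on the order of 7×10^9 × 16 ≈ 10^11 bits for a 7B-parameter model at 16-bit precision. A minimal sketch, with all names and numbers my own illustration:

```python
import numpy as np

# Illustrative toy only: the "interpreter" is a fixed forward pass;
# the "program data" is a flat weight vector.

HIDDEN = 4
N_PARAMS = 3 * HIDDEN + 1            # W1, b1, W2, b2 for a 1-in / 1-out MLP
BITS_PER_PARAM = 32                  # e.g. float32

def forward(weights, x):
    """Two-layer tanh MLP on a scalar input; `weights` is a flat vector."""
    W1 = weights[:HIDDEN]                    # (HIDDEN,)
    b1 = weights[HIDDEN:2 * HIDDEN]          # (HIDDEN,)
    W2 = weights[2 * HIDDEN:3 * HIDDEN]      # (HIDDEN,)
    b2 = weights[3 * HIDDEN]                 # scalar
    h = np.tanh(W1 * x + b1)
    return float(np.dot(W2, h) + b2)

theta = np.random.default_rng(0).normal(size=N_PARAMS)
print(forward(theta, 0.5))
# Every theta in this class has the same nominal description length:
print("description length:", N_PARAMS * BITS_PER_PARAM, "bits")
```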
Hey Artemy, glad you enjoyed the post!
Thanks for this comment; I think this is a pretty crucial but subtle clarification that I had a hard time conveying.
You’re right that there’s some tension here. A neural network already trivially is a program. And you’re right that in Solomonoff induction, “short program” and “simple function” collapse together by definition.
But I think the reformulation to “finding simple functions” loses something crucial. The key is that there’s a three-layer picture, not two:
Parameters → Effective Program → Function
I tried to gesture at this in the post:

"The network's architecture trivially has compositional structure—the forward pass is executable on a computer. That's not the claim. The claim is that training discovers an effective program within this substrate. Think of an FPGA: a generic grid of logic components that a hardware engineer configures into a specific circuit. The architecture is the grid; the learned weights are the configuration."
As in, an FPGA configuration doesn't directly specify a function—it specifies a circuit, which then computes a function. All configurations have the same bit-length, but the circuits they implement vary enormously in complexity. You make the effective circuit simpler than the full substrate by leaving logic blocks dead, paths redundant, and so on.
Now, when it comes to neural networks, all networks in an architecture class have the same parameter count, as you note. But through degeneracies (dead neurons, low-rank weights, redundant structure, and a whole lot more), the “effective program” can be far simpler than the parameter count suggests.
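A minimal numpy sketch of the kind of degeneracy I mean (the example and names are my own, purely illustrative): a wide network whose extra hidden units are dead computes exactly the same function as a much smaller network, even though its nominal parameter count is far larger.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, W1, b1, W2, b2):
    """One-hidden-layer ReLU net applied to a batch of scalar inputs x."""
    h = np.maximum(0.0, x[:, None] * W1 + b1)   # (batch, hidden)
    return h @ W2 + b2                          # (batch,)

# "Small" network: 2 hidden units, 7 parameters.
W1s, b1s = rng.normal(size=2), rng.normal(size=2)
W2s, b2s = rng.normal(size=2), rng.normal()

# "Wide" network: 100 hidden units, 301 parameters, but 98 units are dead
# (zero outgoing weight), so its *effective* program is the small network.
W1w = np.concatenate([W1s, rng.normal(size=98)])
b1w = np.concatenate([b1s, rng.normal(size=98)])
W2w = np.concatenate([W2s, np.zeros(98)])
b2w = b2s

x = np.linspace(-3.0, 3.0, 1000)
assert np.allclose(mlp(x, W1s, b1s, W2s, b2s), mlp(x, W1w, b1w, W2w, b2w))
# Identical outputs everywhere on this grid, wildly different parameter counts.
```

In terms of the three-layer picture above, both networks correspond to the same point in function space but to very different effective programs.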
Now, why does this middle layer matter? Why not just talk about functions directly?
Because there’s no notion of function simplicity that doesn’t route through implementations. To say a function is “simple” (low Kolmogorov complexity) is to say there exists a short program computing it. But “exists” quantifies over all possible programs. Gradient descent doesn’t get access to that—it only sees the implementation it actually has, and the immediate neighborhood in parameter space. Two programs computing the same function aren’t interchangeable from the learner’s perspective; only one of them is actually in front of you, and that’s what determines where you can go next.
This means two parameter configurations that implement literally the same function everywhere can still behave completely differently under training. They sit at different points in the loss landscape. They have different stability properties. They’re differently accessible from other regions of parameter space. The learning process distinguishes them even though the function doesn’t. (The concrete mechanism is the parameter-function map; the loss is the same if the function is the same, but the gradients are not.)
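The smallest example I know of this "same loss, different gradients" point (my toy example, not from the post): a two-parameter model f(x) = a·b·x, where (a, b) = (1, 6) and (a, b) = (2, 3) implement the identical function 6x yet receive different gradient updates.

```python
# Two parameter settings that implement the same function f(x) = 6x
# but sit at different points of the loss landscape.

def loss_and_grad(a, b, x, y):
    """Squared error for the model f(x) = a * b * x, with grads w.r.t. (a, b)."""
    err = a * b * x - y
    return err ** 2, (2 * err * b * x, 2 * err * a * x)

x, y = 1.0, 0.0                              # one training point, target 0

for a, b in [(1.0, 6.0), (2.0, 3.0)]:        # both compute f(x) = 6x
    loss, (ga, gb) = loss_and_grad(a, b, x, y)
    print(f"a={a}, b={b}: loss={loss}, grad=({ga}, {gb})")

# Both settings have loss 36.0, but the gradients differ:
# (1, 6) -> (72.0, 12.0)   vs   (2, 3) -> (36.0, 24.0),
# so gradient descent moves them to different places despite equal functions.
```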
This is why I actually think it’s important to stick with “finding simple (effective) programs” over “finding simple functions,” as the distinction really does matter.
Now of course, how to make "effective program" precise is not obvious at all. That's where I start talking about singular learning theory :)
Thanks, I think I see your point! Still, it seems important to be clear that the core claim—that neural networks are biased toward learning functions with better generalization—is ultimately a claim about the learned function. If I understand correctly, you’re offering one explanation for why this might happen: training induces an implicit search over “effective programs,” and that search has a simplicity bias.
I’m wondering whether your argument can be framed via an analogy to Solomonoff induction. In standard Solomonoff induction, among the functions consistent with the training data, a function’s weight is proportional to the probability that it is computed by a universal Turing machine fed with random bits (for prefix-free machines, this yields a bias toward shorter prefixes). Are you thinking of something analogous for neural networks: among functions consistent with the data, a function’s weight is proportional to the probability that it is computed by a neural-network interpreter when fed with random weights?
Still, it seems important to be clear that the core claim—that neural networks are biased toward learning functions with better generalization—is ultimately a claim about the learned function.

If I understand correctly, I completely agree, with one small nit: "generalization" isn't really a property of a function in isolation—it's a property of a learning setup. A function just is what it is; it doesn't generalize or fail to generalize.
But if I understand what you mean correctly, you're saying something like: "neural networks are biased toward learning functions that perform well on real-world tasks." I think this is a very important point that should be emphasized. It requires two things to work together: (1) the learner has a simplicity bias, and (2) real-world tasks preferentially admit simple solutions. Neither alone is enough.
And I think you're pointing to the fact that (2) still needs explanation, which is true. It's ultimately a property of the real world, not of neural networks, so you need some kind of model-independent notion of "simplicity" like Kolmogorov complexity to make this even coherent. I would 100% agree. There's more I could say here, but it rapidly gets speculative and is probably a longer discussion.
In standard Solomonoff induction, among the functions consistent with the training data, a function's weight is proportional to the probability that it is computed by a universal Turing machine fed with random bits (for prefix-free machines, this yields a bias toward shorter prefixes). Are you thinking of something analogous for neural networks: among functions consistent with the data, a function's weight is proportional to the probability that it is computed by a neural-network interpreter when fed with random weights?

This is a really sharp question, thank you. I think there's certainly a Bayesian story of neural networks you could tell here which would work exactly like this (in fact I believe there's some in-progress SLT work in this direction). I think there's a good chance this is like 80% of the story.
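Writing the analogy out explicitly (my notation, so treat it as a sketch of what this Bayesian story would say): the Solomonoff weight of a function and its proposed neural-network analogue would be

$$
P_{\mathrm{Sol}}(f) \;\propto\; \sum_{p \,:\, U(p) = f} 2^{-|p|},
\qquad
P_{\mathrm{NN}}(f) \;\propto\; \Pr_{\theta \sim p_{\mathrm{init}}}\!\big[\, f_\theta = f \,\big],
$$

with both weights then conditioned on consistency with the training data.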
But there are some ways in which we know SGD must be different. For instance, Bayes should give pseudorandom functions some nontrivial prior weight (they exist in parameter space, they can be hardcoded into the parameters), but SGD will basically never find them (see my discussion on Mixa's comment). So I think ultimately you need to talk about learning dynamics. This is also a place where there's a lot more I could say (I think there's decent evidence about the ways in which SGD's preferences over programs differ), but it's a longer discussion.
Interesting! Are you familiar with the paper “Deep neural networks have an inbuilt Occam’s razor”, itself building on the work of “Input–output maps are strongly biased towards simple outputs” by Dingle et al.? I feel like it may be getting at something closely related.
Hi Artemy. Welcome to LessWrong!
Agree completely with what Zach is saying here.
We need two facts:
(1) the world has a specific inductive bias
(2) neural networks have the same specific inductive bias
Indeed, no-free-lunch arguments seem to require any good learner to have a good inductive bias. In a sense, learning is 'mostly' about having the right inductive bias.
We call this specific inductive bias a simplicity bias. Informally it agrees with our intuitive notion of low complexity.
Rk. Conceptually it is a little tricky, since simplicity is in the eye of the beholder—by changing the background language (the reference machine), we can make anything with high algorithmic complexity have low complexity. People have been working on this problem for a while, but at the moment it seems radically tricky.
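For reference, the standard way to state how far this relativity goes is the invariance theorem:

$$
\bigl|\, K_U(x) - K_V(x) \,\bigr| \;\le\; c_{U,V} \quad \text{for all } x,
$$

where the constant depends only on the two reference machines and not on x. The catch is that nothing prevents c_{U,V} from being astronomically large for a gerrymandered background language, so any particular fixed object can be made to look simple, which is exactly what makes a machine-independent notion of simplicity so slippery.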
IIRC Aram Ebtekar has a proposed solution that John Wentworth likes; I haven't understood it myself yet. I think what one wants to say is that the [algorithmic] mutual information between the observer and the observed is low, where the observer implicitly encodes the universal Turing machine used. In other words—the world is such that observers within it observe it to have low complexity with regard to their implicit reference machine.
Regardless, the fact that the real world satisfies a simplicity bias is to my mind difficult to explain without anthropics. I am afraid we may end up having to resort to an appeal to some form of UDASSA but others may have other theological commitments.
That's the bird's-eye view of simplicity bias. If you ignore the above issue and accept some sort of formally-tricky-to-define but informally "reasonable" simplicity, then the question becomes: why do neural networks have a bias towards simplicity? Well, they have a bias towards degeneracy—and simplicity and degeneracy are intimately connected, see e.g.:
https://www.lesswrong.com/posts/tDkYdyJSqe3DddtK4/alexander-gietelink-oldenziel-s-shortform?commentId=zH42TS7KDZo9JimTF
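For readers who don't follow the link, the rough shape of the degeneracy-simplicity connection in SLT (stated here from memory, so treat it as a sketch) is Watanabe's free-energy asymptotics:

$$
F_n \;=\; -\log \int e^{-n L_n(w)} \,\varphi(w)\, dw \;\approx\; n L_n(w^*) \;+\; \lambda \log n,
$$

where \(\lambda\) is the learning coefficient (RLCT). More degenerate solutions have smaller \(\lambda\) and pay a smaller \(\log n\) penalty, which is the sense in which the Bayesian posterior prefers them.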
“Some form of UDASSA” seems to be right. Why not simply take “difficult to explain otherwise” as evidence (defeasible, of course, like with evidence of physical theories)?
Alex, thanks for the welcome (happy to be here!) and the summary.
I’m generally familiar with this line of thought. My main comment is that the Solomonoff perspective feels somewhat opposite to the usual NFL/“inductive bias” story (where the claim is that a good image model needs a good prior over image statistics, etc.). Yes, a simplicity bias is a kind of inductive bias, but it’s supposed to be universal (domain independent). And if generic architectures really give us this prior “for free”, as suggested by results like Dingle et al., then it seems the hard part isn’t the prior, but rather being able to sample from it conditioned on low training error (i.e., the training process).
That said, this line of reasoning—if taken literally—seems difficult to reconcile with some observed facts, e.g., things like architecture choice and data augmentation do seem to matter for generalization. To me, these suggest that you need some inductive bias beyond algorithmic simplicity alone. (Possibly another way to think about it: the smaller the dataset, the more the additive constant in Kolmogorov complexity starts to matter.)
Yeah, I’ve seen those! They do express similar ideas, and I think Chris does serious work. I completely agree with the claim “the parameter-function map is strongly biased towards simple functions,” basically for SLT reasons (though one has to be careful here to avoid saying something tautological, it’s not as simple as “SLT says learning is biased towards lower LLC solutions, therefore it has a simplicity bias”).
There’s a good blog post discussing this work that I read a few years ago. Keeping in mind I read this three years ago and so this might be unfair, I remember my opinion on that specific paper being something along the lines of “the ideas are great, the empirical evidence seems okay, and I don’t feel great about the (interpretation of the) theory.” In particular I remember reading this comment thread and agreeing with the perspective of interstice, for whatever that’s worth. But I could change my mind.