As far as I can tell, that’s the entire meat of this post. Programs can be seen as fancy curves. Okay, I see a few paragraphs about that, plus pages and pages of math tangents. What else am I supposed to get from reading?
I think you’ve pattern-matched this to a claim I’m not making, and in fact the claim runs in the opposite direction from what you describe.
You write: “Programs can be seen as fancy curves.” Call this Claim A: program synthesis is a special case of function approximation. Programs are functions, so searching for programs is searching for functions. That’s almost trivially true.
But the post is arguing Claim B: deep learning—which looks like generic function approximation—is actually implementing something like Solomonoff induction. These aren’t the same claim. Claim A is obvious; Claim B is surprising and non-obvious.
Not all function approximation is program synthesis. Polynomial regression isn’t “secretly doing program synthesis” in any non-trivial sense. Neither are kernel methods, decision trees, or genetic programming on fixed AST grammars. These are narrow hypothesis classes where you, the practitioner, chose a representation embedding your assumptions about problem structure. I explicitly address this in the post:
Two things make this different from ordinary function fitting.
First, the search is general-purpose. Linear regression searches over linear functions. Decision trees search over axis-aligned partitions. These are narrow hypothesis classes, chosen by the practitioner to match the problem. The claim here is different: deep learning searches over a space that can express essentially any efficient computable function. It’s not that networks are good at learning one particular kind of structure—it’s that they can learn whatever structure is there.
Second, the search is guided by strong inductive biases. Searching over all programs is intractable without some preference for certain programs over others. The natural candidate is simplicity: favor shorter or less complex programs over longer or more complex ones. This is what Solomonoff induction does—it assigns prior probability to programs based on their length, then updates on data.
The puzzle isn’t “how does function approximation relate to programs?” The puzzle is: how could gradient descent on a deep neural network possibly behave like a simplicity-biased search over programs, when we never specified that objective, and when we don’t know how to do that efficiently (poly-time) ourselves? The remarkable thing about deep learning is that it searches a universal hypothesis class and somehow doesn’t drown in it.
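For concreteness, the simplicity-biased ideal I keep referring to is just the textbook object (nothing new here, stated only so the target of the comparison is explicit): Solomonoff induction weights every program by its length and then predicts by ordinary Bayesian conditioning.

```latex
% Universal prior over programs p for a prefix universal machine U, with |p| the length in bits:
P(p) = 2^{-|p|}
% Induced prior on data; U(p) = x\ast means program p makes U print something beginning with x:
M(x) = \sum_{p \,:\, U(p) = x\ast} 2^{-|p|}
% Prediction is then just conditioning:
M(x_{t+1} \mid x_{1:t}) = \frac{M(x_{1:t}\, x_{t+1})}{M(x_{1:t})}
```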
“I did not get what the point of this post is or the intended takeaways.”
I write at the beginning that I see this as “a snapshot of a hypothesis that seems increasingly hard to avoid, and a case for why formalization is worth pursuing. I discuss the key barriers and how tools like singular learning theory might address them towards the end of the post.” That is, I am 1. making a hypothesis about how deep learning works, and 2. attempting to sketch research directions that aim to formally nail down the hypothesis.
“BTW, you’ll probably find the keyword ‘program induction’ more fruitful than ‘program synthesis.’”
This is a good point on terminology—I did go back and forth. “Program induction” is closer to the most relevant existing literature, but I chose “synthesis” deliberately. “Induction” implies sequence prediction and doesn’t extend naturally to settings like RL. But the bigger reason I chose “synthesis”: I expect deep learning is doing something closer to constructing programs than enumerating over them. Gradient descent seems not to be (and couldn’t possibly be) doing brute-force search through program space, but rather building something incrementally. That’s the sharpest break I expect from Solomonoff, and “synthesis” seems to gesture at it better than “induction” does. But I could be convinced to change my mind.
Hmmm. Okay then, I’d like to understand your point.
But first, can we clear up some terminological confusion?
From this comment, it seems you are using “program synthesis” in ways which are precisely the opposite of its usual meaning. This means I need to do substantial reverse-engineering of what you mean in every line, in addition to trying to understand your points directly.
“These are narrow hypothesis classes where you, the practitioner, chose a representation embedding your assumptions about problem structure.”
This is a very confusing thing to write. I think you’re saying that various techniques are differentiated from program synthesis because they “choose a representation embedding assumptions about the problem structure.” But can you point to any papers in the program synthesis literature which don’t do this?
“But the bigger reason I chose ‘synthesis’: I expect deep learning is doing something closer to constructing programs than enumerating over them. Gradient descent seems not to be (and couldn’t possibly be) doing brute-force search through program space …”
I think you’re saying that you use the term “program synthesis” to mean “things that are not brute-force searches over program space.”
But a brute-force search over program space is a perfectly valid synthesis technique! It’s literally the first technique discussed in my advisor’s Intro to Program Synthesis course. See https://people.csail.mit.edu/asolar/SynthesisCourse/Lecture2.htm (scroll down to “Explicit Enumeration”).
I think we’re talking past each other and getting stuck on terminology. Let me be more explicit about what the thesis is contrasting with, without the contested terms.
The default view of neural networks—the one implicit in most ML theory and practice—is that they’re black-box function approximators. The Universal Approximation Theorem says they can represent any continuous function given enough parameters. We train them end-to-end, we don’t know what the weights mean, and we don’t expect them to mean anything in particular. They’re solutions to an optimization problem, not representations of anything.
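(For reference, the statement being leaned on is the classic one-hidden-layer form, which is standard and not specific to the post: for any continuous target on a compact domain, a wide enough single hidden layer with a fixed non-polynomial activation gets within any desired tolerance.)

```latex
% Universal Approximation Theorem, one-hidden-layer form:
% for any continuous f on a compact K \subset \mathbb{R}^d, any \varepsilon > 0, and any fixed
% non-polynomial activation \sigma, there exist N and parameters c_i, w_i, b_i such that
\sup_{x \in K} \Big|\, f(x) - \sum_{i=1}^{N} c_i \, \sigma(w_i \cdot x + b_i) \,\Big| < \varepsilon
```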
The thesis is that this view is wrong, or at least deeply incomplete. When we look inside trained networks (mechanistic interpretability), we find compositional, algorithm-like structures: edge detectors composing into shape detectors composing into object detectors; a transformer learning a Fourier-based algorithm for modular addition. These aren’t arbitrary learned functions—they look like programs built from reusable parts.
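To make “a Fourier-based algorithm for modular addition” concrete, here is a toy sketch of the clock-style algorithm those interpretability write-ups describe. The modulus and frequencies below are illustrative choices of mine, and the real networks implement this with attention heads and MLP neurons rather than an explicit formula; the point is only that the mechanism is a legible algorithm, not an arbitrary lookup table.

```python
import numpy as np

# Toy sketch of the "clock" / Fourier algorithm reported inside transformers trained on
# modular addition: the logit for a candidate answer c is a sum of cosines of (a + b - c),
# which interfere constructively only when c == (a + b) mod p.

p = 113                      # a prime modulus (illustrative choice)
freqs = [1, 5, 17, 42]       # a handful of frequencies; trained networks use a sparse set like this

def clock_logits(a: int, b: int) -> np.ndarray:
    c = np.arange(p)
    return sum(np.cos(2 * np.pi * k * (a + b - c) / p) for k in freqs)

a, b = 37, 95
prediction = int(np.argmax(clock_logits(a, b)))
print(prediction, (a + b) % p)   # both print 19
```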
The claim is that this is not a coincidence. Deep learning succeeds because it’s finding such structures, because it’s performing something like a simplicity-biased search over a universal hypothesis class of compositional solutions (read: space of programs). This sounds a lot like Solomonoff induction, which we already know is the theoretical ideal of supervised learning. That is: what if secretly deep learning is performing a tractable approximation to the learning algorithm we already know is optimal?
If you already take it as obvious that “learning is finding programs,” even when it comes to deep neural networks, then yes, much of the post will seem like elaboration rather than argument. But that’s not the mainstream view in ML, and the evidence that networks actually learn interpretable compositional algorithms is relatively recent and not widely appreciated.
Does that clarify where the thesis is coming from?
Thanks for the clarification, Zach! Since I’ve been accused of lazy reading, I want to finish off the terminological discussion and put you in my shoes a bit. I get your motivation for preferring “program synthesis” over “program induction,” but “program synthesis” is an established term with about 60 years of history. Basically, to understand how I read it: replace every use of “program synthesis” with “techniques for searching constrained spaces of programs using things like SAT solvers and Monte Carlo.” If you also replace “Daniel Selsam” with “researcher in SAT solvers who started off in a lab that uses Monte Carlo to generate assembly programs,” then I actually think it becomes hard not to read it the way that I did—the way that you said was the opposite of the intended reading. And there aren’t really any clear cues that you are not talking about program synthesis in the existing sense—no clear “I define a new term” paragraph. You might think that the lack of citations to established synthesis researchers would be a tell, but, unfortunately, experience has taught me that it’s fairly weak evidence.
So I read it again, this time replacing “program synthesis” with “Solomonoff induction.” It does indeed read very differently.
And I read your last comment.
And my main reaction was: “Wait, you mean many people don’t already see things this way?”
I mean, it’s been a full decade since Jacob Andreas defended his thesis on modularity in neural networks. I just checked his Google Scholar. First paper (>1600 citations, published 2016), abstract, first sentence: “Visual question answering is fundamentally compositional in nature.”
If neural nets are not doing a simplicity-biased search over compositional solutions, then what are they doing? Finding the largest program before the smaller ones? Not computing intermediate results? Not sharing substructure across tasks, even when that would take fewer weights? Developing novel algorithms that solve the problem in one shot, without any substeps?
I’d naively expect neural nets to, by default, be about as modular as biological systems, or as programs evolved with genetic programming (GP): in many ways very modular, but also with a lot of crazy coupling and fragility. Neural nets gain efficiency over GP by being able to start with a superposition of a vast number of solutions and grow or shrink many of them simultaneously. They can also be given stronger regularization than in biology. I would expect them to find a simple-ish solution, but not the simplest. If they only found the most complex ones, that would be way more impressive.
“Wait, you mean many people don’t already see things this way?”
I wish this were obvious :). But it’s 2026 and we’re still getting op-eds from professors arguing that LLMs are “stochastic parrots” or “just shallow pattern-matching.” I wish I could tell laypeople “the scientific consensus is that this is wrong,” but I can’t because there isn’t one. The scaling hypothesis was a fringe position five years ago and remains controversial today. “No free lunch” and “deep learning will just overfit” were standard objections until embarrassingly recently, and you’ll still hear them occasionally.
If (a tractable approximation to) the provably optimal learning algorithm isn’t “real learning,” I don’t know what would be. Yet clearly many smart people don’t believe this. And this leads to seriously different expectations about where the field will go: if deep learning is implementing something like Solomonoff induction, the default expectation shifts toward scaling continuing to work—not because scaling must work (efficiency limits, data constraints, a thousand practical issues could intervene), but because there’s no in-principle reason to expect a hard ceiling. That’s something people still dispute, loudly.
That being said, as I mention at the beginning, these ideas aren’t new or original to me, and some people do believe versions of these sorts of claims. Let me crudely break it down into a few groups:
Average ML practitioner / ICLR attendee: The working assumption is still closer to “neural networks do black-box function approximation,” maybe with vague gestures toward “feature learning” or something. They may be aware of mechanistic interpretability but haven’t integrated it into their worldview. They’re usually only dimly aware of the theoretical puzzles, if aware of them at all (why doesn’t the UAT construction work in practice? why does the same network generalize on real data yet also memorize random labels? a toy version of the latter is sketched after this list).
ML theory community: Well … it depends on what subfield they’re in; knowledge is often pretty siloed. Still, bits and pieces are somewhat well-known by different groups, under different names: things like compositionality, modularity, simplicity bias, and so on. For instance, Poggio’s group put out a great paper back in 2017 positing that compositionality is the main way that neural networks avoid the curse of dimensionality in approximation theory. Even then, I think these often remain isolated explanations for isolated phenomena, rather than something like a unified thesis. They’re also not necessarily consensus—people are still proposing alternative explanations of the generalization problem in 2026! The idea of deep learning as a “universal” learning algorithm, while deeply felt by some, is rarely stated explicitly, and I expect it would receive substantial pushback (“well, deep learning performs badly in X scenario where I kneecapped it”). Many know of Solomonoff induction but don’t think about it in the context of deep learning.
Frontier labs / mechanistic interpretability / AIT-adjacent folks: Here, “deep learning is performing something like Solomonoff induction” wouldn’t make anyone blink, even if they might quibble with the details. It’s already baked into this worldview that deep learning is some kind of universal learning algorithm, that it works by finding mechanistic solutions, and that there are few in-principle barriers. But even in this group, many aren’t aware of the totality of the evidence—e.g., I talked to an extremely senior mech interp researcher who wasn’t aware of the approximation theory / depth separation evidence. And few will defend the hypothesis in public. Even among those who accept the informal vibes, far fewer think the connection could actually be made formal, or hold true in any precise sense.
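As for the random-label puzzle mentioned under the first group, here is a scaled-down stand-in for the classic randomization experiments, with toy data and a toy MLP of my own choosing rather than the original setup. It only demonstrates the memorization half of the puzzle; the puzzling part is that the same kind of model, trained the same way, generalizes when the labels are real.

```python
import torch
import torch.nn as nn

# Scaled-down stand-in for the random-label experiment: the same small MLP drives
# training accuracy to ~100% whether the labels follow a simple rule or are pure noise.
# (Toy data, architecture, and hyperparameters are illustrative choices, not the original setup.)

torch.manual_seed(0)
n, d = 512, 64
X = torch.randn(n, d)
structured_labels = (X[:, 0] > 0).long()     # labels a one-feature rule explains
random_labels = torch.randint(0, 2, (n,))    # labels nothing explains

def train_accuracy(y: torch.Tensor) -> float:
    net = nn.Sequential(nn.Linear(d, 256), nn.ReLU(), nn.Linear(256, 2))
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for _ in range(2000):                    # full-batch gradient steps
        opt.zero_grad()
        nn.functional.cross_entropy(net(X), y).backward()
        opt.step()
    return (net(X).argmax(dim=1) == y).float().mean().item()

print("structured labels:", train_accuracy(structured_labels))  # ~1.0
print("random labels:    ", train_accuracy(random_labels))      # also ~1.0 (pure memorization)
```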
So: the post isn’t claiming novelty for the hypothesis (I say this explicitly at the start). It’s trying to put the evidence in one place, say the thing outright in public, and point toward formalization. If you’re already in the third group, much of it will read as elaboration. But that’s not where most people are.
Thank you! This really made your thesis click for me.