Zach Furman
The short answer is: no, but hopefully in the future. The long answer is: oh man, that was the post(s) I originally planned to write, this post was originally something of a preface that I thought would take a few hours, and it took >100, so… we’ll see if I can muster the effort. For now, you might get something out of looking into singular learning theory, and possibly the papers I referenced in the post, though I’m very much abruptly throwing you into the deep end here (apologies).
Yeah, I’ve seen those! They do express similar ideas, and I think Chris does serious work. I completely agree with the claim “the parameter-function map is strongly biased towards simple functions,” basically for SLT reasons (though one has to be careful here to avoid saying something tautological, it’s not as simple as “SLT says learning is biased towards lower LLC solutions, therefore it has a simplicity bias”).
There’s a good blog post discussing this work that I read a few years ago. Keeping in mind I read this three years ago and so this might be unfair, I remember my opinion on that specific paper being something along the lines of “the ideas are great, the empirical evidence seems okay, and I don’t feel great about the (interpretation of the) theory.” In particular I remember reading this comment thread and agreeing with the perspective of interstice, for whatever that’s worth. But I could change my mind.
Agreed. I guess the small saving grace (as I’m sure you know) is that the TM under the hood is used to limit the computational power of the family rather than contribute to it. But yeah that part is kind of ugly.
Still, it seems important to be clear that the core claim—that neural networks are biased toward learning functions with better generalization—is ultimately a claim about the learned function.
If I understand correctly, I completely agree, with one small nit: “generalization” isn’t really a property of a function in isolation—it’s a property of a learning setup. A function just is what it is; it doesn’t generalize or fail to generalize.
But if I understand what you mean correctly, you’re saying something like: “neural networks are biased toward learning functions that perform well on real-world tasks.” I think this is a very important point that should be emphasized. This requires two things to work together: (1) the learner has a simplicity bias, and (2) real-world tasks preferentially admit simple solutions. Neither alone is enough.
And I think you’re pointing to the fact that (2) still needs explanation, which is true. It’s ultimately a property of the real world, not of neural networks, so you need some kind of model-independent notion of “simplicity” like Kolmogorov complexity to make this even coherent. I would 100% agree. There’s more I could say here but it rapidly gets speculative and is probably a longer discussion.
In standard Solomonoff induction, among the functions consistent with the training data, a function’s weight is proportional to the probability that it is computed by a universal Turing machine fed with random bits (for prefix-free machines, this yields a bias toward shorter prefixes). Are you thinking of something analogous for neural networks: among functions consistent with the data, a function’s weight is proportional to the probability that it is computed by a neural-network interpreter when initialized with random weights?
This is a really sharp question, thank you. I think there’s certainly a Bayesian story of neural networks you could tell here which would work exactly like this (in fact I believe there’s some in-progress SLT work in this direction). I think there’s a good chance this is like 80% of the story.
But there’s some ways that we know SGD must be different. For instance, Bayes should give pseudorandom functions some nontrivial prior weight (they exist in parameter space, they can be hardcoded into the parameters), but SGD will basically never find them (see my discussion on Mixa’s comment). So I think ultimately you need to talk about learning dynamics. Here is also a place where there’s a lot more I could say (I think there’s decent evidence as to what ways SGD differs in the programs it prefers), but it’s a longer discussion.
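(If anyone wants to poke at the “random weights” picture from the question above, here's a rough toy experiment of my own (nothing from this thread; the architecture and initialization scale are arbitrary choices): sample random weights for a tiny MLP on 2-bit inputs and tabulate which of the 16 Boolean functions the thresholded network computes. In small setups like this, the induced distribution over truth tables tends to be heavily skewed toward simple functions like constants, a crude finite analogue of the “random program” prior.)

```python
import numpy as np
from collections import Counter

# Toy experiment: which Boolean function does a randomly initialized tiny MLP compute?
rng = np.random.default_rng(0)
inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

def sample_function(hidden=8):
    W1 = rng.normal(size=(2, hidden)); b1 = rng.normal(size=hidden)
    W2 = rng.normal(size=hidden);      b2 = rng.normal()
    h = np.maximum(inputs @ W1 + b1, 0)          # ReLU hidden layer
    out = (h @ W2 + b2 > 0).astype(int)          # threshold the output
    return tuple(out)                            # the truth table, as a 4-tuple

counts = Counter(sample_function() for _ in range(20_000))
for table, count in counts.most_common(5):
    print(table, count / 20_000)                 # a few simple tables hog the mass
```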
Yeah, good catch—the correct notion here is a (uniform) family of Boolean circuits, which are Turing-complete in the sense that every uniform circuit family decides a decidable language, and every decidable language can be decided by a uniform circuit family. (Usefully, they also preserve complexity theory: for example, a language is in P if and only if it can be decided by a P-uniform polynomial-size circuit family.)
Hey Artemy, glad you enjoyed the post!
Thanks for this comment, I think this is a pretty crucial but subtle clarification that I had a hard time conveying.
You’re right that there’s some tension here. A neural network already trivially is a program. And you’re right that in Solomonoff induction, “short program” and “simple function” collapse together by definition.
But I think the reformulation to “finding simple functions” loses something crucial. The key is that there’s a three-layer picture, not two:
Parameters → Effective Program → Function
I tried to gesture at this in the post:
The network’s architecture trivially has compositional structure—the forward pass is executable on a computer. That’s not the claim. The claim is that training discovers an effective program within this substrate. Think of an FPGA: a generic grid of logic components that a hardware engineer configures into a specific circuit. The architecture is the grid; the learned weights are the configuration.
As in, an FPGA configuration doesn’t directly specify a function—it specifies a circuit, which then computes a function. All configurations have the same bit-length, but the circuits they implement vary enormously in complexity. The effective circuit can be made much simpler than the full substrate by leaving dead logic blocks, redundant paths, etc.
Now, when it comes to neural networks, all networks in an architecture class have the same parameter count, as you note. But through degeneracies (dead neurons, low-rank weights, redundant structure, and a whole lot more), the “effective program” can be far simpler than the parameter count suggests.
Now, why does this middle layer matter? Why not just talk about functions directly?
Because there’s no notion of function simplicity that doesn’t route through implementations. To say a function is “simple” (low Kolmogorov complexity) is to say there exists a short program computing it. But “exists” quantifies over all possible programs. Gradient descent doesn’t get access to that—it only sees the implementation it actually has, and the immediate neighborhood in parameter space. Two programs computing the same function aren’t interchangeable from the learner’s perspective; only one of them is actually in front of you, and that’s what determines where you can go next.
This means two parameter configurations that implement literally the same function everywhere can still behave completely differently under training. They sit at different points in the loss landscape. They have different stability properties. They’re differently accessible from other regions of parameter space. The learning process distinguishes them even though the function doesn’t. (The concrete mechanism is the parameter-function map; the loss is the same if the function is the same, but the gradients are not.)
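To make that concrete, here's a minimal toy example of my own (not from the post): the two-parameter model $f(x) = a \cdot \mathrm{relu}(w x)$, where the settings $(a, w) = (1, 1)$ and $(2, 0.5)$ implement the exact same function (since $a \cdot \mathrm{relu}(wx) = aw \cdot \mathrm{relu}(x)$ for $w > 0$) yet sit at different points of the loss landscape with different gradients.

```python
import numpy as np

# Two parameter settings that implement the identical function f(x) = relu(x),
# but give different loss gradients: a * relu(w * x) = (a * w) * relu(x) for w > 0,
# so any (a, w) with a * w = 1 computes the same function everywhere.

relu = lambda z: np.maximum(z, 0)
loss = lambda a, w, x, y: 0.5 * (a * relu(w * x) - y) ** 2

x, y, eps = 2.0, 3.0, 1e-5
for a, w in [(1.0, 1.0), (2.0, 0.5)]:
    g_a = (loss(a + eps, w, x, y) - loss(a - eps, w, x, y)) / (2 * eps)
    g_w = (loss(a, w + eps, x, y) - loss(a, w - eps, x, y)) / (2 * eps)
    print(f"a={a}, w={w}: f(x)={a * relu(w * x)}, grad=({g_a:.3f}, {g_w:.3f})")
# Same function value and loss at both points, but the gradients differ.
```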
This is why I actually think it’s important to stick with “finding simple (effective) programs” over “finding simple functions,” as the distinction really does matter.
Now of course, making precise what “effective program” even means is not obvious at all. That’s where I start talking about singular learning theory :)
Yes, this is a great paper, and one of the first papers that put me on to the depth separation literature. Can definitely recommend.
Re: “the explanation of how the hierarchical nature of physical processes mean that the subset of functions we care about will tend to have a hierarchical structure and so be well-suited for deep networks to model,” I think this is a fascinating topic, and there’s much to be said here. My personal view here is that this question (but not the answer) is essentially equivalent to the physical Church-Turing thesis: somehow, reality is something that can universally be well-described by compositional procedures (i.e. programs). Searching around for “explanations of the physical Church-Turing thesis” will point you to a wider literature in physics and philosophy on the topic.
“Wait, you mean many people don’t already see things this way?”
I wish this were obvious :). But it’s 2026 and we’re still getting op-eds from professors arguing that LLMs are “stochastic parrots” or “just shallow pattern-matching.” I wish I could tell laypeople “the scientific consensus is that this is wrong,” but I can’t because there isn’t one. The scaling hypothesis was a fringe position five years ago and remains controversial today. “No free lunch” and “deep learning will just overfit” were standard objections until embarrassingly recently, and you’ll still hear them occasionally.
If (a tractable approximation to) the provably optimal learning algorithm isn’t “real learning,” I don’t know what would be. Yet clearly many smart people don’t believe this. And this leads to seriously different expectations about where the field will go: if deep learning is implementing something like Solomonoff induction, the default expectation shifts toward continued scaling—not to say that scaling must work (efficiency limits, data constraints, a thousand practical issues could intervene), but because there’s no in-principle reason to expect a hard ceiling. That’s something people still dispute, loudly.
That being said, as I mention at the beginning, these ideas aren’t new or original to me, and some people do believe versions of these sorts of claims. Let me crudely break it down into a few groups:
Average ML practitioner / ICLR attendee: The working assumption is still closer to “neural networks do black-box function approximation,” maybe with vague gestures toward “feature learning” or something. They may be aware of mechanistic interpretability but haven’t integrated it into their worldview. They’re usually only dimly aware of the theoretical puzzles if at all (why doesn’t the UAT construction work? why does the same network generalize on real data and memorize random labels?).
ML theory community: Well … it depends what subfield they’re in, knowledge is often pretty siloed. Still, bits and pieces are somewhat well-known by different groups, under different names: things like compositionality, modularity, simplicity bias, etc. For instance, Poggio’s group put out a great paper back in 2017 positing that compositionality is the main way that neural networks avoid the curse of dimensionality in approximation theory. Even then, I think these often remain isolated explanations for isolated phenomena, rather than something like a unified thesis. They’re also not necessarily consensus—people still propose alternate explanations to the generalization problem in 2026! The idea of deep learning as a “universal” learning algorithm, while deeply felt by some, is rarely stated explicitly, and I expect would likely receive substantial pushback (“well deep learning performs badly in X scenario where I kneecapped it”). Many know of Solomonoff induction but don’t think about it in the context of deep learning.
Frontier labs / mechanistic interpretability / AIT-adjacent folks: Here, “deep learning is performing something like Solomonoff induction” wouldn’t make anyone blink, even if they might quibble with the details. It’s already baked into this worldview that deep learning is some kind of universal learning algorithm, that it works by finding mechanistic solutions, that there are few in-principle barriers. But even in this group, many aren’t aware of the totality of the evidence—e.g. I talked to an extremely senior mech interp researcher who wasn’t aware of the approximation theory / depth separation evidence. And few will defend the hypothesis in public. Even among those who accept the informal vibes, far fewer actively think the connection could possibly be made formal or true in any precise sense.
So: the post isn’t claiming novelty for the hypothesis (I say this explicitly at the start). It’s trying to put the evidence in one place, say the thing outright in public, and point toward formalization. If you’re already in the third group, much of it will read as elaboration. But that’s not where most people are.
Thank you for bringing this up, the story of learning dynamics is a fascinating one and something I didn’t get the time to treat properly in the post. There really is (at least in theory) quite a substantial gap between Bayesian learning (what Solomonoff uses) and SGD learning.
A favorite example of mine is the task of learning pseudorandom functions: these cannot be learned in polynomial time, or else you could distinguish pseudorandom numbers from truly random numbers and break cryptography. So because SGD training runs in polynomial time, it’s impossible for SGD training to find a pseudorandom function. Bayesian learning (which does not run in polynomial time), on the other hand, would find the pseudorandom function without trouble, because you can literally hardcode a pseudorandom generator into the weights of a sufficiently large neural network (Bayes is sort of “omnipotent” over the entire parameter space).
As you mention, learning parities is another great example of a task which is easy for Bayes but hard for SGD (though the reason is different from pseudorandom functions). There is a rich literature here, including the paper you linked by Barak et al., which is a favorite of mine. Another personal favorite is the later paper on leap complexity, which shows how the structure of these computational obstacles may give SGD learning a “hysteresis” or “history dependent” effect—learning earlier things shapes what things you learn later, something which doesn’t happen at all for Bayes.
I think we’re talking past each other and getting stuck on terminology. Let me be more explicit about what the thesis is contrasting with, without the contested terms.
The default view of neural networks—the one implicit in most ML theory and practice—is that they’re black-box function approximators. The Universal Approximation Theorem says they can represent any continuous function given enough parameters. We train them end-to-end, we don’t know what the weights mean, and we don’t expect them to mean anything in particular. They’re solutions to an optimization problem, not representations of anything.
The thesis is that this view is wrong, or at least deeply incomplete. When we look inside trained networks (mechanistic interpretability), we find compositional, algorithm-like structures: edge detectors composing into shape detectors composing into object detectors; a transformer learning a Fourier-based algorithm for modular addition. These aren’t arbitrary learned functions—they look like programs built from reusable parts.
The claim is that this is not a coincidence. Deep learning succeeds because it’s finding such structures, because it’s performing something like a simplicity-biased search over a universal hypothesis class of compositional solutions (read: space of programs). This sounds a lot like Solomonoff induction, which we already know is the theoretical ideal of supervised learning. That is: what if secretly deep learning is performing a tractable approximation to the learning algorithm we already know is optimal?
If you already take it as obvious that “learning is finding programs,” even when it comes to deep neural networks, then yes, much of the post will seem like elaboration rather than argument. But that’s not the mainstream view in ML, and the evidence that networks actually learn interpretable compositional algorithms is relatively recent and not widely appreciated.
Does that clarify where the thesis is coming from?
As far as I can tell, that’s the entire meat of this post. Programs can be seen as fancy curves. Okay, I see a few paragraphs about that, plus pages and pages of math tangents. What else am I supposed to get from reading?
I think you’ve pattern-matched this to a claim I’m not making, and in fact the claim runs in the opposite direction from what you describe.
You write: “Programs can be seen as fancy curves.” Call this Claim A: program synthesis is a special case of function approximation. Programs are functions, so searching for programs is searching for functions. That’s almost trivially true.
But the post is arguing Claim B: deep learning—which looks like generic function approximation—is actually implementing something like Solomonoff induction. These aren’t the same claim. Claim A is obvious; Claim B is surprising and non-obvious.
Not all function approximation is program synthesis. Polynomial regression isn’t “secretly doing program synthesis” in any non-trivial sense. Neither are kernel methods, decision trees, or genetic programming on fixed AST grammars. These are narrow hypothesis classes where you, the practitioner, chose a representation embedding your assumptions about problem structure. I explicitly address this in the post:
Two things make this different from ordinary function fitting.
First, the search is general-purpose. Linear regression searches over linear functions. Decision trees search over axis-aligned partitions. These are narrow hypothesis classes, chosen by the practitioner to match the problem. The claim here is different: deep learning searches over a space that can express essentially any efficient computable function. It’s not that networks are good at learning one particular kind of structure—it’s that they can learn whatever structure is there.
Second, the search is guided by strong inductive biases. Searching over all programs is intractable without some preference for certain programs over others. The natural candidate is simplicity: favor shorter or less complex programs over longer or more complex ones. This is what Solomonoff induction does—it assigns prior probability to programs based on their length, then updates on data.
The puzzle isn’t “how does function approximation relate to programs?” The puzzle is: how could gradient descent on a deep neural network possibly behave like a simplicity-biased search over programs, when we never specified that objective, and when we don’t know how to do that efficiently (poly-time) ourselves? The remarkable thing about deep learning is that it searches a universal hypothesis class and somehow doesn’t drown in it.
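For anyone unfamiliar with the length-weighted prior being described, here is a deliberately cartoonish sketch of my own; the “programs” are just hand-written truth tables with made-up lengths, not outputs of an actual universal machine.

```python
import numpy as np

# Toy caricature of a length-weighted prior plus Bayesian update:
# hypotheses are (program_length, truth_table) pairs with prior weight 2^-length,
# and the posterior keeps only hypotheses consistent with the observed bits.

hypotheses = [
    (3, (0, 0, 0, 0)),   # "always 0" -- short
    (4, (0, 1, 0, 1)),   # "copy second bit" -- short-ish
    (9, (0, 1, 0, 0)),   # an ad hoc table -- long
]
observed = {0: 0, 1: 1}   # input indices we've seen, with their labels

prior = np.array([2.0 ** -length for length, _ in hypotheses])
consistent = np.array([all(table[i] == y for i, y in observed.items())
                       for _, table in hypotheses], dtype=float)
posterior = prior * consistent
posterior /= posterior.sum()
print(posterior)   # mass concentrates on the shortest consistent hypothesis
```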
I did not get what the point of this post is or the intended takeaways.
I write at the beginning that I see this as “a snapshot of a hypothesis that seems increasingly hard to avoid, and a case for why formalization is worth pursuing. I discuss the key barriers and how tools like singular learning theory might address them towards the end of the post.” That is, I am 1. making a hypothesis about how deep learning works, and 2. attempting to sketch research directions that aim to formally nail down the hypothesis.
BTW, you’ll probably find the keyword “program induction” more fruitful than “program synthesis.”
This is a good point on terminology—I did go back and forth. “Program induction” is closer to the most relevant existing literature, but I chose “synthesis” deliberately. “Induction” implies sequence prediction and doesn’t extend naturally to settings like RL. But the bigger reason I chose “synthesis”: I expect deep learning is doing something closer to constructing programs than enumerating over them. Gradient descent seems not to be (and couldn’t possibly be) doing brute-force search through program space, but rather building something incrementally. That’s the sharpest break I expect from Solomonoff, and “synthesis” seems to gesture at it better than “induction” does. But I could be convinced to change my mind.
Deep learning as program synthesis
Really love the post overall, thank you for putting this together! There’s a whole lot I could say here, but for now I’ll just be annoying and go off on a tangent which I wrote up before realizing just how tangential it is. I think it still might be a useful clarification for some people, so I’ll post it anyway though.
In some sense the main geometric object of SLT is the Fisher information matrix $I(\theta)$, which is the negative expected Hessian of the log-likelihood (that is, the Hessian of the population loss).
I’m sure you’re aware of this and were just presenting a simplified explanation, but for anyone who might take this too much at face value, the Fisher information matrix (FIM) is not the negative expected Hessian of the log-likelihood in general. The FIM is defined as
$$I(\theta) = \mathbb{E}_{x \sim p_\theta}\!\left[\nabla_\theta \log p_\theta(x)\,\nabla_\theta \log p_\theta(x)^\top\right] = -\mathbb{E}_{x \sim p_\theta}\!\left[\nabla_\theta^2 \log p_\theta(x)\right],$$
whereas the (negative expected) Hessian of the log-likelihood is
$$H(\theta) = -\mathbb{E}_{x \sim q}\!\left[\nabla_\theta^2 \log p_\theta(x)\right].$$
The difference is whether the expectation is taken over the true distribution $q$ or the current model distribution $p_\theta$. They coincide only when 1. the data-generating distribution is realizable by a true parameter $\theta_0$ such that $q = p_{\theta_0}$, and 2. we’re evaluating the Hessian at $\theta = \theta_0$.
I mention this only because the FIM is often misunderstood. See for instance this paper detailing various common conflations people make here.
How do I think of the FIM, then? A more geometric way to understand the FIM is as follows. Consider the Hellinger embedding, which maps parameters to square-root probability densities:
$$\theta \mapsto \sqrt{p_\theta}.$$
The target space is (a subset of) $L^2(X)$, which has a flat Euclidean inner product $\langle f, g \rangle = \int f(x)\,g(x)\,dx$. Let $J$ be the Jacobian of this map. Then the FIM is simply:
$$I(\theta) = 4\,J^\top J.$$
This is clarifying because it makes precise exactly the story the post is telling, about some parameter directions being “sloppier” vs “more sensitive” at impacting the model. The key fact is that eigenvalues of the FIM are (four times) the squared singular values of the Jacobian $J$.
Large eigenvalues of the FIM thus correspond to parameter directions that move the distribution significantly in $L^2$; small eigenvalues correspond to directions along which the distribution barely changes. The kernel of $J$ (equivalently, of the FIM) consists of directions that are completely invisible to the model—distinct parameters, identical distributions.
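If it's helpful, here's a small numerical sanity check of my own (the softmax toy model and finite-difference Jacobians are my choices, not anything from the comment above) that the score-outer-product definition of the FIM agrees with $4 J^\top J$ for the Hellinger embedding:

```python
import numpy as np

# Check that I(theta) = 4 J^T J, where J is the Jacobian of theta -> sqrt(p_theta),
# for a small softmax model over K outcomes with D parameters.

rng = np.random.default_rng(0)
K, D = 5, 3
features = rng.normal(size=(K, D))

def probs(theta):
    logits = features @ theta
    z = np.exp(logits - logits.max())
    return z / z.sum()

def jacobian(f, theta, eps=1e-6):
    """Central-difference Jacobian of f: R^D -> R^K, shape (K, D)."""
    cols = []
    for i in range(len(theta)):
        e = np.zeros_like(theta); e[i] = eps
        cols.append((f(theta + e) - f(theta - e)) / (2 * eps))
    return np.stack(cols, axis=1)

theta = rng.normal(size=D)
p = probs(theta)

# Definition 1: expected outer product of the score under the model distribution.
score_jac = jacobian(lambda t: np.log(probs(t)), theta)
fim_score = (p[:, None, None] * score_jac[:, :, None] * score_jac[:, None, :]).sum(0)

# Definition 2: 4 J^T J for the Hellinger embedding theta -> sqrt(p_theta).
hellinger_jac = jacobian(lambda t: np.sqrt(probs(t)), theta)
fim_hellinger = 4 * hellinger_jac.T @ hellinger_jac

print(np.allclose(fim_score, fim_hellinger, atol=1e-6))   # True
```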
I’ve been trying to understand modules for a long time. They’re a particular algebraic structure in commutative algebra which seems to show up everywhere any time you get anywhere close to talking about rings—and I could never figure out why. Any time I have some simple question about algebraic geometry, for instance, it almost invariably terminates in some completely obtuse property of some module. This confused me. It was never particularly clear to me from their definition why modules should be so central, or so “deep.”
I’m going to try to explain the intuition I have now, mostly to clarify this for myself, but also incidentally in the hope of clarifying this for other people. I’m just a student when it comes to commutative algebra, so inevitably this is going to be rather amateur-ish and belabor obvious points, but hopefully that leads to something more understandable to beginners. This will assume familiarity with basic abstract algebra. Unless stated otherwise, I’ll restrict to commutative rings because I don’t understand much about non-commutative ones.
The typical motivation for modules: “vector spaces but with rings”
The typical way modules are motivated is simple: they’re just vector spaces, but you relax the definition so that you can replace the underlying field with a ring. That is, an $R$-module is a ring $R$ and an abelian group $(M, +)$, together with an operation $\cdot : R \times M \to M$ that respects the ring structure of $R$, i.e. for all $r, s \in R$ and $x, y \in M$:
$$r \cdot (x + y) = r \cdot x + r \cdot y,$$
$$(r + s) \cdot x = r \cdot x + s \cdot x,$$
$$(rs) \cdot x = r \cdot (s \cdot x),$$
$$1 \cdot x = x.$$
This is literally the definition of a vector space, except we haven’t required our scalars to be a field, only a ring (i.e. multiplicative inverses don’t have to always exist). So, like, instead of multiplying vectors by real numbers or something, you multiply vectors by integers, or a polynomial—those are your scalars now. Sounds simple, right? Vector spaces are pretty easy to understand, and nobody really thinks about the underlying field of a vector space anyway, so swapping the field out shouldn’t be much of a big deal, right?
As you might guess from my leading questions, this intuition is profoundly wrong and misleading. Modules are far more rich and interesting than vector spaces. Vector spaces are quite boring; two vector spaces over the same field are isomorphic if and only if they have the same dimension. This means vector spaces are typically just thought of as background objects, a setting which you quickly establish before building the interesting things on top of. This is not the role modules play in commutative algebra—they carry far more structure intrinsically.
A better motivation for modules: “ring actions”
Instead, the way you should think about a module is as a representation of a ring or a ring action, directly analogous to group actions or group representations.[1] Modules are a manifestation of how the ring acts on objects in the “real world.”[2] Think about how group actions tell you how groups manifest in the real world: a dihedral group is a totally abstract object, but it manifests concretely in things like “mirroring/rotating my photo in Photoshop,” where implicitly there is now an action of that group on the pixels of my photo.
In fact, we can define modules just by generalizing the idea of group actions to ring actions. How might we do this? Well, remember that a group action of a group $G$ on a set $X$ is just a group homomorphism $G \to \mathrm{End}(X)$, where $\mathrm{End}(X)$ is the set of all endomorphisms on $X$ (maps from $X$ to itself). In other words, for every element of $G$, we just have to specify how it transforms elements of $X$. Can we get a “ring action” of a ring $R$ on a set $X$ just by looking at a ring homomorphism $R \to \mathrm{End}(X)$?
That doesn’t quite work, because $\mathrm{End}(X)$ is in general only a monoid under composition, not a ring—what would the “addition” operation be? But if we insist that $X$ is a group $M$, giving it an addition operation, that turns $\mathrm{End}(M)$ into a ring, with the operation $(f + g)(x) := f(x) + g(x)$. (This addition rule only works to produce another endomorphism if $M$ is abelian, which is a neat little proof in itself!) Now our original idea works: we can define a “ring action” of a ring $R$ on an abelian group $M$ as a ring homomorphism $R \to \mathrm{End}(M)$.
It turns out that this is equivalent to our earlier definition of a module: our “ring action” of $R$ on $M$ is literally just the definition of an $R$-module $M$. We got to the definition of a module just by taking the definition of a group action and generalizing to rings in the most straightforward possible way.
To make this more concrete, let’s look at a simple illustrative example, which connects modules directly to linear algebra.
Let our ring be $R = \mathbb{C}[x]$, the ring of polynomials with complex coefficients.
Let our abelian group be $M = \mathbb{C}^2$, the standard 2D plane.
A module structure is a ring homomorphism $\varphi: \mathbb{C}[x] \to \mathrm{End}(\mathbb{C}^2)$. The ring $\mathrm{End}(\mathbb{C}^2)$ is just the ring of 2x2 complex matrices, $M_2(\mathbb{C})$.
Now, a homomorphism out of a polynomial ring like $\mathbb{C}[x]$ is completely determined by one thing: where you send the variable $x$. So, to define a whole module structure, all we have to do is choose a single 2x2 matrix $A$ to be the image of $x$.
The action of any other polynomial, say $p(x) = x^2 + 1$, on a vector $v$ is then automatically defined: $p(x) \cdot v = (A^2 + I)\,v$. In short, the study of $\mathbb{C}[x]$-modules on $\mathbb{C}^2$ is literally the same thing as the study of 2x2 matrices.
Thinking this way immediately clears up some confusing “module” properties. For instance, in a vector space, if a scalar times a vector is zero ($c \cdot v = 0$), then either $c = 0$ or $v = 0$. This fails for modules. In our $\mathbb{C}[x]$-module defined by the matrix $A = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}$, the polynomial $x^2$ is not zero, but for any vector $v$, $x^2 \cdot v = A^2 v = 0$. This phenomenon, called torsion, feels strange from a vector space view, but from the “ring action” view, it’s natural: some of our operators (like $x^2$) can just happen to be the zero transformation.
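Here's a small numerical toy of my own (the helper `act` and the coefficient convention are just illustrative) showing the $\mathbb{C}[x]$-action as “plug the matrix into the polynomial,” including the torsion phenomenon where $x^2$ acts as zero:

```python
import numpy as np

# A C[x]-module structure on C^2 given by a matrix A: the polynomial
# sum_i c_i x^i acts on a vector v as (sum_i c_i A^i) @ v.

def act(poly_coeffs, A, v):
    """Action of the polynomial with coefficients (low degree to high) on v."""
    result = np.zeros_like(v, dtype=complex)
    power = np.eye(A.shape[0], dtype=complex)
    for c in poly_coeffs:
        result = result + c * (power @ v)
        power = power @ A
    return result

N = np.array([[0, 1], [0, 0]], dtype=complex)   # nilpotent choice for the image of x
v = np.array([1, 1], dtype=complex)

print(act([0, 1], N, v))       # x . v = N v = (1, 0): not zero
print(act([0, 0, 1], N, v))    # x^2 . v = N^2 v = (0, 0): torsion!
```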
Why are modules so deep?
Seeing modules as “ring actions” starts to explain why you should care about them, but it still doesn’t explain why the theory of modules is so deep. After all, if we compare to group actions, the theory of group actions is in some sense just the theory of groups—there’s really not much to say about a faithful group action beyond the underlying group itself. Group theorists don’t spend much time studying group actions in their own right—why do commutative algebraists seem so completely obsessed with modules? Why are modules more interesting?
The short answer is that the theory of modules is far richer: even when it comes to “faithful” modules, there may be multiple, non-isomorphic modules for the same ring $R$ and abelian group $M$. In other words, unlike group actions, the ways in which a given ring can manifest concretely are not unique, even when acting faithfully on the same object.
To show this concretely, let’s return to our module example. Remember that a $\mathbb{C}[x]$-module structure $\varphi$ on $\mathbb{C}^2$ is determined just by the matrix $A = \varphi(x)$, where we choose to send the element $x$. We can define two different modules here which are not isomorphic:
First, suppose $A = D = \begin{pmatrix} 2 & 0 \\ 0 & 3 \end{pmatrix}$. In this module, call it $M_1$, the action of a polynomial $p(x)$ on a vector $v$ is: $p(x) \cdot v = p(D)\,v$. The action of $x$ is to scale the first coordinate by 2 and the second by 3.
Alternatively, suppose $A = N = \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix}$. In this module, call it $M_2$, the action of a polynomial $p(x)$ on a vector $v$ is: $p(x) \cdot v = p(N)\,v$. The action of $x$ sends the vector $(a, b)$ to $(b, 0)$. The action of $x^2$ sends everything to the zero vector because $N^2$ is the zero matrix.
These are radically different modules.
In module $M_1$ (defined by $D$), the action of $x$ is diagonalizable. The module can be broken down into two simple submodules (the x-axis and the y-axis). This is a semisimple module.
In module $M_2$ (defined by $N$), the action of $x$ is not diagonalizable. There is a submodule (the x-axis, spanned by $(1, 0)$) but it has no complementary submodule. $M_2$ is indecomposable but not simple.
The underlying abelian group $\mathbb{C}^2$ is the same in both cases. The ring $\mathbb{C}[x]$ is the same. But the “action,” which is determined by our choice of the matrix $A$, creates completely different algebraic objects.
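One quick numerical way to see the difference, for whatever it's worth (this is just standard linear algebra, my own aside): $D$ has two independent eigenvectors while $N$ has only a single invariant line, which is exactly why $M_1$ splits into simple submodules and $M_2$ doesn't.

```python
import numpy as np

# D is diagonalizable (two independent invariant lines); N is not (only one).
D = np.array([[2, 0], [0, 3]], dtype=float)
N = np.array([[0, 1], [0, 0]], dtype=float)

for name, A in (("D", D), ("N", N)):
    eigvals, eigvecs = np.linalg.eig(A)
    rank = np.linalg.matrix_rank(eigvecs)    # number of independent eigenvectors
    print(name, eigvals, "independent eigenvectors:", rank)
# D: rank 2 (diagonalizable); N: rank 1 (a single invariant line, spanned by (1, 0))
```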
The analogy: a group is a person, a ring is an actor
This was quite a perspective shift for me, and something I had trouble thinking about intuitively. A metaphor I found helpful is to think of it like this: a group is like a person, while a ring is like an actor.
A group is a person: A group has a fixed identity. Its essence is its internal structure of symmetries. When a group acts on a set $X$, it’s like a person interacting with their environment. The person doesn’t change; the action simply reveals their inherent nature. If the action is trivial, it means that person has no effect on that particular environment. If it’s faithful, the environment perfectly reflects the person’s structure. But the person remains the same entity. The question is: “What does this person do to this set?”
A ring is an actor: A ring is more like a professional actor. An actor has an identity, skills, and a range (the ring’s internal structure). But their entire purpose is to manifest as different characters in different plays.
The module is the performance.
The underlying abelian group of $M$ is the stage.
The specific module structure (the homomorphism $\varphi: R \to \mathrm{End}(M)$) is the role or the script the actor is performing.
Using our example: The actor is the polynomial ring, $\mathbb{C}[x]$. This actor has immense range.
Performance 1: The actor is cast in the role $x \mapsto D$. The performance is straightforward, stable, and easily understood (semisimple).
Performance 2: The same actor is cast in the role $x \mapsto N$. The performance is a complex tragedy. The character’s actions are subtle, entangled, and ultimately lead to nothingness (nilpotent).
The “actor” is the same. The “stage” is the same. But the “performances” are fundamentally different. To understand the actor $\mathbb{C}[x]$, you cannot just watch one performance. You gain a more complete appreciation for their range, subtlety, and power by seeing all the different roles they can inhabit.
Why can rings manifest in multiple ways?
It’s worth pausing for a moment to notice the surprising fact that modules can have this sort of intricate structure, when group actions don’t. What is it about rings that lets them act in so many different “roles”, in a way that groups can’t?
In short, the ring axioms force the action of a ring’s elements to behave like linear transformations rather than simple permutations.
A group only has one operation. Its action only needs to preserve that one operation: $(gh) \cdot x = g \cdot (h \cdot x)$. It’s a rigid structural mapping.
A ring has two operations ($+$ and $\times$) that are linked by distributivity. A ring action (module) must preserve both. It must be a homomorphism of rings. This extra constraint (linearity with respect to addition) is what paradoxically opens the door to variety. It demands that the ring’s elements behave not just like permutations, but like linear transformations. And there are many, many ways to represent a system of abstract operators as concrete linear transformations on a space. The vast gap between group actions and group representations can be seen as a microcosm of this idea.
The object is its category of actions
Part of the reason why I like this metaphor is because it points towards a deeper understanding of how we should think about rings in the first place. This leads to, from my position, one of the most fascinating ideas in modern algebra:
You do not truly understand a ring $R$ by looking at $R$ itself. You understand $R$ by studying the category of all possible $R$-modules.
The “identity” of the ring is not its list of elements, but the totality of all the ways it can manifest—all the performances it can give.
A simple ring like a field is a “typecast actor.” Every performance (every vector space) looks basically the same (it’s determined by one number: the dimension).
A complex ring like $\mathbb{C}[x]$ or the ring of differential operators is a “virtuoso character actor.” It has an incredibly rich and varied body of work, and its true nature is only revealed by studying the patterns and structures across all of its possible roles (the classification of its modules).
So, why care?
Alright, so hopefully some of that makes sense. We started with the (in my opinion) unhelpful idea of a module as a “vector space over a ring” and saw that it’s much more illuminating to see it as a “ring action”—the way a ring shows up in the world. The story got deeper when we realized that, unlike groups, rings can show up in many fundamentally different ways, analogous to an actor playing a vast range of roles.
So where does this leave us? For me, at least, modules don’t feel as alien anymore. At the risk of being overly glib, the reason modules seem to show up everywhere is because they are everywhere. They are the language we use to talk about the manifestations of rings. When a property of a geometric space comes down to a property of a module, it’s because the ring of functions on that space is performing one of its many roles, and the module is that performance.
There’s further you could take this (structure of modules over PIDs, homological algebra, etc), but this seems like a good place to stop for now. Hopefully this was helpful!
- ^
In fact, modules literally generalize group representation theory!
- ^
Sometimes in a quite literal way. In quantum mechanics, the ring is an algebra of observables, and a specific physical system is defined by choosing a specific operator for the Hamiltonian, $H$. This is directly analogous to how we chose a specific matrix $A$ for $x$ to define our module, as in the example of $\mathbb{C}[x]$-modules we study later.
Yeah, I completely agree this is a good research direction! My only caveat is I don’t think this is a silver bullet in the same way capabilities benchmarks are (not sure if you’re arguing this, just explaining my position here). The inevitable problem with interpretability benchmarks (which to be clear, your paper appears to make a serious effort to address) is that you either:
Train the model in a realistic way—but then you don’t know if the model really learned the algorithm you expected it to
Train the model to force it to learn a particular algorithm—but then you have to worry about how realistic that training method, or the algorithm you forced it to learn, actually is
This doesn’t seem like an unsolvable problem to me, but it does mean that (unlike capabilities benchmarks) you can’t have the level of trust in your benchmarks that allow you to bypass traditional scientific hygiene. In other words, there are enough subtle strings attached here that you can’t just naively try to “make number go up” on InterpBench in the same way you can with MMLU.
I probably should emphasize I think trying to bring interpretability into contact with various “ground truth” settings seems like a really high value research direction, whether that be via modified training methods, toy models where possible algorithms are fairly simple (e.g. modular addition), etc. I just don’t think it changes my point about methodological standards.
Zach Furman’s Shortform
Understanding deep learning isn’t a leaderboard sport—handle with care.
Saliency maps, neuron dissection, sparse autoencoders—each surged on hype, then stalled[1] when follow‑up work showed the insight was mostly noise, easily spoofed, or valid only in cherry‑picked settings. That risks being negative progress: we spend cycles debunking ghosts instead of building cumulative understanding.
The root mismatch is methodological. Mainstream ML capabilities research enjoys a scientific luxury almost no other field gets: public, quantitative benchmarks that tie effort to ground truth. ImageNet accuracy, MMLU, SWE‑bench—one number silently kills bad ideas. With that safety net, you can iterate fast on weak statistics and still converge on something useful. Mechanistic interpretability has no scoreboard for “the network’s internals now make sense.” Implicitly inheriting benchmark‑reliant habits from mainstream ML therefore swaps a ruthless filter for a fog of self‑deception.
How easy is it to fool ourselves? Recall the “Could a neuroscientist understand a microprocessor?” study: standard neuroscience toolkits—ablation tests, tuning curves, dimensionality reduction—were applied to a 6502 chip whose ground truth is fully known. The analyses produced plausible‑looking stories that entirely missed how the processor works. Interpretability faces the same trap: shapely clusters or sharp heat‑maps can look profound until a stronger test dissolves them.
What methodological standard should replace the leaderboard? Reasonable researchers will disagree[2]. Borrowing from mature natural sciences like physics or neuroscience seems like a sensible default, but a proper discussion is beyond this note. The narrow claim is simpler:
Because no external benchmark will catch your mistakes, you must design your own guardrails. Methodology for understanding deep learning is an open problem, not a hand‑me‑down from capabilities work.
So, before shipping the next clever probe, pause and ask: Where could I be fooling myself, and what concrete test would reveal it? If you don’t have a clear answer, you may be sprinting without the safety net this methodology assumes—and that’s precisely when caution matters most.
- ^
This is perhaps a bit harsh—I think SAEs for instance still might hold some promise, and neuron-based analysis still has its place, but I think it’s fair to say the hype got quite ahead of itself.
- ^
Exactly how much methodological caution is warranted here will obviously be a point of contention. Everyone thinks the people going faster than them are reckless and the people going slower are needlessly worried. My point here is just to think actively about the question—don’t just blindly inherit standards from ML.
It’s not relevant to predictions about how well the learning algorithm generalises. And that’s the vastly more important factor for general capabilities.
Quite tangential to your point, but the problem with the universal approximation theorem is not just “it doesn’t address generalization” but that it doesn’t even fulfill its stated purpose: it doesn’t answer the question of why neural networks can space-efficiently approximate real-world functions, even with arbitrarily many training samples. The statement “given arbitrary resources, a neural network can approximate any function” is actually kind of trivial—it’s true not only of polynomials, sinusoids, etc., but even of a literal interpolated lookup table (if you have an astronomical space budget). It turns out the universal approximation theorem requires exponentially many neurons (in the input dimension) to work, far too many to be practical—in fact this is the same amount of resources a lookup table would cost. This is fine if you want to approximate a 2D function or something, but it goes nowhere toward explaining why, say, even a space-efficient MNIST classifier is possible. The interesting question is: why can neural networks efficiently approximate the functions we see in practice?
(It’s a bit out of scope to fully dig into this, but I think a more sensible answer is something in the direction of “well, anything you can do efficiently on a computer, you can do efficiently in a neural network”—i.e. you can always encode polynomial-size Boolean circuits into a polynomial-size neural network. Though there are some subtleties here that make this a little more complicated than that.)
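To give a flavor of the encoding step (a minimal sketch of my own; the particular weights are just one choice that works on $\{0,1\}$ inputs, and it glosses over the subtleties mentioned above): each Boolean gate becomes one affine map plus a ReLU, so a polynomial-size circuit translates gate-by-gate into a polynomial-size network.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0)

# With inputs restricted to {0, 1}, each gate is one affine map plus a ReLU:
AND = lambda a, b: relu(a + b - 1)        # 1 only when a = b = 1
OR  = lambda a, b: 1 - relu(1 - a - b)    # 0 only when a = b = 0
NOT = lambda a: 1 - a                     # purely affine

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "AND:", AND(a, b), "OR:", OR(a, b), "NOT a:", NOT(a))
# Wiring these together layer by layer turns any Boolean circuit into a ReLU network
# of comparable size (each gate costs O(1) neurons).
```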
After convergence, the samples should be viewed as drawn from the stationary distribution, and ideally they have low autocorrelation, so it doesn’t seem to make sense to treat them as a vector, since there should be many equivalent traces.
This is a very subtle point theoretically, so I’m glad you highlighted this. Max may be able to give you a better answer here, but I’ll try my best to attempt one myself.
I think you may be (understandably) confused about a key aspect of the approach. The analysis isn’t focused on autocorrelation within individual traces, but rather on correlations between different input traces evaluated on the same parameter samples from SGLD.

What this approach is actually doing, from a theoretical perspective, is estimating the expected correlation of per-datapoint losses over the posterior distribution of parameters. SGLD serves purely as a mechanism for sampling from this posterior (as it ordinarily does).
When examining correlations between the losses $L_i(w)$ and $L_j(w)$ of different inputs $x_i, x_j$ at identical parameter samples $w$, the method approximates the posterior expectation
$$\mathbb{E}_{w \sim p(w \mid \mathcal{D})}\!\left[ L_i(w)\, L_j(w) \right],$$
where $L_i(w)$ is the per-datapoint loss of input $x_i$ at parameter $w$.
Assuming that SGLD is taking unbiased IID samples from the posterior, the dot product of traces is an unbiased estimator of this correlation. The vectorization of traces is therefore an efficient mechanism for computing these expectations in parallel, and representing the correlation structure geometrically.
At the risk of being repetitive, at an intuitive level, this is designed to detect when different inputs respond similarly to the same parameter perturbations. When inputs $x_i$ and $x_j$ share functional circuitry within the model, they’ll likely show correlated loss patterns when that shared circuitry is disrupted at parameter $w$. When two trace vectors are orthogonal, that means the per-sample losses for $x_i$ and $x_j$ are uncorrelated across the posterior. The presence or absence of synchronization reveals functional similarities between inputs that might not be apparent through other means.
What all this means is that we do in fact want to use SGLD for plain old MCMC sampling—we are genuinely attempting to use random samples from the posterior rather than e.g. examining the temporal behavior of SGLD. Ideally, we want all the usual desiderata of MCMC sampling algorithms here, like convergence, unbiasedness, low autocorrelation, etc. You’re completely correct that if SGLD is properly converged and sampling ideally, it has fairly boring autocorrelation within a trace—but this is exactly what we want.
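In case the vectorized-trace picture is still abstract, here's a schematic sketch of my own with fake data (the shapes, the centering choice, and the synthetic “losses” are all stand-ins, not the actual method): given per-datapoint losses evaluated at the same posterior samples, dot products of the centered trace vectors are Monte Carlo estimates of the posterior covariances discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 1000, 4                          # posterior samples, number of inputs

# Stand-in for per-datapoint losses L_i(w_t); in the real method these come from
# evaluating each input's loss at the same SGLD samples w_1, ..., w_T.
losses = rng.normal(size=(n, T))
losses[1] = losses[0] + 0.1 * rng.normal(size=T)   # inputs 0 and 1 "share circuitry"

traces = losses - losses.mean(axis=1, keepdims=True)   # center each trace
cov = traces @ traces.T / T                            # Monte Carlo posterior covariance
print(np.round(cov, 2))   # large (0, 1) entry; other off-diagonal entries near zero
```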
Great questions, thank you!
Yes, this is correct, because (as I explain briefly in the “search problem” section) the loss function factors as a composition of the parameter-function map and the function-loss map, so by chain rule you’ll always get zero gradient in degenerate directions. (And if you’re near degenerate, you’ll get near-zero gradients, proportional to the degree of degeneracy.) So SGD will find it hard to get unstuck from degeneracies.
This is actually how I think degeneracies affect SGD, but it’s worth mentioning that this isn’t the only mechanism that could be possible. For instance, degeneracies also affect Bayesian learning (this is precisely the classical SLT story!), where there are no gradients at all. The reason is that (by the same chain rule logic) the loss function is “flatter” around degeneracies, creating an implicit prior by dedicating more of parameter space to more degenerate solutions. Both mechanisms push in the same direction, creating an inductive bias favoring more degenerate solutions.
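Here's a toy numerical check of the chain-rule point (my own construction; $f(x) = a \cdot \mathrm{relu}(bx)$ is just the simplest degenerate example I could think of): when $a = 0$ the function doesn't depend on $b$ at all, so the loss gradient in the $b$ direction is exactly zero.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0)
loss = lambda a, b, x, y: 0.5 * (a * relu(b * x) - y) ** 2

a, b, x, y, eps = 0.0, 1.3, 2.0, 1.0, 1e-5

# Finite-difference gradients of the loss at (a, b) = (0, 1.3):
dL_db = (loss(a, b + eps, x, y) - loss(a, b - eps, x, y)) / (2 * eps)
dL_da = (loss(a + eps, b, x, y) - loss(a - eps, b, x, y)) / (2 * eps)
print(dL_db)   # exactly 0: with a = 0, the function (and hence the loss) ignores b
print(dL_da)   # nonzero: a still changes the function
```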
This is basically correct, and you can make this precise. The intuition is that robustness to parameter perturbation corresponds to simplicity because many parameter configurations implement the same effective computation—the solution doesn’t depend on fine-tuning.
Singular learning theory measures “how degenerate” your loss function is using a quantity called the local learning coefficient, which you can view as a kind of “robustness to perturbation” measure. Section 3.1 of my paper explains some of the intuition here. In Bayesian learning, this is basically the unambiguously “correct” measure of (model) complexity, because it literally determines the generalization error / free energy to leading order (Main Formula II, Watanabe 2009).
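To give a flavor of the volume-scaling intuition behind the learning coefficient (a crude Monte Carlo comparison of my own; the two toy potentials are mine and this is not an actual LLC estimator): near a more degenerate minimum, a much larger fraction of nearby parameter space stays below any given loss threshold, and that fraction shrinks more slowly as the threshold tightens.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.uniform(-1, 1, size=(10**6, 2))          # random perturbations in a box

loss_ordinary = w[:, 0] ** 2 + w[:, 1] ** 2      # non-degenerate quadratic minimum
loss_degenerate = (w[:, 0] * w[:, 1]) ** 2       # zero along both axes: a degenerate valley

for eps in (1e-2, 1e-3, 1e-4):
    print(eps,
          np.mean(loss_ordinary < eps),          # shrinks like eps
          np.mean(loss_degenerate < eps))        # shrinks much more slowly (~sqrt(eps), up to logs)
```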