Daniel Murfet
SLT is not about a limit as the number of data points $n$ goes to infinity. Or at least, it is about such a limit only insofar as talking about the mean of a random variable is about “studying the limit as the number of data points goes to infinity”, which is not how one would normally talk about such things. In particular, I think when people use this phrasing they are (deliberately or not) making a comparison to “infinite width limits”, and I do not think this is a correct analogy.
So there are two things here. One is the relation between an empirical loss $L_n$ and the population loss $L$, which is the expectation of $L_n$ with respect to the dataset. The mean of a random variable (in this case a function) is an idealised quantity never encountered in practice, for sure. However, to describe a theory that is organised around means of random variables as being “about” infinite limits seems at odds with how most people would think about statistics.
Most of the nontrivial mathematical content of SLT is exactly about accounting for the difference between $L$ and the actual $L_n$ you encounter in the real world. As you say, generalisation has this form: the conceptual content of the theory is a surprising fact, that the geometry of the mean object governs the generalisation behaviour at finite $n$. These are not exotic effects that are only visible at enormous $n$, as many of the examples in Watanabe’s textbooks will show you.
The other is the role of $n$ in the asymptotic expansions that characterise some of the central theorems in SLT. Here it is true that one would expect these analyses to be more correct as $n$ becomes larger, and at any value of $n$ one cannot a priori rule out that “lower order terms” in fact contribute more than higher order terms. But this is not a phenomenon or situation unique to SLT, and indeed it has the same shape as applications of Laplace approximations everywhere (and is a situation also commonly encountered in mathematical physics). At the end of the day such asymptotic expansions are commonly used across applications of mathematics to real world phenomena, they are highly successful, and theory alone cannot tell you when they are valid: you have to actually do experiments.
But it is strange to rule out in principle the application of such techniques to study phenomena at finite $n$, as though this was a theory whose domain of applicability is restricted to enormous $n$. There are separate questions one can ask about effective theories etc, and finite $n$ phenomena that are not accounted for by the asymptotic expansions. I don’t mean to dismiss any of that as unimportant (and indeed we think about that kind of thing and continue to work on it). However, I want to push back against some oversimplified characterisation of SLT as a “theory about the infinite limit”.
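As a small illustration of the point about asymptotic expansions at finite values of the large parameter, here is a minimal sketch using Stirling's formula, which is the leading-order Laplace approximation to the integral representation of the factorial. It is not taken from SLT itself; the function names `stirling` and `relative_error` are chosen here purely for illustration. The only way to find out how good the approximation is at a given finite value is to compute it and compare, which is what the loop does.

```python
import math


def stirling(n: int) -> float:
    """Leading-order Laplace approximation to n! = integral over t > 0 of exp(n*log(t) - t)."""
    return math.sqrt(2 * math.pi * n) * (n / math.e) ** n


def relative_error(n: int) -> float:
    """Relative error of the leading-order approximation against the exact factorial."""
    exact = math.factorial(n)
    return abs(stirling(n) - exact) / exact


if __name__ == "__main__":
    # The expansion is "asymptotic in n", yet we can simply measure how it does at small n.
    for n in (1, 2, 5, 10, 50):
        print(f"n={n:3d}  relative error={relative_error(n):.5f}")
```

The error shrinks as the parameter grows, but the approximation is already usable at small values. Whether the analogous statement holds for a given SLT expansion in a given model is an empirical question, as the comment says.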
An Alignment Journal: Features and policies
Highly recommended, this was one of the (many) things I learned to think more clearly about from interacting with Cole at the focus period on AI safety we had in Sydney last year. I think this is pretty subtle and interesting and I used to find Sterkenburg’s position convincing.
An Alignment Journal: Coming Soon
Certainly simpler to police if you have a clear rule for in vs out
We use a mix of direct instruction, lots of online resources that we manage ourselves, and 1-on-1 tutors via Zoom through the (excellent) startup Modulo. I spent a large amount of time in the first 6 months to 1 year when we started (back during the pandemic) establishing norms and routines around scheduling and patterns that I hoped would eventually lead to him becoming very self-directed. Which did in fact work. That was intensive, but now in the steady state the time cost is low.
I’m basically spending no time on preparation per se, but there is a time cost to supervision. We both work full-time and take turns managing him during the day (he’s 9), which means making sure he’s making it to his online classes and paying attention to his schedule, taking him outdoors for visits to museums etc. Most of the time he’s working on projects that he’s passionate about and doesn’t need me except when he gets stuck. He spends a lot of time building levels (for puzzle games or shooters, particularly) and teaching himself tools using YouTube videos and a lot of GPT/Claude.
We know some home-schooling kids with pretty fine-grained schedules; ours is more like a few scheduled things (e.g. online classes) and then big blocks of time where we trust him to do whatever he’s interested in that day.
I do directly instruct him in math and coding.
Right, we homeschool our son because he seems more alive this way
This is the first time I wrote something on LW that I consider to be serious, in that it explored genuinely new ideas in technical depth. I’m pretty happy with how it turned out.
I write a lot of hand-written notes that, years later, become papers. People who are around me know about this habit. This post started as such a hand-written note that I put together in a few hours, and would have likely stayed that way if not for the outlet of LW. The paper this became is “Programs as singularities” (PAS). The treatment there is much better than the (elementary but somewhat gross) calculations here, but it is also 90 pages long and came out more than a year later.
I think the idea of structural Bayesianism being hinted at here is correct and important, and is conceptually the foundation for how we think about interpretability at Timaeus. Its role in providing foundations for talking about the structure of agents is just starting to become visible: Dalcy Ku has some nice recent shortform about their work, and Timaeus will have work on SLT in the setting of RL coming out soon, as well as more throughout 2026.
Was it worth making this post, vs just waiting to share the ideas in the paper? I’m not sure. Plausibly some people saw it here who wouldn’t otherwise have engaged with the material (PAS is probably a bit intimidating). This post interprets that material more in an alignment setting and draws connections e.g. to RL that we didn’t do in the paper. I think posting works-in-progress like this runs a risk of incentivising flag planting and rewarding people psychologically for half-finished things (which then never get finished, because “that’s done” and nobody has the incentive to do it properly). I thought at the time that this material was “weird” enough that this risk was marginal, as it has turned out to be.
I notice I haven’t done something like this again since March 2024, however.
Love the shoutout to Thom :)
Right on both counts!
There’s a certain point where commutative algebra outgrows arguments that are phrased purely in terms of ideals (e.g. at some point in Matsumura the proofs stop being about ideals and elements and start being about long exact sequences and Ext, Tor). Once you get to that point, and even further to modern commutative algebra which is often about derived categories (I spent some years embedded in this community), I find that I’m essentially using a transplanted intuition from that “old world” but now phrased in terms of diagrams in derived categories.
E.g. a lot of Atiyah and Macdonald style arguments just reappear as e.g. arguments about how to use the residue field to construct bounded complexes of finitely generated modules in the derived category of a local ring. Reconstructing that intuition in the derived category is part of making sense of the otherwise gun-metal machinery of homological algebra. Ultimately I don’t see it as different, but the “externalised” view is the one that plugs into homological algebra and therefore, ultimately, wins.
(Edit: saw Simon’s reply after writing this, yeah agree!)
Yeah it’s a nice metaphor. And just as the most important thing in a play is who dies and how, so too we can consider any element $m \in M$ as a module homomorphism $R \to M$, $r \mapsto rm$, and consider the kernel, which is called the annihilator $\operatorname{Ann}(m)$ (great name). Then $R \to M$ factors as $R \to R/\operatorname{Ann}(m) \to M$ where the second map is injective, and so in some sense $M$ is “made up” of all sorts of quotients $R/I$ where $I$ varies over annihilators of elements.
There was a period where the structure of rings was studied more through the theory of ideals (historically this was in turn motivated by the idea of an “ideal” number) but through ideas like the above you can see the theory of modules as a kind of “externalisation” of this structure which in various ways makes it easier to think about. One manifestation of this I fell in love with (actually this was my entrypoint into all this since my honours supervisor was an old-school ring theorist and gave me Stenstrom to read) is in torsion theory.
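To make the annihilator picture concrete, here is a minimal sketch in the simplest case I can think of, chosen for illustration and not taken from the discussion above: the ring is the integers and the module is the integers mod 12. Each element of the module gives a map from the integers into the module by multiplication, its kernel is generated by a single positive integer d, and the induced map from the integers mod d into the module is injective. The function names are hypothetical.

```python
from math import gcd


def annihilator_generator(m: int, n: int) -> int:
    """Smallest d > 0 with d*m = 0 in Z/nZ, so that Ann(m) = dZ."""
    return next(d for d in range(1, n + 1) if (d * m) % n == 0)


def induced_map_is_injective(m: int, n: int) -> bool:
    """Check that Z/dZ -> Z/nZ, r -> r*m, is injective, where dZ = Ann(m)."""
    d = annihilator_generator(m, n)
    # r ranges over representatives of the quotient Z/dZ.
    images = {(r * m) % n for r in range(d)}
    return len(images) == d


if __name__ == "__main__":
    n = 12
    for m in range(n):
        d = annihilator_generator(m, n)
        print(f"m={m:2d}  Ann(m)={d}Z  Z/{d}Z embeds: {induced_map_is_injective(m, n)}")
```

In this example the quotients that appear are the integers mod d for d dividing 12, which is the sense in which the module is assembled from quotients of the ring by annihilators of its elements.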
One of my son’s most vivid memories of the last few years (and which he talks about pretty often) is playing laser tag at Wytham Abbey, a cultural practice I believe instituted by John and which was awesome, so there is a literal five-year-old (well seven-year-old at the time) who endorses this message!
Makes sense to me, thanks for the clarifications.
I found working through the details of this very informative. For what it’s worth, I’ll share here a comment I made internally at Timaeus about it, which is that in some ways this factorisation into and reminds me of the factorisation into the map from a model to its capability vector (this being the analogue of ) and the map from capability vectors to downstream metrics (this being the analogue of ) in Ruan et al’s observational scaling laws paper.
In your case the output metrics have an interesting twist, in that you don’t want to just predict performance but also in some sense variations of performance within a certain class (by e.g. varying the prompt), so it’s some kind of “stable” latent space of capabilities that you’re constructing.
Anyway, factoring the prediction of downstream performance/capabilities through some kind of latent space object in your case, or latent spaces of capabilities in Ruan et al’s case, seems like a principled way of thinking about the kind of object we want to put at the center of interpretability.
As an entertaining aside: as an algebraic geometer the proliferation of ’s i.e. “interpretability objects” between models and downstream performance metrics reminds me of the proliferation of cohomology theories and the search for “motives” to unify them. That is basically interpretability for schemes!
I is evaluated on utility for improving time-efficiency and accuracy in solving downstream tasks
There seems to be a gap between this informal description and your pseudo-code, since in the pseudo-code the parameters only parametrise the R&D agent . On the other hand is distinct and presumed to be not changing. At first reasoning from the pseudo-code I had the objection that the execution agent can’t be completely static, because it somehow has to make use of whatever clever interpretability outputs the R&D agent comes up with (e.g. SAEs don’t use themselves to solve OOD detection or whatever). Then I wondered if you wanted to bound the complexity of somewhere. Then I looked back and saw the formula which seems to cleverly bypass this by having the R&D agent have to do both steps but factoring its representation of .
However this does seem different from the pseudo-code. If this is indeed different, which one do you intend?
Edit: no matter, I should just read more closely clearly takes as input so I think I’m not confused. I’ll leave this comment here as a monument to premature question-asking.
Later edit: ok no I’m still confused. It seems doesn’t get used in your inner loop unless it is in fact (which in the pseudo-code means just a part of what was called in the preceding text). That is, when we update we update for the next round. In which case things fit with your original formula but having essentially factored into two pieces ( on the outside, on the inside) you are only allowing the inside piece to vary over the course of this process. So I think my original question still stands.
So to check the intuition here: we factor the interpretability algorithm into two pieces. The first piece never sees tasks and has to output some representation of the model . The second piece never sees the model and has to, given the representation and some prediction task for the original model perform well across a sufficiently broad range of such tasks. It is penalised for computation time in this second piece. So overall the loss is supposed to motivate
Discovering the capabilities of the model as operationalised by its performance on tasks, and also how that performance is affected by variations of those tasks (e.g. modifying the prompt for your Shapley values example, and for your elicitation example).
Representing those capabilities in a way that amortises the computational cost of mapping a given task onto this space of capabilities in order to make the above predictions (the computation time penalty in the second part).
This is plausible for the same reason that the original model can have good general performance: there are general underlying skills or capabilities that can be assembled to perform well on a wide range of tasks, and if you can discover those capabilities and their structure you should be able to generalise to predict other task performance and how it varies.
Indirectly there is a kind of optimisation pressure on the complexity of just because you’re asking this to be broadly useful (for a computationally penalised ) for prediction on many tasks, so by bounding the generalisation error you’re likely to bound the complexity of that representation.
I’m on board with that, but I think it is possible that some might agree this is a path towards automated research of something but not that the something is interpretability. After all, your need not be interpretable in any straightforward way. So implicitly the space of ’s you are searching over is constrained to something intrinsically reasonably interpretable?
Since later you say “human-led interpretability absorbing the scientific insights offered by I*” I guess not, and your point is that there are many safety-relevant applications of I*(M) even if it is not very human comprehensible.
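To check my reading of the two-piece factorisation in this comment, here is a toy sketch. Everything in it is an assumption made for illustration and none of it comes from the post's pseudo-code: the "model" is a function on real numbers, a "task" is an input on which we want to predict the model's output, the representation is a pair of numbers, and the compute penalty is a fixed cost per prediction. The names `represent`, `predict` and `factored_loss` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Model = Callable[[float], float]  # toy stand-in for the model being interpreted


@dataclass(frozen=True)
class Representation:
    """Toy stand-in for the output of the first piece."""
    slope: float
    intercept: float


def represent(model: Model) -> Representation:
    """First piece: sees the model, never sees the tasks."""
    intercept = model(0.0)
    slope = model(1.0) - intercept
    return Representation(slope, intercept)


def predict(rep: Representation, task: float) -> Tuple[float, int]:
    """Second piece: sees the representation and a task, never the model.

    Returns the prediction and a compute cost (here a constant, as a toy).
    """
    return rep.slope * task + rep.intercept, 1


def factored_loss(model: Model, tasks: Sequence[float], compute_penalty: float = 0.01) -> float:
    """Mean prediction error across tasks plus a penalty on second-piece compute."""
    rep = represent(model)  # computed once, amortised over all tasks
    total = 0.0
    for task in tasks:
        prediction, cost = predict(rep, task)
        total += (prediction - model(task)) ** 2 + compute_penalty * cost
    return total / len(tasks)
```

For a model the representation captures exactly (a linear function here), the loss reduces to the compute penalty. For a model it does not capture, the prediction error dominates, and that error is the pressure towards representations that generalise across tasks.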
Wu et al?
There’s plenty, including a line of work by Carina Curto, Kathryn Hess and others that is taken seriously by a number of mathematically inclined neuroscience people (Tom Burns if he’s reading can comment further). As far as I know this kind of work is the closest to breaking through into the mainstream. At some level you can think of homology as a natural way of preserving information in noisy systems, for reasons similar to why (co)homology of tori was a useful way for Kitaev to formulate his surface code. Whether or not real brains/NNs have some emergent computation that makes use of this is a separate question; I’m not aware of really compelling evidence.
There is more speculative but definitely interesting work by Matilde Marcolli. I believe Manin has thought about this (because he’s thought about everything) and if you have twenty years to acquire the prerequisites (gamma spaces!) you can gaze into deep pools by reading that too.
I’m ashamed to say I don’t remember. That was the highlight. I think I have some notes on the conversation somewhere and I’ll try to remember to post here if I ever find it.
I can spell out the content of his Koan a little, if it wasn’t clear. It’s probably more like: look for things that are (not there). If you spend enough time in a particular landscape of ideas, you can (if you’re quiet and pay attention and aren’t busy jumping on bandwagons) get an idea of a hole, which you’re able to walk around but can’t directly see. In this way new ideas appear as something like residues from circumnavigating these holes. It’s my understanding that Khovanov homology was discovered like that, and this is not unusual in mathematics.
By the way, that’s partly why I think the prospect of AIs being creative mathematicians in the short term should not be discounted; if you see all the things you see all the holes.
I don’t think SLT explains why SGD on overparametrised nets generalises. I actually think “overparametrised” is a kind of classical term that we shouldn’t be using anymore, but anyway, SLT does provide a mathematical framework in which Bayesian learning with very large models need not generalise poorly, which would have been a very useful prior for generations of theorists thinking about deep learning to have (and if they had, then many years of confusion might have been avoided imo). However, as Lucius pointed out in his comment, Bayesian learning is not SGD and even if that gap is bridged, just because generalisation is possible doesn’t mean you have a sufficient explanation of why it is actually happening.
I wrote about this at some length in this old comment which you might find useful.
Having said that, I think that in time we’ll see the gap between Bayesian learning and SGD is not as profound as it seems right now. While some of the ways they could be directly related are not true, I think it will turn out to be true that comparisons of probabilities of regions of parameter space according to the Bayesian posterior do tend to govern the statistics of SGD trajectories to a significant degree. At that point a lot of the qualitative conclusions one might draw from the basic picture of SLT will just be good descriptions of what SGD is up to; but that work remains in the future.