SLT is a thermodynamic theory of Bayesian learning, but not the thermodynamic theory of Bayesian learning
SLT provides a rigorous mathematical framework for Bayesian learning in a certain regime, but I argue that its practical applicability to real neural networks (even in a Bayesian-learning / high-level-modeling context) is limited by finite-size effects and high-dimensionality. The valuable empirical work in this space is better understood as “thermodynamic interpretations of ML” rather than as validations of SLT proper.
I’ve been having lots of conversations with people about SLT. I like SLT as a model for Bayesian learning a lot. At the same time I think that the assumptions of SLT are a model of the reality of learning (including Bayesian learning), in the same way that variants of the harmonic oscillator are a model of a physical system. There are some results that show that every physical system under some assumptions is a harmonic oscillator in a limit, but this limit doesn’t always hold.
I think the place where I am bothered by SLT rhetoric is where interesting experiments get done and interesting thermodynamic parameters get found, but instead of viewing this as results in “thermodynamic interpretation of ML”, there is an incorrect assumption that the observed phenomena are explained (fully or up to a controllable error) by an expansion around a singularity of a singular learning system.
I’m planning on writing more about this, but I’ll try to write out the key arguments to let people look at them and to see if I’m getting something wrong.
Essentially, I think it would be good to coordinate on a language for talking about Bayesian learning that doesn’t overindex on the singular learning limit, and instead uses physics terms (free energy, susceptibility, heat capacity) for the actual observed invariants. I think that in many ways SLT is moving in this direction, and I’m excited about what they have done.
First the things I agree with:
We are doing Bayesian learning (or a mild variant of “Boltzmann learning”, which allows rescaling the “size” of Bayesian updates by a scalar factor)
If we fix a neural network as a statistical system (its data distribution, architecture, loss) and take the number of data samples, n, to infinity, there is a limit in which the singular learning prediction is true (the sketch just below spells out the standard statement).
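For reference, here is a minimal sketch (in generic, hedged notation; nothing here is specific to any particular paper) of what “Boltzmann learning” and “the singular learning prediction” mean in the two points above. The Boltzmann variant just tempers the likelihood by an inverse temperature β, and the SLT prediction is Watanabe's asymptotic expansion of the resulting Bayesian free energy:

```latex
% Tempered ("Boltzmann") posterior over weights \theta, prior \varphi,
% empirical loss L_n on n samples, inverse temperature \beta (\beta = 1 is ordinary Bayes):
p_\beta(\theta \mid D_n) \;\propto\; \varphi(\theta)\, e^{-n \beta L_n(\theta)}

% Free energy and its singular-learning asymptotic as n \to \infty (written at \beta = 1),
% with learning coefficient \lambda (the RLCT) and multiplicity m:
F(n) \;=\; -\log \int \varphi(\theta)\, e^{-n L_n(\theta)}\, d\theta
      \;\approx\; n\, L_n(\theta^*) \;+\; \lambda \log n \;-\; (m-1)\log\log n \;+\; O_p(1)
```

The λ̂ quantities discussed later in the post are estimators of the log n coefficient in this expansion.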
So there is always a regime where n is sufficiently large that SLT gives an exact answer to Bayesian learning. Why am I objecting to a (mathematically true) fact?
Essentially, the two key issues are that
Neural networks are high-dimensional systems, and this makes the error in the SLT approximation potentially very large (even exponentially large in something like the number of parameters) at finite data.
In order for the exact Watanabe approximation to hold, we need to assume that for infinite data the singularity is exact. That is, if we have some predicted singular loss Lsing(θ) depending on a weight choice θ, then we can (at least with respect to the guarantees given by SLT theory) use the singularities of Lsing to get a prediction for the asymptotics only if Lsing(θ) = L(θ) exactly in the infinite-data limit. If there is any small approximation issue that relates to the architecture / data rather than to the number of samples n (e.g. the difference between the discrete Fourier transform and the continuous FT on the circle for modular addition, approximations of a polynomial by sigmoids in modular addition and other contexts, even bit-complexity issues, etc.), then we cannot rely on the SLT prediction, at least not with theoretical guarantees. One might a priori expect that the singular information from an approximate model Lsing of the loss would still be predictive of the singularity structure of the true loss L, but in fact this is false: being nontrivially singular at all is a measure-zero property of a (loss) function, so if there are any modeling assumptions or possible sources of error or noise, we expect the “true infinite limit” Watanabe prediction to coincide with that of a “nonsingular” loss (i.e., one with positive-definite Hessian at the limit). Sometimes there are small corrections (often called “gauge”) from the architecture, but these are small compared to the empirical singularity-like effects one observes from thermodynamic measurements.
I want to especially harp on the fact that, in the worst case, it takes an exponential number of samples, and therefore exponentially fine precision on the difference Lsing − L (more precisely, exponential in some power of the number of parameters), to mathematically guarantee that the SLT prediction is correct. Thus an argument shaped like “SLT is mathematically correct” is, for realistic models, true but boring.
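To make the measure-zero point above concrete, here is a tiny numerical sketch (my own toy example, not taken from any SLT reference): a loss with a genuinely degenerate (singular) minimum versus the same loss with an arbitrarily small generic perturbation. The perturbation immediately makes the Hessian at the minimum positive-definite, so the infinite-data Watanabe prediction jumps from the singular value (λ = 1/2 with multiplicity 2 for this loss, if I recall the standard computation) back to the regular value λ = d/2 = 1.

```python
import numpy as np

# Toy singular loss: L(a, b) = (a*b)^2 has a degenerate minimum along both axes
# (the Hessian at the origin vanishes identically), the kind of structure SLT keys on.
def L_sing(theta):
    a, b = theta
    return (a * b) ** 2

# The same loss with a tiny generic perturbation; eps stands in for any
# architecture/data mismatch between L_sing and the true infinite-data loss L.
def L_perturbed(theta, eps=1e-6):
    a, b = theta
    return (a * b) ** 2 + eps * (a ** 2 + b ** 2)

def hessian_at_origin(loss, h=1e-4):
    """Finite-difference Hessian of a 2-parameter loss at theta = 0."""
    H = np.zeros((2, 2))
    basis = np.eye(2) * h
    for i in range(2):
        for j in range(2):
            H[i, j] = (loss(basis[i] + basis[j]) - loss(basis[i] - basis[j])
                       - loss(-basis[i] + basis[j]) + loss(-basis[i] - basis[j])) / (4 * h ** 2)
    return H

print(np.linalg.eigvalsh(hessian_at_origin(L_sing)))       # ~[0, 0]: singular minimum
print(np.linalg.eigvalsh(hessian_at_origin(L_perturbed)))  # ~[2e-6, 2e-6]: positive-definite
```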
However, this pair of counterarguments isn't strong by itself. There are lots of cases in the mathematical modeling of complex systems where theory says that we might (in worst-case situations) need to wait exponentially long, get exponential errors, etc., but in practice finite time suffices. And I think that there are interesting systems where the modeling assumptions above are correct. It's just that you shouldn't assume this by default.
Thus it is meaningful to ask the following question:
In what learning problems, in what regimes, and at what scales can Bayesian learning be described by an SLT prediction? In other words, when can we find a function Lsing(θ) that is a “sufficiently good” approximation of the true infinite-sample loss L(θ) and whose singularities “meaningfully describe the thermodynamic behavior” of Bayesian learning?
I think this is an interesting question. One might object that this is hard to measure, since Bayesian learning is extremely hard to study in “realistic” cases. However, I'd argue that we have enough examples to start studying this. And again, my sense here is that SLT predictions fail already at first order (i.e., at predicting even the correct asymptotics for λ̂ up to O(1) rescaling), but they fail in an interesting way.
SLT failures in Bayesian learning
I’ll write more about this later. But let me give two examples where I think this is the case.
Grokking modular addition (with MSE loss)
In this paper, grokking in (MSE) modular addition is analyzed using a technique called “mean field theory” (an extension of more classical work on things like the neural tangent kernel, which drops the highly restrictive assumption that the system is in a “lazy learning” regime, i.e., that the solution is a small perturbation away from a “trivial vacuum”). This is a paper about Bayesian learning, and it gives exact predictions that are confirmed by experiment (one can empirically “approximate” Bayesian learning by something called Langevin SGD). The prediction in particular implies an expansion for the SLT term λ̂ (the “heat capacity”) and for the free energy (roughly, the stat-phys version of what is called “basin volume” in SLT).

It turns out that the free energy in the relevant regime is explicitly not controlled by any basin around a singularity, but is instead given to first order by a high-dimensionality phenomenon (similar to critical phenomena in thermodynamics). Specifically, the first-order approximation of the free energy is compatible with the system having some fixed number of degrees of freedom per neuron (here: per “row of the input weight matrix”). More precisely, the paper predicts (and experiment verifies) that to first order, the NN learns to randomly sample each weight row from some fixed distribution. (The resulting NN has approximately correct outputs by the central limit theorem; if one is interested in a more realistic context with a small number of neurons compared to p, higher terms in the mean-field expansion let you predict such regimes with higher and higher accuracy, but the leading free-energy term stays the same.) This leading term is, importantly, not even a rational number (as SLT would predict), but a numerical integral associated with the neuron distribution (not dominated by any one value or any small set of moments, except in certain limiting regimes). Note that (as with SLT) the mean-field prediction holds, in a certain idealization, for any sample size n, so long as n is less than something exponential in the number of neurons.
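For readers who have not seen how these quantities get measured: “Langevin SGD” here means sampling weights with stochastic gradient Langevin dynamics, and λ̂ is then read off from the average tempered loss of the samples. Below is a minimal sketch of that recipe as I understand it (in PyTorch); the function name, hyperparameters, and the Gaussian localization term are illustrative choices of mine, not anything taken from the paper above.

```python
import math
import torch

def sgld_lambda_hat(model, loss_fn, inputs, targets,
                    n_steps=2000, eps=1e-5, beta=None, gamma=100.0):
    """Rough sketch: sample weights near the trained solution w* with Langevin dynamics
    and return lambda_hat = n * beta * (E[L_n(w)] - L_n(w*))."""
    n = len(inputs)
    beta = beta if beta is not None else 1.0 / math.log(n)   # conventional inverse temperature
    w_star = [p.detach().clone() for p in model.parameters()]
    L_star = loss_fn(model(inputs), targets).item()

    sampled_losses = []
    for _ in range(n_steps):
        loss = loss_fn(model(inputs), targets)
        # Gaussian localization keeps the chain near w* (a *local* posterior average).
        local = sum(((p - p0) ** 2).sum() for p, p0 in zip(model.parameters(), w_star))
        potential = n * beta * loss + 0.5 * gamma * local
        model.zero_grad()
        potential.backward()
        with torch.no_grad():
            for p in model.parameters():
                p -= 0.5 * eps * p.grad                       # gradient step on the potential
                p += math.sqrt(eps) * torch.randn_like(p)     # Langevin noise
        sampled_losses.append(loss.item())

    return n * beta * (sum(sampled_losses) / len(sampled_losses) - L_star)
```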
LoRA for empirical “approximately singular” matrices from ML
For our second example, we look at a two-layer “deep(ish) linear network” that tries to model Win Wout = M (here we replace the ReLU with a trivial nonlinearity, take Win, Wout to be the trainable parameters, and take M to be the fixed “target”). There is an exact formula for the Bayesian learning prediction of this network in terms of the singular values of M (“singular” is a good term here :). The Bayesian learning dynamics are entirely controlled by the “entropy” or “volume” function Vol(s1, s2, ..., sd), which is roughly the volume of the space of pairs of matrices Win, Wout of bounded size whose product is M (the “bounded size” is a secret parameter here, related to the “number of training points” parameter n in SLT).
If the (minimum of the) input/output widths is larger than the “hidden layer” width, the function Vol(s1, ..., sd) has singularities, and SLT in this case makes a nontrivial prediction: namely, if some of the singular values vanish, say s_{k+1} = ... = s_d = 0 (we can assume WLOG that these are the last d−k values, since singular values are unordered), then the function Vol has a singularity. SLT then implies that if the “target” matrix M has lower-than-full rank (i.e., some of the s_i are exactly 0), then in the high-data limit n→∞ we have an exact asymptotic for the free energy F(n) (as a function of dataset size n).
Here, in order for the SLT asymptotic to hold, we want to assume that the last d−k singular values of M are exactly zero and that the data-size parameter n is much larger than the inverse square of the smallest nonzero singular value (the relevant scale is the “spectral gap”). If we want some kind of exact or asymptotic formula for the free energy that makes valid predictions for n without the spectral-gap assumption, or without the assumption that those singular values are exactly 0, we can simply plug the (easily computable) “true” singular values s1, …, sd into the (known) exact formula, or into a known expansion whose validity (error bound) we can mathematically verify, and see what we get.
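As a complement to plugging singular values into an exact formula, the toy model is small enough that one can also estimate the “exact Bayesian prediction” by brute force, with no singularity analysis at all. Here is a naive Monte Carlo sketch of the free energy under a standard Gaussian prior (my own illustrative setup; the estimator is very high-variance for large n, so this is only meant to make concrete what quantity the two predictions are competing to describe):

```python
import numpy as np

def free_energy_mc(M, hidden_dim, n, n_samples=200_000, seed=0):
    """Naive Monte Carlo estimate of F(n) = -log Z_n for the deep linear toy model,
    with Z_n = E_prior[exp(-n * ||Win @ Wout - M||_F^2)] and a standard Gaussian prior."""
    rng = np.random.default_rng(seed)
    d_out, d_in = M.shape
    Win = rng.standard_normal((n_samples, d_out, hidden_dim))
    Wout = rng.standard_normal((n_samples, hidden_dim, d_in))
    losses = ((Win @ Wout - M) ** 2).sum(axis=(1, 2))             # ||Win Wout - M||_F^2 per sample
    log_Z = np.logaddexp.reduce(-n * losses) - np.log(n_samples)  # stable log-mean-exp
    return -log_Z

# Example: a rank-1 "target" M inside a 3x3 problem with hidden width 2.
M = np.outer([1.0, 0.5, 0.0], [1.0, -1.0, 0.5])
for n in (10, 100, 1000):
    print(n, free_energy_mc(M, hidden_dim=2, n=n))
```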
Ultimately, this depends on what deep linear models one would want to model “in practice” in the context of ML. This might at first seem like a silly question: ML is explicitly nonlinear, and any deep linear model is just a toy.
However, this is not entirely true: in some cases, for a realistic neural net, one is interested in doing a “LoRA” decomposition. LoRA means different things in different contexts, but one standard setting is decomposing a weight matrix W = Wℓ that appears in some layer ℓ of the model into a product of two low-rank matrices W = WL WR. In such a context one would then train all the weights of the model together (and, typically, LoRA is used in a slightly modified context where the low-rank product WL WR supplements rather than replaces an intermediate layer). Nevertheless, we can always model “some part” of LoRA learning as learning an approximate factorization of a matrix W into a product WL WR (this corresponds to “freezing” all weight layers except those that define W and modeling the loss as an approximation loss on W; sketchy, but I'd guess a reasonable “directional” guess for asymptotics in a suitable regime).
Once we have reduced to this problem, we can write down the exact Bayesian prediction and the SLT prediction for various values of the “dataset size” n. Both predictions depend only on the singular values s1, ..., sd of the matrix W (this can be seen using symmetry). Now, empirically, it turns out that one can approximate the “bulk” of these singular values relatively well, up to a constant, by a power law s_i ∼ i^(−α) for some exponent α on the order of 0.3–0.5 (see here for example). Here the “bulk” captures most of the singular values, and it is in some sense “pretty singular”: for the majority of indices i, the value i^(−α) is pretty small, i.e., “close to a singularity”.
Thus it makes sense to extrapolate this regime and ask whether, for a matrix with singular values following a power law s_i ∼ i^(−α), there is some regime of values of n where an SLT prediction (assuming that all s_i below some cutoff are zero, and keeping only the singular terms coming from these) actually describes the true value of the free energy to first order. In other words, we replace the “real” model, which approximately solves WL WR = W, by an idealized “singular” model, which approximately solves WL WR = Wsing, with the singular values of Wsing given by s′_i = s_i for i ≤ i0 and s′_i = 0 for i > i0.
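To see why this idealization is strained, here is a tiny numerical illustration (my own numbers, with the exponent taken from the range quoted above): the decay s_i ∼ i^(−α) is so slow that no choice of cutoff i0 produces anything resembling a gap, and the discarded tail carries a large share of the spectral mass no matter where you cut.

```python
import numpy as np

d, alpha = 4096, 0.4                        # width and exponent in the quoted range (illustrative)
s = np.arange(1, d + 1) ** (-alpha)         # power-law "bulk" of the singular values

for i0 in (16, 64, 256, 1024):
    gap_ratio = s[i0] / s[i0 - 1]           # ratio across the proposed cutoff (~1 means no gap)
    tail_mass = (s[i0:] ** 2).sum() / (s ** 2).sum()
    print(f"i0={i0:5d}  s_(i0+1)/s_(i0)={gap_ratio:.4f}  discarded spectral mass={tail_mass:.2f}")
```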
Unsurprisingly, the answer is a strong “no”: both the assumption that the “approximately singular” s_i for i > i0 are essentially zero and the assumption that the large s_i for i ≤ i0 are large enough to impose a reasonable gap fail completely, and even the power law in the asymptotics of the SLT prediction fails to capture the power law of the true Bayesian learning prediction in this case. Here again, note that the key issue is the high-dimensionality of the system. (Also note that I'm blackboxing a lot of math here: happy to discuss it more in comments etc., and I'm planning to write out a more careful version of this later.)
Discussion
One can try to salvage an SLT prediction in the LoRA example (which I think is particularly damning) in a few ways. Maybe:
The important contribution comes from the “non-power law” part of the tail
The assumption that LoRA rank reduction is a good model of Bayesian learning more generally is flawed
etc.
However, I think that together with the known results about modular addition, these failure modes show that, in a meaningful way, realistic models do not have a good “singular model” with strong predictive power, even if we are only trying to make predictions about Bayesian learning. The regime where the SLT prediction succeeds certainly exists (it is a rigorous mathematical statement, after all), but it requires too large a sample number n and imposes too strong a regularity assumption on the “true” infinite-data loss landscape L(θ) (essentially, an assumption that “small” and “large” phenomena are cleanly separated by a large gap) for it to give even approximately valid predictions in regimes we care about. Again, the brunt of my intuition here is that this failure is related to the fact that the predictions of SLT assume the number of parameters is fixed (i.e., O(1)) as the number of samples goes to infinity; in reality, the number of parameters is quite large, and the regime where SLT predictions hold exactly is in some sense exponential (or at least very large) in the parameter count, and thus never attained (and essentially uninteresting).
Having said this, I want to point out that SLT has a number of really good results that I am deeply excited about. On the theoretical side, there are good results for toy models and strongly asymptotic (but still interesting) regimes where the SLT approximation is exact, and these may directionally give good intuition about realistic models (in the same way that the harmonic oscillator gives very deep intuition about quantum and statistical mechanics, despite not all systems being reducible to it; note in particular that SLT can be understood as a “vastly generalized harmonic oscillator”, the essential property being the existence of a strongly position-localized, though singular, semiclassical approximation; ignore this parenthetical if these words are meaningless to you).
On the empirical side, people who label their work as “singular learning theory” have really good results where the measurements being made are strictly thermodynamic, and which continue to work in “explicitly non-SLT” contexts such as mean-field theory, which captures high-dimensionality and has totally different asymptotics from what is predicted in the SLT regime. (Note that mean-field is, of course, itself an approximation / toy model!)
For example: Timaeus (the organization that “does SLT”) has an elegant result by Garrett Baker et al. that measures the effects of changes in training distribution on susceptibilities; Nina Panickssery and I have a paper on the λ̂ (“lambda-hat”) estimator (heat capacity) tracking the description length of the algorithm learned by modular addition in generalizing vs. memorizing models; there is work on saddles; Timaeus has produced work on the relationship between attention heads and a certain thermodynamic quantity. An upcoming paper by Timaeus that I'm very excited about looks at a susceptibility-like (“conjugate variable”) metric on inputs that generalizes influence functions.
All of these are (I think) very good results directionally linking statistical-mechanics invariants with interesting learning behaviors. But none of them are “properly SLT results”: they simply assume standard links between information theory, thermodynamics, and Bayesian learning (some of which were developed in the learning context by Sumio Watanabe, who first discovered them in the context of his singular models). None of them assume that one is in a regime where the SLT “approximation from singularities” even approximately holds. And I suspect that if one were to dig even a little in most of these cases (e.g. look at larger-scale dependence on temperature, relationships between single-neuron statistics, etc.), one would see that we are in a regime where, in a strong sense, any “singular” toy model would fail to predict the behavior of interest, in any regime and with any number of terms of the “singular” asymptotic expansion. (Essentially, this belief is due to the fact that the λ̂ estimator does not even approximately asymptote in the way SLT predicts in the regimes of interest, and due to an intuition, which I hope to write down later, that the SLT approach fails to track inherently high-dimensional phenomena like grokking.)
I would be happy to be wrong about the failure of the SLT regime, and I retain hope that with sufficient processing/ renormalization, there is some sense in which the thermodynamic behavior of learning is extensively related to singularities in some suitably re-interpreted and lower-dimensional set of “relevant” variables which are not the weights.
Also I am excited about the work that Timaeus is doing and think that the “thermodynamics-aware” approach to learning that they try to follow is significantly underexplored. I just want to point out that there are very deep directions here that decouple from the assumptions of SLT and instead lean into high-dimensionality of the loss landscape, and which produce useful work that can be missed if one’s model of Bayesian learning theory is “everything is singularities”.