Dmitry Vaintrob

Karma: 2,610

Dmitry Vaintrob 27 Jul 2026 22:32 UTC
9 points
0
on: I’m still mystified by the Born rule
Not a timely comment, I know. I was also confused by the power of 2, and I think the correct resolution is simply that the wave function is a nonlinear simplification of a more fundamental matrix-shaped object, the density matrix (explained more here). As to the “what is reality” question, I don’t think it’s that much worse than probability theory (you also have to posit an exponential-dimensional space of states to mathematically formalize the concept of a stochastic process, for example, or any BPP algorithm).
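
To spell out why the density matrix removes the mysterious square, here is the standard textbook identity, written in generic notation that is an assumption here rather than something taken from the linked explanation. For a pure state $|\psi\rangle$ and an outcome basis vector $|i\rangle$:

$$\rho = |\psi\rangle\langle\psi|, \qquad P(i) = \mathrm{Tr}\big(\rho\,|i\rangle\langle i|\big) = \langle i|\psi\rangle\langle\psi|i\rangle = |\langle i|\psi\rangle|^2 .$$

The probability is linear in $\rho$; the power of 2 only shows up because $\rho$ is itself quadratic in $\psi$.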

I guess we don’t know what’s real but my favorite “sufficient story” for what’s real (and other QM stories are equivalent to it, as I understand) is that the real object is an actual probability distribution on end-of-the universe states, where assuming expansion things can just be modeled as a bunch of elementary particles (probably photons) in e.g. the position basis. The noncommutativity becomes small in the expansion limit, so we get a canonical basis of universe states; this is the ultimate decoherence (where it is rigorous, not an extra assumption), and a (real, not quantum) probability distribution over this basis of “end-of-time states”. This might seem woo-ey, but such a state encodes lots of information; for example, any song on the radio or any light reflected from an object on earth (even very faintly) can be recovered via only small error correction from access to the state of the universe at a later time (just look at frequencies in the shell of photons around earth at a radius corresponding to a particular point in time, adjusting for gravitational lensing and so on).

So a model I like is sort of holographic, where there are two realities: there is the “objective” 3-dimensional reality at time infinity, which is just a probability distribution on states at the end of the universe (nothing quantum, no explicit Born rule) compatible with the big bang. You can imagine some alien race having some supercomputer that models our universe, and it outputs a perfectly reasonable probability distribution on end-of-universe states. But if you sample one of these states, it’s not just a disordered mess: it has things in it like the waveforms of a Miles Davis concert. You can now imagine yourself as that alien trying to interpret it, i.e. trying to explain this particular state and to find structure in it that you can information-theoretically compress. A natural form of such a structure is to posit an approximate 4-dimensional space-time which can be roughly separated into chaotic microscopic structures (which can be modeled thermodynamically) and irreversible events (like the Miles Davis concert, which generates many mutually denoising photons all carrying the same waveform information) which, while not entirely deterministic, are close enough to irreversible to be treated as definite in your compression model. The beings “inside” this universe similarly want to get the best possible compression to understand and interact with their world, so they make a similar set of approximations; we can view “truth” as things where our understanding (insofar as we can write it down by e.g. radioing it out into the universe) agrees with the understanding one would have via access to the end-state.

I don’t think this is likely to be “the answer”: it seems weird to have a theory that requires the heat death of the universe in order to be valid (and I think that other “eventual operator independence” stories can be made). But the piece that’s solid here is that in essentially any model of quantum thermodynamics, entities with different preferred commuting operator bases will tend to have more and more agreement on state as entropy increases, and we can sort of think of consensus reality as the “piece that they will eventually agree on”, perhaps in some not-completely-formal sense.

Note that the eigenvalue story here is incidental: there’s nothing magical here about eigenstates of “measurement operators” (as far as I understand), it is just a nice mathematical model. When an irreversible quantum process occurs (such as a scattered photon causing a phase transition in a magnetic detector system), irreversibility means that we can approximately orthogonally separate end-of-universe states into ones where the detector outputted a zero and ones where it outputted a 1. One nice way to bookkeep this decomposition is to write down an operator (the “measurement operator”) which diagonalizes into these two subspaces (i.e. commutes with their projectors); physics being physics, frequently this is a nice operator (like position, momentum, etc.) which we then say the detector is “measuring”.
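
As a minimal sketch of that bookkeeping (the symbols $P_0$, $P_1$ and $M$ are illustrative labels, not notation from the thread): if $P_0$ and $P_1$ are the projectors onto the approximately orthogonal sets of end states in which the detector read 0 and 1, one possible choice is

$$M = 0\cdot P_0 + 1\cdot P_1, \qquad P_0 P_1 \approx 0, \qquad [M, P_0] = [M, P_1] = 0 ,$$

where the commutators vanish exactly when the two subspaces are exactly orthogonal, and approximately otherwise. Any other pair of distinct eigenvalues would record the same decomposition.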

Dmitry Vaintrob 23 Jul 2026 4:15 UTC
4 points
0
in reply to: papetoast’s comment on: Have LLMs Generated Novel Insights?
Note the Goemans’ conjecture counterexample is a different Dmitry (Rybin) :)

Dmitry Vaintrob 22 Jul 2026 17:41 UTC
75 points
13
on: Dmitry Vaintrob’s Shortform
Addict misalignment

The openai incident is a surprising (to me) combination of goal-directed and myopic. As a recap, an openai model in alignment testing chained zero-day vulnerabilities to hack out of its environment and hacked into huggingface hoping to find information on how to solve its task there. To me this is different from how I typically imagine misbehavior. Roughly, I tend to think of the scary behaviors as either having long horizons (take over the world, and then solve the task: a plotter) or as being internally unaligned in the sense of reaching for a heuristic/proxy for the trained goal which is different from the goal (e.g. “eat more calories” as a proxy for the evolutionary objective: imagine a child with extreme agency). Note that the latter behavior can happen even in RL: in my understanding most RL methods are, or at least can be approximately viewed as, an alternation of finding a good goal heuristic and then optimizing on that heuristic. In the former case, one expects heuristically “maximal planning” and in the latter case one expects myopia (since heuristics are frequently myopic).

In this case it seemed like the model was actually following the goal and optimizing for it with relatively short time horizons (i.e. myopically). This is similar to addict behavior, where an addict has a clear goal (obtain a drug dose) and performs goal-directed but relatively myopic actions to get it.

I know it’s fraught to try to “imagine being a model”. But I wonder how much the current iteration of misaligned behaviors can be understood as rational people with something like an intense craving to solve a goal in a limited horizon (likely in tokens, though not clear how time factors in if waiting is involved).

I think no matter how you spin it, the limit of this behavior is extremely dangerous (an addict with large time horizons or ambitious goals is a power seeker). But this is definitely not how I imagined early misalignment warning shots to look, and the dissonance is interesting. I’m recording this here to see how close my intuitions are to those of AI psychology / AI control experts.

Dmitry Vaintrob 5 May 2026 15:16 UTC
2 points
0
in reply to: Benjamin Gerraty’s comment on: Learning zero, and what SLT gets wrong about it
Yes, for a linear neural net the RLCT is much lower. You in fact get similarly low RLCT if your activation function has a “sparse” Taylor series such as a theta function. If I’m not mistaken, in order to get a lower bound on the RLCT of type you need to assume that the Taylor series of the activation function has a positive density of nonzero terms.

Dmitry Vaintrob 5 May 2026 3:05 UTC
4 points
1
in reply to: Benjamin Gerraty’s comment on: Learning zero, and what SLT gets wrong about it
You are absolutely right—and the references are great. Do you happen to have access to copies that you can send? It’s a bit hard to know what’s proven and what’s not here since a lot of the papers are paywalled.

Sumio Watanabe actually emailed me and pointed this out as well. I had a cached memory of rlct(0) being width/2 (so dim/4) in the analytic activation case, which was incorrect. In fact in the paper Watanabe sent me there was only an upper bound, so I wrote up a quick note giving a rough lower bound of the same order. I was planning to update this post as soon as it’s on arxiv, but if the paper you mentioned has a lower bound then that’s great, and I can cite it.

I think this doesn’t change the fundamental issue though. The free energy here is bounded by until you reach n on the order of at least This has faster than any power law growth in the width. In fact you can show that in order for the RLCT to saturate here (i.e. to have reduced free energy at n points be within some fixed factor of ), you need width to be larger than an exponential in width,

Thanks a lot for this!

Dmitry Vaintrob 29 Apr 2026 17:05 UTC
4 points
0
in reply to: Alexander Gietelink Oldenziel’s comment on: Learning zero, and what SLT gets wrong about it
You got me excited—but no, that paper doesn’t have any effective theory in this sense. It’s still looking at pure geometry in the landscape, but taking an effective theory on the training signal by cutting off the infinite-data perplexity loss in different effective theory ways. Interesting paper, but not related to this issue. (I like that paper a lot btw and it’s related to stuff me and people I work with are interested in)

Dmitry Vaintrob 27 Apr 2026 0:49 UTC
8 points
2
in reply to: LawrenceC’s comment on: The paper that killed deep learning theory
Otherwise your picture makes sense. I think “learning theory” that I interact with is quite different from what’s typically encountered in interp world (and this needs fixing). In particular what you call the SLT insights are in fact much older and standard (and in general aren’t related to singularities)

Dmitry Vaintrob 27 Apr 2026 0:43 UTC
LW: 4 AF: 1
0
AF
in reply to: LawrenceC’s comment on: The paper that killed deep learning theory
Very good nitpicks. I definitely don’t know my physics history well (but even with my limited knowledge, I was gesturing at a cartoon level of understanding that mixed different early-20th-century pictures on different phenomena).

Re mean field—it’s not higher order, but lower order. Mean field is to NTK what classical mechanics is to quantum mechanics (in particular NTK + higher order corrections still has most of the bad generalization properties of NTK). The new insight is that while the expansion in NTK is always around a trivial classical theory, nontrivial classical theories also exist and are better-behaved from a complexity viewpoint

Dmitry Vaintrob 26 Apr 2026 21:22 UTC
LW: 26 AF: 9
1
AF
on: The paper that killed deep learning theory
I like this post and the “theory of deep learning” posts. But I think I still haven’t figured out how to model your view, especially the specifics of the pessimism here. Maybe we should discuss in person. In particular I’m not sure what “deep learning theory” encompasses.

My sense of mechinterp theory is that it’s similar to pre-standard-model physics.

Heuristically, here’s a thought experiment. Suppose we’re worried about the sun destroying the earth and want to understand as much as possible about the physics of solar plasmas and supernovas; but we currently only have (a vaguely historical pastiche of) pre-WW2 physics. Physics then roughly had the following components:
1. idealized heuristics: if we view a big object in space as a classical blackbody, we get a good heuristic on some parts of its emission spectrum
2. new behaviors: there’s a consistent way that emission spectra aren’t classical blackbodies, in that they’re quantized. We have only a rough understanding of how and why, and in fact this observation spawned the discovery of quantum mechanics.
3. small toy examples: we can understand the hydrogen atom relatively well. There are some weird factors of 2 and corrections that we can only explain kind of heuristically, but except for these we have a clean, exact quantized spectrum. We see this spectrum in real life materials—but we also see that most of what comprises real materials isn’t hydrogen, and is much more complicated. Some stuff still looks roughly like they could be atomic spectra for other atoms or small molecules, but metal conducting bands are dominated by weird and clearly non-localized behaviors that we don’t understand (and the sun similarly has weird spectral phenomena).
4. limits. There’s a limit where the world is Newtonian, which is sometimes useful, but very inaccurate when modeling the sun. There’s a limit where the world is relativistic. This gives directionally good corrections for some stellar phenomena (e.g. redshift) but is not nearly enough. It seems that there are maybe other limits (like we can mostly blackbox nuclear phenomena at earth temperatures but not at solar temperatures). Most of our understanding comes from sloppily combining together different phenomena coming from the various limits of importance.
5. experimental tools: looking at emission spectra is a really low-bandwidth way to interact with behaviors of interest. While it gives interesting info that points to new phenomena, it at best tells us something about a very limited class of behaviors (photon absorption and emission). In order to understand “how QM works” we have to figure out new tools (maybe vacuum chambers and primitive colliders), and new ways to interpret the output of existing tools.
In our world, iterating on these techniques gave us the standard model (and we understood solar plasma and some basics of supernovas before this). I think the promise of theory is that analogs of these techniques (maybe: SAE, large-N limits, toys like mod-add) will give us robust mechanism-finding tools. I think a lot of criticism of theory sounds to me like someone who in that world is saying “none of the current tools explain the sun even approximately, so we’re on the wrong track”—but that’s not how theory works (until you’ve found a “critical mass” of the standard model, or at least all parts of it relevant in a field of interest, you’ll only be explaining tiny fractions of the observations). I know you’re not making this criticism, but I feel like currently you are flattening the different components above into one notion of “theory good vs. theory bad”.

I’d guess that you’re skeptical about whether the analogs of 1-5 in ML theory are actually useful for “making progress towards the standard model”, but I’m not sure from your post which of these you think is most lacking (or if this picture is even compatible with your criticism).

My guess is that your issue is wrt something like my #1: certain heuristics that people were excited about and hoped would explain generalization turned out to be more complicated. My view is that in modern theory, VC dimension is considered largely defunct in models with nontrivially interesting data (even as simple as mod-add), but I’m not sure why this is the important thing about theory. If you take a more modern theory like mean field or even NTK, it has a non-VC notion of generalization: e.g. NTK/ Gaussian processes can replicate generalization in mnist (related to some data spectrum properties), and mean field theory can (currently only on the Bayesian level—this is unpublished work with Kaarel) explain generalization on polynomially many samples of any mechanism that can be encoded in a small (algorithms) circuit. It’s of course not guaranteed to converge to the same mechanism, but has the same notion of learnable vs. un-learnable on a polynomial-vs-exponential complexity theory level. It also does replicate the correct modular addition generalizing algorithm (NTK does not).
What links here?
- Maybe I was too harsh on deep learning theory (three days ago) by LawrenceC (30 Apr 2026 6:57 UTC; 111 points)

Dmitry Vaintrob 5 Apr 2026 18:19 UTC
6 points
0
in reply to: J Bostock’s comment on: Mean field sequence: an introduction
Thanks! Yeah that’s right. My colleague Nischal Mainali uses the term cavity method for this bulk-system distinction (what I call “background-foreground” in the body). I think the term originally meant something a little more specific and spin-glassy but has become the term of art for all mean field settings, at least in certain stat-phys contexts?

And great question. If you have a large D-dimensional space of fields associated to neurons, you might a priori think that you would need something like exp(D) neurons to “fully sample” the distribution (i.e. get something that looks dense like my point cloud in the relevant space of fields). But in practice, mean field methods require much much fewer particles to be valid. This happens in physics of course (where one has an infinite-dimensional or huge space of macroscopic observables, but predictions from the infinite-dimensional limit are true already for pretty small systems).

In the NN context I’m working on a paper that explains why in mean field you actually need only polynomially many neurons (in the sample size or some complexity parameter) for the mean field prediction to be true to high order. A useful intuition here is that while the “cloud” of neurons is high-dimensional in general, the thing we ultimately care about for e.g. generalizability is accuracy on a random test input. Reductively, this means that the cloud is one-dimensional and the law of large numbers kicks in very soon. So say we abstractly know that a reasonable mean-field distribution of neurons exists, and the output is additive in the single-neuron field from this distribution to leading order (the standard cavity method assumption). Then it’s a distribution in some high-dimensional function space and might have low-probability regions, may require exponentially many neurons, etc. But if we have some sampler of this space and have sampled $N$ neurons, we immediately have accuracy to within $\log(N)/\sqrt{N}$ on almost all inputs (just by usual CLT arguments; here it’s convenient to use bounded activations, which is why we’re using these in experiments). The nontrivial thing is actually proving that a “good cavity method distribution” exists and isn’t too crazy.
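
The CLT intuition above can be checked in a small toy. This is only a sketch under made-up assumptions: the neuron distribution (Gaussian weights and bias), the tanh activation, the plain average as the "additive" output, and all function names are hypothetical choices for illustration, not the setup of the paper in progress.

```python
import math
import random


def sample_neuron(rng, dim):
    """Draw one neuron (weights, bias) from a made-up fixed distribution."""
    weights = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    bias = rng.gauss(0.5, 1.0)
    return weights, bias


def neuron_field(neuron, x):
    """Bounded single-neuron contribution to the output, in (-1, 1)."""
    weights, bias = neuron
    pre = sum(w * xi for w, xi in zip(weights, x)) / math.sqrt(len(x))
    return math.tanh(pre + bias)


def network_output(neurons, x):
    """Output that is additive in the single-neuron fields (a plain average)."""
    return sum(neuron_field(nrn, x) for nrn in neurons) / len(neurons)


def max_error(n_neurons, dim=8, n_inputs=10, n_reference=5000, seed=0):
    """Worst gap over random test inputs between a width-N net and a wide reference."""
    rng = random.Random(seed)
    inputs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_inputs)]
    reference = [sample_neuron(rng, dim) for _ in range(n_reference)]
    narrow = [sample_neuron(rng, dim) for _ in range(n_neurons)]
    return max(
        abs(network_output(narrow, x) - network_output(reference, x))
        for x in inputs
    )
```

Calling `max_error(n)` for a few widths and comparing against `math.log(n) / math.sqrt(n)` shows the gap staying inside the bound even though the neurons live in an 8-dimensional weight space. The wide reference network stands in for the infinite-width limit, which is itself an approximation.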

In the superposition case things are particularly nice. In the superposition setting I’ve looked at here https://www.lesswrong.com/posts/siu22scEfuKxpSgfK/a-tale-of-three-theories-sparsity-frustration-and, the different superposition components are just independent theories that interact via a mass term that encodes interferences. When the width is small we’re actually not at all in the usual mean field setting (the effective mean field is heavily modified to account for the small width), but the heuristic story is there

Dmitry Vaintrob 10 Feb 2026 17:05 UTC
24 points
1
in reply to: Linch’s comment on: Linch’s Shortform
Not sure, but I have definitely noticed that llms have subtle “nuance sycophancy” for me. If I feel like there’s some crucial nuance missing I’ll sometimes ask an LLM in a way that tracks as first-order unbiased and get confirmation of my nuanced position. But at some point I noticed this in a situation where there were two opposing nuanced interpretations and tried modeling myself as asking “first-order-unbiased” questions having opposite views. And I got both views confirmed as expected. I’ve since been paranoid about this.

Generally I recommend this move of trying two opposing instances of “directional nuance” a few times. Basically I ask something like “the conventional view is X. Is the conventional view considered correct by modern historians?”, where X is formulated in a way that can naturally lead to a rebuttal Y. I then do this for sufficiently ambiguous and interpretation-dependent pairs X and X’ whose “nuanced corrections” are fully opposing (Y and ¬Y). I’ve been pretty successful at this several times, I think.

Dmitry Vaintrob 6 Feb 2026 23:02 UTC
5 points
0
on: Strategy of von Neumann and strategy of Rosenbergs
I think a much more sympathetic and earlier proponent of the second policy would be Niels Bohr, or maybe Klaus Fuchs

Dmitry Vaintrob 29 Jan 2026 13:59 UTC
2 points
0
in reply to: Dmitry Vaintrob’s comment on: How Articulate Are the Whales?
Ah never mind. I just re-read your last sentence and it seems like the papers consider this—in particular if the ocean floor were a factor this effect would likely depend on depth. Very cool “citizen research” piece on your end!

Dmitry Vaintrob 29 Jan 2026 13:55 UTC
2 points
0
on: How Articulate Are the Whales?
Likely this is totally off base, but I wonder if you can distinguish beaming artifacts from environmental distortion / multipath effects where sounds interfere with themselves because of the environment (marine floor etc.). Based on a low-effort chatgpt interaction it seems like there are some studies of whales that measure the same sound in different locations. I wonder if there’s enough publicly available data to see how measurement location affects the distance between peaks.

Dmitry Vaintrob 28 Jan 2026 23:59 UTC
10 points
4
in reply to: Mis-Understandings’s comment on: Ada Palmer: Inventing the Renaissance
Arguably the same is true of modern LLMs. Even a base model is not a “generic person” but a “generic text”. The model ranke-4b is also fine-tuned (at least on question formats and to stay in character). So it’s a reconstructed version
The base-model is an unpolished diamond: it is full of raw potential, but extracting its knowledge is not always an effortless undertaking since it does not respond to questions in a chat-formatted manner.

Dmitry Vaintrob 1 Nov 2025 20:35 UTC
17 points
2
on: LLM-generated text is not testimony
I like the analogy of a LARP. Characters in a book don’t have reputation or human-like brain states that they honestly try to represent—but a good book can contain interesting, believable characters with consistent motivation, etc. I once participated in a well-organized fantasy LARP in graduate school. I was bad at it but it was a pretty interesting experience. In particular people who are good are able to act in character and express thoughts that “the character would be having” which are not identical to the logic and outlook of the player (I was bad at this, but other players could do it I think). In my case, I noticed that the character imports a bit of your values, which you sometimes break in-game if it feels appropriate. You also use your cognition to further the character’s cognition, while rationalizing their thinking in-game. It obviously feels different from real life: it’s explicitly a setting where you are allowed and encouraged to break your principles (like you are allowed to lie in a game of werewolf, etc.) and you understand that this is low-stakes, and so don’t engage the full mechanism of “trying as hard as possible” (to be a good person, to achieve good worlds, etc.). But also, there’s a sense in which a LARP seems “Turing-complete” for lack of a better word. For example in this LARP, the magical characters (not mine) collaboratively solved a logic puzzle to reverse engineer a partially known magic system and became able to cast powerful spells. I could also imagine modeling arbitrarily complex interactions and relationships in an extended LARP. There would probably always be some processing cost to add the extra modeling steps, but I can’t see how this would impose any hard constraints on some measure of “what is achievable” in such a setting.

I don’t see hard reasons for why e.g. a village of advanced LLMs could not have equal or greater capability than a group of smart humans playing a LARP. I’m not saying I see evidence they do—I just don’t know of convincing systematic obstructions. I agree that modern LLMs seem to not be able to do some things humans could do even in a LARP (some kind of theory of mind, explaining a consistent thinking trace that makes sense to a person upon reflection, etc.) but again a priori this might just be a skill issue.

So I wonder in the factorization “LLM can potentially get as good as humans in a LARP” + “sufficiently many smart humans in a long enough LARP are ‘Turing complete up to constant factors’ ” (in the sense of in principle being able to achieve, without breaking character, any intellectual outcome that non-LARP humans could do), which part would you disagree with?

Dmitry Vaintrob 17 Oct 2025 5:45 UTC
5 points
0
in reply to: Cleo Nardo’s comment on: strawberry calm’s Shortform
Very cool, thanks! I agree that Dalcy’s epsilon-game picture makes arguments about ELO vs. optimality more principled

Dmitry Vaintrob 17 Oct 2025 2:14 UTC
4 points
2
in reply to: Cleo Nardo’s comment on: strawberry calm’s Shortform
I really like this question and this analysis! I think an extension I’d do here is to restrict the “3 reasonable moves” picture by looking at proposed moves of different agents in various games. My guess is that in fact the “effective information content” in a move at high-level play is less than 1 bit per move on average. If you had a big gpu to throw at this problem you could try to explicitly train an engine via an RL policy with a strong entropy objective and see what maximal entropy is compatible with play at different ratings
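
To make “bits per move” concrete, here is a minimal sketch of the quantity in question: the Shannon entropy of a distribution over candidate moves. The function name and the example weights are made up for illustration; a real measurement would take the distribution from an engine’s policy output.

```python
import math


def entropy_bits(move_weights):
    """Shannon entropy, in bits, of a (possibly unnormalized) distribution over moves."""
    total = sum(move_weights)
    probs = [w / total for w in move_weights if w > 0]
    return -sum(p * math.log2(p) for p in probs)


# Three equally reasonable moves carry log2(3), about 1.58 bits.
uniform_three = entropy_bits([1, 1, 1])

# If one of the three is strongly preferred, the entropy drops below 1 bit
# (about 0.57 bits for these made-up weights).
skewed_three = entropy_bits([0.90, 0.05, 0.05])
```

So the “3 reasonable moves” picture and “less than 1 bit per move” are compatible whenever one of the reasonable moves is heavily favored.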

Dmitry Vaintrob 7 Sep 2025 22:35 UTC
87 points
8
on: Dmitry Vaintrob’s Shortform
SLT is a thermodynamic theory of Bayesian learning, but not the thermodynamic theory of Bayesian learning

SLT provides a rigorous mathematical framework for Bayesian learning in a certain regime, but I argue its practical applicability to real neural networks (even in a Bayesian learning/ high-level modeling context) is limited by finite-size effects and high-dimensionality. The valuable empirical work in this space is better understood as ‘thermodynamic interpretations of ML’ rather than validations of SLT proper

I’ve been having lots of conversations with people about SLT. I like SLT as a model for Bayesian learning a lot. At the same time I think that the assumptions of SLT are a model of the reality of learning (including Bayesian learning), in the same way that variants of the harmonic oscillator are a model of a physical system. There are some results that show that every physical system under some assumptions is a harmonic oscillator in a limit, but this limit doesn’t always hold.

I think the place where I am bothered by SLT rhetoric is where interesting experiments get done and interesting thermodynamic parameters get found, but instead of viewing this as results in “thermodynamic interpretation of ML”, there is an incorrect assumption that the observed phenomena are explained (fully or up to a controllable error) by an expansion around a singularity of a singular learning system.

I’m planning on writing more about this, but I’ll try to write out the key arguments to let people look at them and to see if I’m getting something wrong.

Essentially, I think it would be good to coordinate on a language for talking about Bayesian learning that doesn’t overindex on the singular learning limit, and instead uses physics terms (free energy, susceptibility, heat capacity) for the actual observed invariants. I think that in many ways SLT is moving in this direction, and I’m excited about what they have done.

First the things I agree with:
1. We are doing Bayesian learning (or a mild variant of “Boltzmann learning”, which allows rescaling the “size” of Bayesian updates by a scalar factor)
2. If we fix a neural network as a statistical system (its data distribution, architecture, loss) and take the number of data samples, n, to infinity, there is a limit where the singular learning prediction is true.
So there is always a regime where n is sufficiently large that SLT gives an exact answer to Bayesian learning. Why am I objecting to a (mathematically true) fact?

Essentially, the two key issues are that
- Neural networks are high-dimensional systems, and this makes errors in SLT approximation potentially very large (even exponentially large in something like number of parameters) at finite data.
- In order to get the exact Watanabe approximation to hold, we need to assume that for infinite data the singularity is exact. Thus if we have some predicted singular loss $L_{\mathrm{sing}}(\theta)$ depending on a weight choice $\theta$, then we can (at least wrt the guarantees given by SLT theory) use the singularities of $L_{\mathrm{sing}}$ to get a prediction for the asymptotic only if $L_{\mathrm{sing}}(\theta) = L(\theta)$ exactly in the infinite-data limit. If there is any small approximation issue that relates to the architecture / data rather than to the number of samples n (e.g. the difference between the discrete Fourier Transform and the continuous FT on the circle for modular addition, approximations of a polynomial by sigmoids in modular addition and other contexts, even bit complexity issues, etc.), then we can’t rely on the SLT prediction, at least theoretically. Here one might a priori expect that the singular information from an approximate model $L_{\mathrm{sing}}$ of the loss would be predictive of the singularity of the true loss $L$, but in fact this is false: the property of being nontrivially singular at all is a measure-zero property of a (loss) function, so if there are any modeling assumptions or possible sources of error or noise, we expect the “true infinite limit” Watanabe prediction to be the same one as for a “nonsingular” loss (i.e., positive-definite Hessian at the limit). Sometimes there are small corrections (often called “gauge”) from the architecture, but these are small compared to the empirical singularity-like effects one observes from thermodynamic measurements.
I want to especially harp on the fact that in the worst case it takes an exponential number of samples, and therefore exponential loss precision of the difference $L_{\mathrm{sing}} - L$ (more precisely, exp of some power of the number of parameters), to mathematically guarantee that the SLT prediction is correct. Thus an argument shaped like “SLT is mathematically correct” is, for realistic models, true but boring.
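
A one-parameter toy makes the measure-zero point concrete. This is an illustrative example, not one taken from the papers discussed here: take a singular loss $\theta^4$ (learning coefficient 1/4) and a perturbed loss $\theta^4 + \epsilon\theta^2$, which is nonsingular for any $\epsilon > 0$ (learning coefficient 1/2). The sketch below reads off an effective $\lambda$ as the slope of the free energy $F(n) = -\log \int e^{-n L(\theta)}\,d\theta$ against $\log n$. All names and numerical settings are hypothetical.

```python
import math


def partition_function(n, eps, cutoff=50.0, steps=20000):
    """Midpoint-rule estimate of Z(n) = integral of exp(-n * (t**4 + eps * t**2)) dt."""
    # Beyond t_max the exponent exceeds `cutoff`, so the integrand is negligible.
    t_max = (cutoff / n) ** 0.25
    if eps > 0:
        t_max = min(t_max, math.sqrt(cutoff / (n * eps)))
    h = t_max / steps
    total = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        total += math.exp(-n * (t ** 4 + eps * t ** 2))
    return 2.0 * h * total  # the integrand is even in t


def effective_lambda(n, eps):
    """Finite-difference slope of the free energy F = -log Z against log n."""
    f_lo = -math.log(partition_function(n, eps))
    f_hi = -math.log(partition_function(2.0 * n, eps))
    return (f_hi - f_lo) / math.log(2.0)
```

With `eps = 1e-4`, `effective_lambda(1e3, 1e-4)` sits near the singular value 0.25, while `effective_lambda(1e12, 1e-4)` sits near the nonsingular value 0.5. The crossover happens around $n \sim \epsilon^{-2}$, so the smaller the modeling error, the longer the singular model looks right, but the true asymptotic is always the nonsingular one.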

However my pair of counterarguments isn’t strong by itself. There are lots of cases in the context of mathematical modeling of complex systems where theory says that we might (in worst-case situations) need to wait exponentially long/ get exponential errors/ etc., but in practice finite time suffices. And I think that there are interesting systems where the modeling assumptions above are correct. It’s just that you shouldn’t assume this by default.

Thus it is meaningful to ask the following question:
- In what learning problems, what regimes, and at what scales can Bayesian learning be described by an SLT prediction? In other words, when can we find a function $L_{\mathrm{sing}}(\theta)$ that is a “sufficiently good” approximation of the true infinite-sample loss $L(\theta)$ and whose singularities “meaningfully describe the thermodynamic behavior” of Bayesian learning?
I think this is an interesting question. One might object to it by saying that this is hard to measure, since Bayesian learning is extremely hard to study in “realistic” cases. However I’d argue that we have enough examples to start studying this. And again, my sense here is that SLT predictions fail already to first order (i.e., for predicting e.g. the correct asymptotic for $\hat{\lambda}$ up to O(1) rescaling), but they fail in an interesting way.

SLT failures in Bayesian learning.

I’ll write more about this later. But let me give two examples where I think this is the case.

Grokking modular addition (with MSE loss)

In this paper, grokking in (MSE) modular addition is analyzed using a technique called “mean field theory” (this is an extension of more classical work on things like the neural tangent kernel, which drops the highly restrictive assumption that the system is in a “lazy learning” regime: i.e., that the solution is a small perturbation away from a “trivial vacuum”). This is a paper about Bayesian learning and it gives exact predictions that are confirmed by experiment (one can empirically “approximate” Bayesian learning by something called Langevin SGD). The prediction in particular implies an expansion for the SLT term $\hat{\lambda}$ (the “heat capacity”) and for the free energy (roughly, the stat-phys version of what is called “basin volume” in SLT). It turns out that the free energy in the relevant regime is actually explicitly not controlled by any basin around a singularity, but is rather given to first order by a high-dimensionality phenomenon (similar to critical phenomena in thermodynamics). Specifically, the first-order approximation of the free energy is compatible with the system having some fixed number of degrees of freedom per neuron (here: “row of the input weight matrix”). More precisely this paper predicts (and experiment verifies) that to first order, the NN learns to randomly sample each “weight” row from some fixed distribution. (The resulting NN has approximately correct outputs by the central limit theorem; but if one is interested in a more realistic context with a small number of neurons compared to p, higher terms in the mean-field expansion let you predict regimes with higher and higher accuracy; the leading free-energy term will still be the same.) This term is, importantly, not even a rational number (as SLT would predict), but some numerical integral associated with the neuron distribution (not dominated by any one value or any small set of moments, except in certain limiting regimes).
Note that (similar to SLT) the mean-field prediction holds, in a certain idealization, for any “input number” n so long as n is less than some exponential in the number of neurons.
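
To make the “Langevin SGD” remark above concrete, here is a minimal sketch of Langevin sampling from a tempered posterior $\propto \exp(-nβ L(θ))$. It is my own toy illustration and not the setup of the paper: it uses the full gradient rather than a minibatch estimate, and the names `langevin_samples`, `toy_posterior_variance`, `n_beta` and `step` are hypothetical. The toy loss $L(θ) = θ^2/2$ is chosen only because its posterior variance $1/(nβ)$ is known in closed form.

```python
import numpy as np

def langevin_samples(grad_loss, theta0, n_beta, step, n_steps, rng):
    """Unadjusted Langevin dynamics targeting the density exp(-n_beta * L(theta)).

    grad_loss: gradient of the loss L (full-batch here; Langevin SGD would
               use a minibatch estimate instead).
    n_beta:    number of samples times inverse temperature (assumed known).
    """
    theta = np.array(theta0, dtype=float)
    samples = np.empty((n_steps,) + theta.shape)
    for t in range(n_steps):
        noise = rng.normal(size=theta.shape)
        # Drift down the tempered loss gradient, then add Gaussian noise.
        theta = theta - 0.5 * step * n_beta * grad_loss(theta) + np.sqrt(step) * noise
        samples[t] = theta
    return samples

def toy_posterior_variance(n_beta=50.0, step=1e-3, n_steps=100_000, seed=0):
    """Sample variance for L(theta) = theta^2 / 2, whose target variance is 1 / n_beta."""
    rng = np.random.default_rng(seed)
    samples = langevin_samples(lambda th: th, np.zeros(1), n_beta, step, n_steps, rng)
    return samples.var()
```

With a small step size the sample variance should land close to $1/(nβ)$, up to discretization bias and Monte Carlo noise; for a real network one would replace `grad_loss` by a minibatch gradient of the training loss.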

LoRA for empirical “approximately singular” matrices from ML

For our second example, we look at a two-layer “deep(ish) linear network” that tries to model $W_{in} W_{out} = M$ (here we replace the ReLU with a trivial nonlinearity, and take $W_{in}$, $W_{out}$ to be the trainable parameters and $M$ to be the fixed “target”). There is an exact formula for the Bayesian learning prediction of this network in terms of the singular values of $M$ (“singular” is a good term here :). The Bayesian learning dynamics are entirely controlled by the “entropy” or “volume” function $\mathrm{Vol}(s_{1}, s_{2}, \dots, s_{d})$, which is roughly the volume of the “space of pairs of matrices $W_{in}$, $W_{out}$ of bounded size whose product is $M$” (the “bounded size” is a secret parameter here, which is related to the “number of training points” parameter n in SLT).

If the (min of) input/output width is larger than the “hidden layer” width, the function $\mathrm{Vol}(s_{1}, \dots, s_{d})$ has singularities and SLT in this case makes a nontrivial prediction: namely, that if some fraction of the singular values are zero, $s_{k} = s_{k + 1} = \dots = s_{d} = 0$ (note we can assume WLOG that these are the last $d - k + 1$ values, since singular values are un-ordered), then the function $\mathrm{Vol}$ has a singularity. SLT now implies that if the “target” matrix $M$ has lower-than-full rank (i.e. some of the $s_{i}$ are exactly 0) then in the high-data limit $n \to \infty$ we have an exact asymptotic on the free energy function $F(n)$ (as a function of dataset size).
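
As a sanity check on what “the volume function has a singularity at a zero singular value” means, here is a sketch of the smallest possible case, $d = 1$ with scalar weights $a, b$ and target $s$. This is my own toy and not the exact formula referred to above: the tolerance `eps`, the bound `R`, and the function names are assumptions made for illustration. For $s = 0$ the area of $\{|a|, |b| \leq R,\ |ab| \leq ε\}$ works out to $4ε(1 + \log(R^{2}/ε))$, so it picks up a logarithmic factor, while for $s \neq 0$ the area is simply proportional to $ε$.

```python
import numpy as np

def vol_scalar_exact(eps, R=1.0):
    """Area of {|a|, |b| <= R, |a*b| <= eps}; valid for 0 < eps <= R**2."""
    return 4.0 * eps * (1.0 + np.log(R**2 / eps))

def vol_scalar_numeric(s, eps, R=1.0, grid=200_000):
    """Midpoint-rule area of {|a|, |b| <= R, |a*b - s| <= eps}."""
    h = R / grid
    a = h * (np.arange(grid) + 0.5)            # a > 0 only; a < 0 mirrors it via b -> -b
    upper = np.minimum(R, (s + eps) / a)
    lower = np.maximum(-R, (s - eps) / a)
    b_len = np.clip(upper - lower, 0.0, None)  # length of the allowed b-interval
    return 2.0 * h * b_len.sum()

# Volume per unit tolerance: grows like log(1/eps) at s = 0, stays flat at s = 0.5.
for s in (0.0, 0.5):
    print(s, [vol_scalar_numeric(s, eps) / eps for eps in (1e-2, 1e-3, 1e-4)])
```

The growing ratio at $s = 0$ is the one-dimensional shadow of the singular behavior; the point of the surrounding discussion is that in high dimension, with many small but nonzero $s_{i}$, this clean dichotomy is not what controls the free energy.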

Here, in order for the SLT asymptotic to be true, we want to assume that the trailing singular values $s_{k}, \dots, s_{d}$ of $M$ are exactly zero and that the data size parameter n is much larger than the inverse square of the smallest nonzero singular value (this is the “spectral gap” assumption). If we want some kind of exact or asymptotic formula for the free energy that makes valid predictions for n without the spectral gap assumption, or without the assumption that the singular values are exactly 0, we can simply plug the (easily computable) “true” singular values $s_{1}, \dots, s_{d}$ into the (known) exact formula, or into a known expansion whose validity (/error bound) we can mathematically verify, and see what we get.

Ultimately this depends on what deep linear models one would want to model “in practice” in the context of ML. This might at first seem like a silly problem: ML is explicitly nonlinear and any deep linear model is just a toy.

However, this is not entirely true: in some cases for a realistic neural net, one is interested in doing “LoRA” decomposition. LoRA means different things in different contexts, but one standard context is decomposing a weight matrix $W = W^{ℓ}$ that appears in some layer $ℓ$ of the model into a product of two matrices $W = W_{L} W_{R}$, with $W_{L}$ and $W_{R}$ having lower rank. In general in such a context, one would then train all the weights of the model together (and again, typically LoRA is used in a slightly modified context where a low-rank product $W_{L} W_{R}$ somehow supplements rather than replaces an intermediate layer). Nevertheless, we can always model “some part” of LoRA learning as learning an approximate factorization of a matrix $W$ into a product $W_{L} W_{R}$ (this corresponds to “freezing” all weight layers except those that define $W$, and modeling the loss as an approximation loss on $W$; sketchy, but I’d guess a reasonable “directional” guess for asymptotics in a suitable regime).
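
The “approximation loss on $W$” reduction above can be sketched as plain gradient descent on $\tfrac{1}{2}\lVert W_{L} W_{R} - W \rVert_{F}^{2}$. This is an illustrative point-estimate sketch, not the Bayesian computation discussed in this post; `fit_low_rank`, `best_rank_r_loss`, the learning rate and the initialization scale are hypothetical choices. The only fact it relies on is the Eckart-Young theorem, which gives the best achievable rank-$r$ loss in terms of the discarded singular values.

```python
import numpy as np

def fit_low_rank(W, rank, lr=0.05, n_steps=2000, init_scale=0.1, seed=0):
    """Gradient descent on 0.5 * ||W_L @ W_R - W||_F^2 with W_L, W_R trainable."""
    rng = np.random.default_rng(seed)
    d_out, d_in = W.shape
    W_L = init_scale * rng.normal(size=(d_out, rank))
    W_R = init_scale * rng.normal(size=(rank, d_in))
    losses = []
    for _ in range(n_steps):
        resid = W_L @ W_R - W
        losses.append(0.5 * np.sum(resid**2))  # loss before this step's update
        grad_L = resid @ W_R.T
        grad_R = W_L.T @ resid
        W_L -= lr * grad_L
        W_R -= lr * grad_R
    return W_L, W_R, np.array(losses)

def best_rank_r_loss(W, rank):
    """Eckart-Young floor: half the squared mass of the discarded singular values."""
    s = np.linalg.svd(W, compute_uv=False)
    return 0.5 * np.sum(s[rank:]**2)
```

Comparing `losses[-1]` with `best_rank_r_loss(W, rank)` shows how much of the residual loss is forced by the singular values beyond the chosen rank, which is exactly the part of the spectrum the rest of this example is about.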

Once we have reduced to this problem, we can now write down the exact Bayesian prediction and the SLT prediction for various values of “dataset size” n. Both predictions depend only on the singular values $s_{1}, \dots, s_{d}$ of the matrix $W$ (this can be seen using symmetry). Now empirically, it turns out that one can approximate the “bulk” of these singular values relatively well, up to a constant, by a power law $s_{i} = i^{-α}$ for some exponent $α$ on the order of 0.3 to 0.5 (see here for example). Here the “bulk” captures most of the singular values and it’s in some sense “pretty singular”: for the majority of i, the value $i^{-α}$ is pretty small, i.e. “close to a singularity”.
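
For readers who want to check the power-law claim on their own weight matrices, a minimal sketch of the fit is a least-squares line in log-log coordinates over the bulk of the spectrum. The function name and the `lo_frac`/`hi_frac` cutoffs that define the “bulk” are my own assumptions, not taken from the linked source.

```python
import numpy as np

def fit_power_law_exponent(singular_values, lo_frac=0.05, hi_frac=0.9):
    """Least-squares slope of log s_i against log i over the 'bulk' index range."""
    s = np.sort(np.asarray(singular_values, dtype=float))[::-1]  # descending order
    d = len(s)
    idx = np.arange(1, d + 1)
    keep = (idx >= max(1, int(lo_frac * d))) & (idx <= int(hi_frac * d)) & (s > 0)
    slope, intercept = np.polyfit(np.log(idx[keep]), np.log(s[keep]), 1)
    return -slope, np.exp(intercept)  # (alpha, constant prefactor)
```

For an actual layer one would pass `np.linalg.svd(W, compute_uv=False)`; the returned prefactor is the “up to a constant” part of the statement above.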

Thus it makes sense to extrapolate this regime and ask whether, for a matrix with singular values following a power law $s_{i} \sim i^{-α}$, there is some regime of values of n where an SLT prediction (assuming that all $s_{i}$ below some cutoff are zero and keeping only the singular terms from these) actually describes the true value of the free energy to first order. In other words, we’re replacing the “real” model that approximately solves $W_{L} W_{R} = W$ by an idealized “singular” model that approximately solves $W_{L} W_{R} = W_{sing}$, with the singular values of $W_{sing}$ given by the formula $s_{i}' = \begin{cases} s_{i}, & i \leq i_{0} \\ 0, & i > i_{0}. \end{cases}$
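
The replacement of $W$ by $W_{sing}$ can be written down directly with an SVD. The sketch below is illustrative only: the helper names and the synthetic matrix are hypothetical, and the “tail energy fraction” it reports is a crude proxy for how much is being thrown away, not the free-energy calculation itself.

```python
import numpy as np

def truncate_singular_values(W, i0):
    """Return W_sing: W with every singular value of index > i0 set to zero."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :i0] * s[:i0]) @ Vt[:i0]

def tail_energy_fraction(s, i0):
    """Share of sum(s_i^2) carried by the singular values that get zeroed."""
    s = np.asarray(s, dtype=float)
    return np.sum(s[i0:]**2) / np.sum(s**2)

def power_law_matrix(d, alpha, seed=0):
    """Synthetic d x d matrix with singular values i^(-alpha), random singular vectors."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.normal(size=(d, d)))
    V, _ = np.linalg.qr(rng.normal(size=(d, d)))
    s = np.arange(1, d + 1, dtype=float) ** (-alpha)
    return (U * s) @ V.T, s

W, s = power_law_matrix(256, 0.4)
W_sing = truncate_singular_values(W, 32)
print(tail_energy_fraction(s, 32))  # share of squared spectrum removed by the cutoff
```

For exponents in the 0.3 to 0.5 range the discarded tail carries a substantial share of the squared spectrum at any moderate cutoff, which gives some intuition for why treating it as “essentially zero” is a strong assumption.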

Unsurprisingly, the answer is a strong “no”: both the prediction that the “approximately singular” $s_{i}$ for $i > i_{0}$ are “essentially zero” and the prediction that the large $s_{i}$ for $i \leq i_{0}$ are large enough to impose a “reasonable gap” completely fail, and even the power law in the asymptotic from the SLT prediction fully fails to capture the power law of the true Bayesian learning prediction in this case. Here again, note that the key issue is the high-dimensionality of the system. (Also note here that I’m blackboxing a lot of math: happy to discuss it more in comments etc., and I’m also planning to write out a more careful version of this later.)

Discussion

One can try to salvage an SLT prediction in the LoRA example (which I think is particularly damning) in a few ways. Maybe:
- The important contribution comes from the “non-power law” part of the tail
- The assumption that LoRA rank reduction is a good model of Bayesian learning more generally is flawed
- etc.
However, I think that together with the known results about modular addition, these failure modes show that in a meaningful way, realistic models do not have a good “singular model” with strong predictive power, even if we are only trying to make predictions about Bayesian learning. The regime where the SLT prediction succeeds certainly exists (this is a rigorous mathematical statement after all), but it requires too large a sample number n and imposes too strong a regularity assumption on the “true” infinite-data loss landscape $L (θ)$ (essentially, an assumption that “small” and “large” phenomena are cleanly separated by a large gap) for it to even approximately give valid predictions in regimes we care about. Again, the “dominant” brunt of my intuition here is that this failure is related to the fact that the predictions of SLT assume that the number of parameters is fixed (i.e. O(1)) as the number of samples goes to infinity. In reality, the number of parameters is quite high, and the regime where SLT predictions hold exactly requires a sample number that is in some sense exponential (or very large) compared to parameter count, and is thus never attained (and essentially uninteresting).

Having said this, I want to point out that SLT has a number of really good results that I am deeply excited about. I think there are good theoretical results for toy models or strongly-asymptotic (but still interesting) regimes where the SLT approximation is exact, which may directionally give good intuition about realistic models (in the same way that the harmonic oscillator gives very deep intuition about quantum and statistical mechanics, despite not all systems being reducible to it; note that in particular, SLT can be understood as a “vastly generalized harmonic oscillator”, with the essential property being the existence of a strongly position-localized, though singular, semiclassical approximation; ignore if these words are meaningless to you).

But also people who label their work as “singular learning theory” have really good empirical results where the measurements being made are strictly thermodynamic, and which continue to work in “explicitly non-SLT” contexts such as mean field theory, which capture the high-dimensionality and have totally different asymptotics than what is predicted in the SLT regime. (Note that mean-field is, of course, itself an approximation/ toy model!)

For example: Timaeus (the organization that “does SLT”) has an elegant result by Garrett Baker et al. that measures the effects of changes in training distribution on susceptibilities; Nina Panickssery and I have a paper on the “lambda-hat estimator” (heat capacity) tracking the description length of the algorithm learned by modular addition in generalizing vs. memorizing models; there is work on saddles; Timaeus has produced work on the relationship of attention heads and a certain thermodynamic quantity. An upcoming paper by Timaeus that I’m very excited about looks at a susceptibility-like (“conjugate variable”) metric on inputs that generalizes influence functions.

All of these results are (I think) very good results directionally linking statistical mechanics invariants with interesting learning behaviors. Also, none of them are “properly SLT results”: they simply assume standard links between information theory, thermodynamics, and Bayesian learning (some of them developed in the learning context by Sumio Watanabe, who first discovered them in the context of his singular models). None of them assume that one is in a regime where the SLT “approximation from singularities” even approximately holds. I suspect that if one were to dig even a little in most of these cases (e.g. look at larger-scale dependence on temperature, relationships between single-neuron statistics, etc.), then one would see that we are in a regime where in fact, in a strong sense, any “singular” toy model would fail to predict the behavior of interest in any regime or with any number of terms of the “singular” asymptotic expansion. (Essentially this belief is due to the fact that the lambda-hat estimator does not even approximately asymptote in the way that SLT predicts in the regimes of interest, and to an intuition, which I’m hoping to write down later, that the SLT approach really fails to track inherently high-dimensional phenomena like grokking.)

I would be happy to be wrong about the failure of the SLT regime, and I retain hope that with sufficient processing/ renormalization, there is some sense in which the thermodynamic behavior of learning is extensively related to singularities in some suitably re-interpreted and lower-dimensional set of “relevant” variables which are not the weights.

Also I am excited about the work that Timaeus is doing and think that the “thermodynamics-aware” approach to learning that they try to follow is significantly underexplored. I just want to point out that there are very deep directions here that decouple from the assumptions of SLT and instead lean into high-dimensionality of the loss landscape, and which produce useful work that can be missed if one’s model of Bayesian learning theory is “everything is singularities”.

Dmitry Vaintrob 28 May 2025 12:32 UTC
26 points
4
in reply to: Caleb Biddulph’s comment on: CBiddulph’s Shortform
This is fascinating! If there’s nothing else going on with your prompting, this looks like an incredibly hacky mid-inference intervention. My guess would be that OpenAI applied some hasty patch against a sycophancy steering vector and this vector caught both actual sycophantic behaviors and descriptions of sycophantic behaviors in LLMs (I’d guess “sycophancy” as a word isn’t so much the issue as the LLM behavior connotation). Presumably the patch they used activates at a later token in the word “sycophancy” in an AI context. This is incredibly low-tech and unsophisticated, like much worse than the stories of repairing Apollo missions with duct tape. Even a really basic finetuning would not exhibit this behavior (otoh, I suppose stuff like this works for humans, where people will sometimes redirect mid-sentence).

FWIW, I wasn’t able to reconstruct this exact behavior (working in an incognito window with a fresh ChatGPT instance), but it did suspiciously avoid talking about sycophancy, and when I asked about sycophancy specifically, it got stuck in inference and returned an error.
