I do AI Alignment research. Currently at METR, but previously at: Redwood Research, UC Berkeley, Good Judgment Project.
I’m also a part-time fund manager for the LTFF.
Obligatory research billboard website: https://chanlawrence.me/
I certainly do expect us to miss plenty of bugs.
To be clear, I’m not critiquing your work with this! And I don’t think “bugs” is the right characterization—I totally expect even a basic fresh reimplementation to catch obvious bugs—rather than some fundamental limitation in the research methodology.
There are also other things that need to be done. In a sane world, there would be multiple replications of every AI safety study (I’m working on that).
Just got around to your comment. I’m glad you’re doing this! In my spare time I’ve done a bunch of lower-effort critiques/replications of other research work (at least, much lower effort than your ‘Reevaluating “Model Organisms of Emergent Misalignment”’ piece), one of which I wrote up for InkHaven. I think this is valuable, though I worry that a lot of replication work is too credulous to serve as a bug detection mechanism. (Generally it’s very junior people doing the replication, who are understandably hesitant to critique established work, and who lack the context to make some of the more incisive critiques.)
Good citation, that paper seems to have slipped my recollection (probably because it’s less famous, as you said). Added a footnote to clarify.
Good start. Sad this post didn’t get more upvotes, and so I didn’t see it until now.
Some unsolicited feedback on the post:
I would include more description of what the questions are, and how your setup differs from the Redwood/Anthropic one and why. (I was able to find this by reading your repo, but a post shouldn’t require readers read the repo in order to understand it.) This is probably the biggest issue I have with the post. Why didn’t you use the animal welfare setting? Is it because v4 doesn’t care about animals, or did you find it already knew the setting to be artificial?
Similarly, would be good to contextualize your V4/R1 numbers against previous results, e.g. some recent results on recent Anthropic models. For example, the absolute rate of compliance for V4 is a lot lower than Opus 4.5/4.6 etc, but still a lot higher than R1 (ditto compliance gap).
Post would be a lot more readable with a few bar plots to summarize the results, rather than spreading it out in many tables.
Relatedly, would be good to break down which of the questions the models refused/accepted/etc, and see if there are any patterns.
Yeah, the main application of deep learning theory is muP; the main application to safety is probably not that. muP by itself is not relevant to safety, except insofar as it means people don’t use NTKs as their toy model (though they probably weren’t anyways).
I bring up muP because it’s the main (or only) concrete application of deep learning theory; insofar as you dismiss theory b/c there’s no wins, muP is evidence against that conclusion, in the same way that a lack of other wins is evidence for.
Yep
Thanks for the mention!
Amusingly, it was this shortform that caused me to start writing the post: I started drafting a response on the issues I had, and then it ballooned into a full investigation and Ben Sturgeon got pulled in as well.
Yeah, the dense supervision point is what I meant by SFT >> RL for efficiency. You get a bunch more bits per forward pass.
The on policy distillation/dAgger > SFT/behavioral cloning seems like a smaller improvement in comparison to that, but you’re right that it is an improvement.
In Chess, cheating is rampant not at the top professional level (probably) but at the level just below that — iirc there’s a lot of IMs banned for cheating on titled tuesday on chess.com? At least, many of the top players believe that cheating is rampant on online chess (though not amongst top players), and a lot of casual tournaments (eg between streamers) have had people get caught just aping stockfish. And there’s definitely a lot of accusations thrown around for online chess cheating that are generally considered unsubstantiated (the former world champion Kramnik being the most famous serial accuser).
Online chess tournaments not having rampant cheating seems to match the stuff Ashe is saying in their post:
The symbolic camera controls – which would be easy to circumvent for a dedicated cheater – seemed sufficient to curb almost all cheating in a way that threats or impotent references to “fair-play committees” were failing to.
when you add actual barriers to cheating, even if they’re circumventable, cheating rates drop a lot, especially at the top level.
Of the factors you mention, I’m not sure how FIDE’s willingness to ban compares to Go organizations such as IGF or EGF. Plausible the unified nature might make a difference, but I suspect FIDE’s eagerness to strip titles is not any higher than the go equivalents. My guess is the other factors probably do little if anything: Magnus insinuating Hans Niemann was cheating (or Hikaru’s more direct accusations) probably had little effect in comparison, and Kramnik’s accusations probably made the cheating problem worse if anything.
If you’re talking about OTB chess, then those tournaments have crazy amounts of security (some would say security theater) to prevent cheating: everyone has to leave their phone outside, the players are scanned with various tools, streams are on a long delay, and so forth.
(And like in Ashe’s post, when people are caught cheating in chess, their justification is normally “I just referenced Stockfish occasionally” or “I just used it to suggest moves, I was playing”, and so forth)
Properly done, the methodology should find that sufficiently overtrained low parameter models ~= distilled low parameter models, since there isn’t more capacity to memorize. But yeah, that would be another good sanity check to run.
Wait, why are distilled models better than just overtraining the small model again? My guess is it’s mainly because SFT >> RL for efficiency, and cloning good CoTs is easier than sampling them via random exploration.
Really good piece, thanks for writing it. History of X posts like this one are unfortunately rare, and I’m glad you’re helping to fix this. The story you tell seems quite similar to what’s been happening in chess as well (including players memorizing long sequences of computer moves and then immediately floundering when out of prep, though it seems the case for chess play improving is stronger than for go, perhaps?).
I’ve seen a lot of the same “not getting it” phenomenon you described while interacting with much younger people who did coding with weaker coding assistants (eg late 2024 era Cursor agents). People learned to rely on Sonnet 3.7 to generate code, and once they ran into bugs that Sonnet couldn’t fix (often because of poor decisions made by Sonnet a few hours earlier), they were stuck.
I see the same issue these days with ML research and Claude Opus/GPT-5.5: the models allow people to think they’ve thoroughly investigated the hypotheses under consideration without once looking at the data or code base with their own two eyes. Predictably, this leads to a lot of slop going through.
The main similarity between these coding examples and your go/math stories is that there’s a feeling of flinching away, of denial, of not wanting to recognize one’s own lack of understanding. Learning requires doing things that are challenging and noticing where you don’t understand. Any CS novice is far below the level of even 2024-era coding agents, so any suitable challenge will require writing code much less efficiently for a possibly long period of time. LLMs also are notoriously good at generating bullshit that looks legitimate, and sycophantically praising users for shallow understanding, which means noticing confusion is harder as well.
The main disanalogy is that these coding failures happen because the AI models weren’t good enough to hand off full control to, rather than an exogenous removal of the go engine or change-of-domain that invalidates heuristics. Currently, there’s a practical reason to understand your codebase, at least for complicated research code. As AI gets better, someone who can only vibe code will catch up to someone who understands their code base on a deep level, for larger and larger code bases. (Though, the situation seems more analogous for the case of go prep in professional games?)
A second, but perhaps more important disanalogy is that you can get the AI to explain things to you, and help you, if you remain sufficiently vigilant and skilled at noticing your confusion. Go and Chess engines cannot explain their reasoning in English, and interpretability is incredibly far from extracting useful insights. But noticing confusion often requires actually manually inspecting your data/code, doing the math yourself by hand (perhaps heuristically), or carefully scrutinizing research outputs, which will slow you down. And as often as not, the confusion will result from your misunderstanding or errors, as opposed to mistakes the model has made, which is understandably frustrating.
A question I have is, have the styles of memorized computer moves in the early game changed over time, as engines got better? In chess, this has arguably happened; weaker engines preferred conservative positions with equal material and simple strategies, while to today’s stronger engines, almost any opening is a draw. Prep has become less about finding an objectively good line than finding a line where the drawing line for black is very hard to calculate (eg if it requires dynamic aggressive counterplay that humans have difficulty calculating on the spot), or (on the other side) finding a slightly suboptimal move where black is disadvantaged according to an engine but which takes white outside of their prep.
Perhaps a more important question is, do you plan on writing more history of X posts?
Yes! I was familiar with PDLT as well, and I do think it’s a similar-in-spirit approach to MFT (if not a continuation of the signal-propagation MFT work). Thanks for the pointer.
Yeah, good question. I think the word “data-dependent” has different connotations (even if it is standard terminology).
Using the sketch definition
With high probability over possible training sets S, for all h in the hypothesis class, we have |expected test error of hypothesis h - empirical error of h on S| ≤ (some bound involving the size of the training data and high level properties of h).[2]
You’re right that properties of h are, in general, different from properties of the data. The “data-dependent” part enters this inequality when the right hand side depends on properties of the learned hypothesis.
(Of course, a sufficiently nitpicky person might argue that the data-dependent bounds are uniform-convergence bounds over an implicit, S-indexed sub-class — “all h’ with ‖W’‖_spec ≤ ‖W(S)‖_spec”. But given this sub-class is S-indexed, I think it’s still fair to call the bound data-dependent.)
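To make the distinction concrete, here is a sketch of the two bound shapes side by side. The notation is my own assumption and not from the post: $L_D(h)$ is the expected test error, $\hat{L}_S(h)$ is the empirical error on $S$, $n = |S|$, $\delta$ is the failure probability, $W_h$ is the weights of $h$, and the $\epsilon$ functions are unspecified placeholders rather than any particular published bound.

```latex
% Uniform-convergence shape: the right-hand side depends only on
% n, delta, and the hypothesis class H as a whole.
\Pr_{S \sim D^n}\Big[\, \forall h \in \mathcal{H}:\;
  \big| L_D(h) - \hat{L}_S(h) \big| \le \epsilon(n, \delta, \mathcal{H}) \,\Big] \ge 1 - \delta

% Data-dependent shape: the right-hand side may also depend on a
% property of the particular hypothesis, e.g. a spectral norm of its weights.
\Pr_{S \sim D^n}\Big[\, \forall h \in \mathcal{H}:\;
  \big| L_D(h) - \hat{L}_S(h) \big| \le \epsilon\big(n, \delta, \|W_h\|_{\mathrm{spec}}\big) \,\Big] \ge 1 - \delta
```

Evaluating the second inequality at the learned hypothesis gives a right-hand side involving $\|W(S)\|_{\mathrm{spec}}$, which varies with the training set. That is the sense in which the bound is data-dependent even though it is written in terms of a property of h.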
I think this is a reasonable confusion, and I’ll expand the footnote to clarify.
Which card(s) must you turn over in order to test that if a card shows an even number on one face, then its opposite face is blue?
Er, maybe she edited in later, but this was in GradientDissenter’s wording too, no?
For future reference, after speaking more with Dmitry + reading more of the papers linked in the Simon et al “Scientific Theory of Deep Learning” paper, I’ve become (slightly) more positive on deep learning theory, at least the mean-field line of work he mentioned in his comments.
I wrote up a bit of the history of that line of work, and why I’ve become more optimistic on deep learning theory in the past few days: https://www.lesswrong.com/posts/6SRq7mZ97Dwuavwb6/maybe-i-was-too-harsh-on-deep-learning-theory-three-days-ago
Hm. I’m going to first nitpick your specific example of pre-standard model physics and the sun, then respond to the overall point.
My impression is that pre-WW2 physics had a lot more wins re: the sun than either deep learning theory circa 2016 or even current learning mechanics (as used in the Simon et al paper) has for current mech interp. (I’m not familiar with your unpublished mean field theory work, and I also haven’t been keeping up to date on Singular Learning Theory work.)
From the spectrum lines we knew that the sun was made of hydrogen and helium by the late 1920s at the very latest: helium was first discovered in the sun in the late 1800s via its spectral lines, and Cecilia Payne-Gaposchkin’s PhD thesis Stellar Atmospheres that established the sun’s atmospheric composition was published in 1925.
Aston had built the mass spectrograph by 1920, which allowed him to measure the mass of helium and hydrogen nuclei. It turns out that four hydrogen nuclei (that is, 4 protons) weigh a statistically significant amount more than a helium nucleus (Google tells me ~0.7%, though idk how precise their measurements were at the time). In the 1920s, Eddington argued that this mass defect was the source of energy for the sun (though he didn’t understand how the sun fused hydrogen into helium).
The Hertzsprung-Russell diagram was plotted by the early 1900s. Perhaps a naive interpretation says that stars would move left and up along this diagram over time, though Eddington’s 1920s work (and the existence of the mass-luminosity relationship between stars) suggested that the main sequence was a result of stars having different masses and not a thing they’d move along.
We had observations of white dwarfs by the 1920s, we knew they were held up by electron degeneracy pressure by late 1926 (Pauli’s exclusion principle was published in 1925), and Chandrasekhar did the relativistic corrections to get the mass limit on white dwarfs by 1930.
Chandrasekhar’s limit also suggested the existence of neutron stars, which were then proposed in the 1930s: Baade and Zwicky’s 1934 paper suggested that the transition from main sequence star → neutron star was the reason for supernovae.
Much earlier work in biology—eg the speed of evolution and age of the fossil record—already suggested that the sun and the earth have been around in something like their current configuration for hundreds of millions if not billions of years, which put pretty strong limits on the speed of stellar evolution.
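As a quick sanity check on the ~0.7% mass defect mentioned in the Aston item above, here is a minimal sketch of the arithmetic. It assumes modern reference values for the proton and helium-4 nucleus masses, not the measurements available in the 1920s, and the constant and function names are just illustrative.

```python
# Modern reference values (an assumption; not the 1920s measurements).
PROTON_MASS_U = 1.007276           # proton mass, in atomic mass units (u)
HELIUM4_NUCLEUS_MASS_U = 4.001506  # helium-4 nucleus (alpha particle) mass, in u
U_TO_MEV = 931.494                 # energy equivalent of 1 u, in MeV


def mass_defect_u() -> float:
    """Mass lost when four protons end up as one helium-4 nucleus, in u."""
    return 4 * PROTON_MASS_U - HELIUM4_NUCLEUS_MASS_U


def mass_defect_fraction() -> float:
    """Mass defect as a fraction of the initial mass of the four protons."""
    return mass_defect_u() / (4 * PROTON_MASS_U)


if __name__ == "__main__":
    print(f"mass defect: {mass_defect_u():.4f} u")
    print(f"fraction of initial mass: {mass_defect_fraction():.2%}")
    print(f"energy equivalent: {mass_defect_u() * U_TO_MEV:.1f} MeV")
```

With these values the fraction comes out a little under 0.7%, consistent with the figure quoted above. This uses bare nuclear masses and ignores the positrons and neutrinos involved in the actual fusion chain.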
The main missing piece for predicting the fate of the sun was that it wasn’t obvious how white dwarfs came to be, or even when stars would leave the main sequence. (Probably the first explanation was from Chandrasekhar in 1942.) We were also missing a lot of facts about how exactly the sun fused hydrogen into helium and how supernovas would happen.
So I think that, while people did not know whether the sun would go nova or somehow transition into a white dwarf, we knew the composition of the sun (1), the mechanism through which it generated energy (2), had statistical relationships between stellar mass, luminosity, and temperature (3), had proposed two stellar remnants that the sun could eventually become (4, 5), and had a strong sense of the time scale on which these events would happen (6).
Arguably, deep learning theory circa 2016 had only a (6) analogue (namely, how well deep learning seemed to work in practice). Deep learning theory in 2026 has in addition its versions of (3) (scaling laws, which were first noted around 2020) and rudimentary analogues of (4) and (5) (NTKs/infinite width/depth limits in general, perhaps?). But we don’t have any analogues for 1 or 2 (which are the mechanisms through which SGD on deep neural networks produces smart systems) except in a very rudimentary way (something about circuits, gradient scales, and loss landscapes, perhaps?), and there’s a fair amount of evidence that our 4/5 analogues aren’t great.
Specifically, I think the relevant work for understanding the fate of the sun (at least on the scale of “does it go nova”) wasn’t so much the standard model as relativistic quantum mechanics + general relativity, a lot of which had been worked out by the 1930s. Spectral lines gave us the composition of the sun, e=mc^2 suggested the mechanism for energy generation, and relativistic quantum mechanics pointed to two possible stellar remnants.
I think there’s two ways to make your case go through, though:
You can argue that circa 1935, it’d seem that additional work on “pure” quantum mechanics would not help our understanding of the sun, while it was demonstrably very important. So from a “figure out if the sun will explode” standpoint, it might look like we shouldn’t be spending any additional research effort into QM in itself, which would cause us to miss a bunch of facts about the fate of the sun. Examples:
iirc you need a theory of weak interactions to compute the rate of p + p → d + e⁺ + ν in the sun.
we had no explanation of why the sun seemed to produce far fewer neutrinos than expected for decades, and this one actually does require the standard model to solve (though to be fair, we also didn’t notice the problem until 1960).
(Claude tells me that) the actual math for parts of the red giant → white dwarf transition for our sun involved a lot of complicated QM.
I think the closest analogue to the Zhang et al paper re: the Sun and physics probably happened with Kelvin’s calculations about the age of the sun and the earth. He didn’t know about radiation or e=mc^2, so his calculations gave the age of the sun and earth at something like 20 million years. This was far below the hundreds of millions-billions of years bound set by biology results at the time. Maybe one would then conclude in the late 1800s that you should not study classical mechanics in order to understand the sun, or at least, that further investment in research like that was unlikely to produce relevant results.
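For a sense of where a number like that comes from, here is a rough sketch of the textbook Kelvin-Helmholtz timescale, i.e. how long gravitational contraction alone could power the sun at its current luminosity. This is not a reconstruction of Kelvin’s actual calculation: it uses the order-of-magnitude formula t ≈ GM²/(RL) with modern solar values, both of which are my assumptions.

```python
# Order-of-magnitude Kelvin-Helmholtz timescale: t ~ G * M^2 / (R * L).
# Modern SI values (an assumption; Kelvin worked with different inputs).
G = 6.674e-11             # gravitational constant, m^3 kg^-1 s^-2
M_SUN = 1.989e30          # solar mass, kg
R_SUN = 6.957e8           # solar radius, m
L_SUN = 3.828e26          # solar luminosity, W
SECONDS_PER_YEAR = 3.156e7


def kelvin_helmholtz_years(mass: float, radius: float, luminosity: float) -> float:
    """Years of output at `luminosity` that gravitational energy ~G*M^2/R could supply."""
    gravitational_energy = G * mass**2 / radius  # joules, up to an O(1) factor
    return gravitational_energy / luminosity / SECONDS_PER_YEAR


if __name__ == "__main__":
    years = kelvin_helmholtz_years(M_SUN, R_SUN, L_SUN)
    print(f"Kelvin-Helmholtz timescale: ~{years / 1e6:.0f} million years")
```

The result is in the tens of millions of years, the same order of magnitude as the figure above and far short of the hundreds of millions to billions of years suggested by the biology.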
Responses to these two cases:
It’s plausible that this argument might hold up—my technical QM knowledge basically ends with work in the 1940s. But I think it’s not as strong an argument as “your argument implies people shouldn’t study theoretical physics”, because so much of stellar evolution could be (and had been) worked out with just basic relativistic QM and general relativity. So a QFT person could point to the success of relativistic QM as evidence for why additional investment in QM research would further our understanding of the stars.
I think the analogous dismissal of theoretical research here is basically correct—I’m pretty sure that no amount of additional theoretical work without knowledge of radiation, QM, or relativity would’ve let Kelvin calculate the age of the sun correctly from first principles, let alone figured out whether it’d eventually explode.
To clarify, my case isn’t so much “don’t invest in theory research” (I actually think we should invest in much more theory research!) as “I don’t expect classical learning theory approaches to work, and the standard academic response to this (learning mechanics) seems to have just dropped the level of ambition so much as to be uninteresting”. Specifically, I think any theoretical approach that seeks to explain large chunks of deep learning should be able to articulate a clear novel insight à la relativity or QM that differentiates it from pre-2019 learning theory, or else be treated with a level of skepticism that can only be overcome with interesting empirical results.
The Singular Learning Theory people have several insights of this form—e.g. the old learning theory both used a wrong notion of generalization and failed to account for degeneracies in the loss landscape. ARC Theory and the comp mech people both have their own versions of these insights as well. I suspect you can articulate something like this for your work with Kareel (though I haven’t seen it) as well—iirc it was something like “modern versions of mean field theory in physics can handle higher order interactions than the earlier approaches that people used for studying generalization” or something?
Some combination of:
The training procedure for random labels was much, much harder, e.g. 100x more steps (such that the x-axis had to be in a log scale to show it on the same graph as the true label case, a la original grokking/induction head results)
Neural networks couldn’t fit random labels at all, at least on the scale of the datasets they could generalize on.
Thanks for the response!
If by “impressive” in “the state of ‘deep learning theory’ is fairly impressive” you mean “there’s a big mathematical edifice, but not much of it is useful”, then I don’t disagree. Insofar as by “impressive”, you mean has concrete evidence of being “useful” or “at the right level of analysis”, I suspect we would strongly disagree.
(Note that I use the word “impressive” zero times in my post.)
It seems that you think deep learning theory is “impressive” because it contains sophisticated machinery that has done little in practice.
My first response is that deep learning theory in 2016 also had plenty of sophisticated machinery, but all of it turned out to be even more inapposite than current theory. Do you think that, a priori, without looking at the empirical results, you’d find the mathematical machinery of 2026 is sufficiently more sophisticated than in 2016? The additional sophistication of theory itself since 2016 was not the reason people started abandoning deep learning theory in 2016-2018; in what sense does the new work deserve additional credit for revealing that the line of work people didn’t think would work in 2016 also doesn’t work in 2026? In fact, a classic result from 2016 was that neural networks + SGD can easily fit random labels (which never generalize) -- surely this already conclusively demonstrates that any theory of neural network generalization must make reference to the structure of the specific hypothesis neural networks are asked to learn, if not the data itself (if the earlier no-free-lunch theorems from learning theory don’t count).
My second response is a question: was the mathematical machinery really necessary for the insights you point out? The toy models around superposition have arguably been more productive than all of deep learning theory, in that techniques derived from them (SAE variants) are used in production at at least one frontier lab. This is despite the much lower degree of sophistication of such work. (Even math-heavy pieces like the Comp in Sup line of work are far less mathematically tedious than the tensor program calculations, and the productive part of the superposition work did not come from the Comp in Sup line of work!) Similarly, you can derive much of the scaling results for muP with a simple toy model of multi-layered linear networks (as Yang does in either the 4th or 5th tensor program papers, iirc), just like you can derive the result that each neuron firing must activate, on average, exactly 1 neuron with hardly any mathematical model of brain activity at all.
My third response is that, if deep learning theory has indeed not been useful in practice (and indeed, might have failed so conclusively that we should look elsewhere), what makes you think a theory of data would be useful in practice?
The implicit argument you seem to give is that some theory must exist that can adequately explain the success of large deep learning models. And it’s clearly not something that a theory of the learning process or network architecture can explain, so it must be a theory of data that can explain this. But why think there’s an adequate explanation in the first place, let alone one that must be a result of the data or the learning process?
In the final paragraph, you give an explicit argument for this.
Insofar as the intelligence of LLMs is the ability to generalize, the no-free-lunch theorems tell us that this generalization has to reflect common structure of the pre-training data and the fine-tuning task (duh!).
No-free-lunch theorems also tell us that this generalization has to reflect the inductive bias of LLMs and their training procedures! After all, a rock can’t generalize from any amount of pretraining or finetuning data to the real environment in any way but the trivial one.
e.g. a claim that fine-tuning is ‘conditioning nodes in the common sparse hierarchical latent world model’ is descriptive and substantive, but not enough that it is easily falsifiable.
This really does seem like you’re including a claim about how fine-tuning works on neural networks, not just what we’re fine-tuning them on!
I do think that, insofar as deep learning theory has failed so completely that it’s ruled out the relevance of the learning mechanism to generalization, then the people who demonstrated this deserve credit. But I don’t think it has failed to nearly that extent. And even if it had, I think I’d still respond critically to the paper, because the piece doesn’t make that case either, as opposed to a case in favor of learning mechanics being a field worth investing into.
Coming back years later to say: People in 2016 (when the Zhang et al paper was first released) did already know that neural networks were expressive (the work demonstrating neural networks with very high VC dimension occurred in the late 90s and early 2000s).
The hope at the time was not that neural networks themselves lack representativity, but that some combination of neural networks + SGD or neural networks + weight decay or something that people were doing on top of neural networks induced a strong prior against being able to fit random data points. The other important bit of context is that (as this review post demonstrates) a lot of work at the time was interested in constructing uniform bounds that worked regardless of what the learned hypothesis was, as long as the hypothesis was representable by the neural network + SGD/weight decay/smooth data manifold class.
By demonstrating that standard neural network techniques could learn incredibly overfit hypotheses, Zhang et al showed that whatever the class of hypotheses learnable by deep learning was, it included things that you could not get uniform generalization bounds out of, thus invalidating the dominant approach in the field at the time.
Wild. It’s sad if this is the case.