I do AI Alignment research. Currently at METR, but previously at: Redwood Research, UC Berkeley, Good Judgment Project.
I’m also a part-time fund manager for the LTFF.
Obligatory research billboard website: https://chanlawrence.me/
I certainly do expect us to miss plenty of bugs.
To be clear, I’m not critiquing your work with this! And I don’t think “bugs” is the right characterization—I totally expect even a basic fresh reimplementation to catch obvious bugs—rather than some fundamental limitation in the research methodology.
There are also other things that need to be done. In a sane world, there would be multiple replications of every AI safety study (I’m working on that).
Just got around to your comment. I’m glad you’re doing this! In my spare time I’ve done a bunch of lower effort critiques/replications of other research work (at least much lower than your ‘Reevaluating “Model Organisms of Emergent Misalignment”’ piece), one of which I wrote up for InkHaven. I think this is valuable, though I worry that a lot of replication work is too credulous to serve as a bug detection mechanism. (Generally it’s very junior people doing the replication, who are understandably hesitant to critique established work, and who lack the context to make some of the more incisive critiques.)
Good citation, that paper seems to have slipped my recollection (probably because it’s less famous, as you said). Added a footnote to clarify.
Good start. Sad this post didn’t get more upvotes, and so I didn’t see it until now.
Some unsolicited feedback on the post:
I would include more description of what the questions are, and how your setup differs from the Redwood/Anthropic one and why. (I was able to find this by reading your repo, but a post shouldn’t require readers to read the repo in order to understand it.) This is probably the biggest issue I have with the post. Why didn’t you use the animal welfare setting? Is it because V4 doesn’t care about animals, or did you find it already knew the setting to be artificial?
Similarly, would be good to contextualize your V4/R1 numbers against previous results, e.g. results on recent Anthropic models. For example, the absolute rate of compliance for V4 is a lot lower than Opus 4.5/4.6 etc, but still a lot higher than R1 (ditto compliance gap).
Post would be a lot more readable with a few bar plots to summarize the results, rather than spreading it out in many tables.
Relatedly, would be good to break down which of the questions the models refused/accepted/etc, and see if there are any patterns.
Yeah, the main application of deep learning theory is muP; the main application to safety is probably not that. muP by itself is not relevant to safety, except insofar as it means people don’t use NTKs as their toy model (though they probably weren’t anyways).
I bring up muP because it’s the main (or only) concrete application of deep learning theory; insofar as you dismiss theory b/c there’s no wins, muP is evidence against that conclusion, in the same way that a lack of other wins is evidence for.
Yep
Thanks for the mention!
Amusingly, it was this shortform that caused me to start writing the post: I started drafting a response on the issues I had, and then it ballooned into a full investigation and Ben Sturgeon got pulled in as well.
Yeah, the dense supervision point is what I meant by SFT >> RL for efficiency. You get a bunch more bits per forward pass.
The on-policy distillation/DAgger > SFT/behavioral cloning seems like a smaller improvement in comparison to that, but you’re right that it is an improvement.
In chess, cheating is rampant not at the top professional level (probably) but at the level just below that: iirc there are a lot of IMs banned for cheating on Titled Tuesday on chess.com? At least, many of the top players believe that cheating is rampant in online chess (though not amongst top players), and a lot of casual tournaments (eg between streamers) have had people get caught just aping Stockfish. And there are definitely a lot of accusations thrown around for online chess cheating that are generally considered unsubstantiated (the former world champion Kramnik being the most famous serial accuser).
Online chess tournaments not having rampant cheating seems to match the stuff Ashe is saying in their post:
The symbolic camera controls – which would be easy to circumvent for a dedicated cheater – seemed sufficient to curb almost all cheating in a way that threats or impotent references to “fair-play committees” were failing to.
when you add actual barriers to cheating, even if they’re circumventable, cheating rates drop a lot, especially at the top level.
Of the factors you mention, I’m not sure how FIDE’s willingness to ban compares to Go organizations such as IGF or EGF. Plausible the unified nature might make a difference, but I suspect FIDE’s eagerness to strip titles is not any higher than the go equivalents. My guess is the other factors probably do little if anything: Magnus insinuating Hans Niemann was cheating (or Hikaru’s more direct accusations) probably had little effect in comparison, and Kramnik’s accusations probably made the cheating problem worse if anything.
If you’re talking about OTB chess, then those tournaments have crazy amounts of security (some would say security theater) to prevent cheating: everyone has to leave their phone outside, the players are scanned with various tools, streams are on a long delay, and so forth.
(And like in Ashe’s post, when people are caught cheating in chess, their justification is normally “I just referenced Stockfish occasionally” or “I just used it to suggest moves, I was playing”, and so forth.)
Properly done, the methodology should find that sufficiently overtrained low parameter models ~= distilled low parameter models, since there isn’t more capacity to memorize. But yeah, that would be another good sanity check to run.
Wait, why are distilled models better than just overtraining the small model again? My guess is it’s mainly because SFT >> RL for efficiency, and cloning good CoTs is easier than sampling them via random exploration.
Really good piece, thanks for writing it. History of X posts like this one are unfortunately rare, and I’m glad you’re helping to fix this. The story you tell seems quite similar to what’s been happening in chess as well (including players memorizing long sequences of computer moves and then immediately floundering when out of prep, though it seems the case for chess play improving is stronger than for go, perhaps?).
I’ve seen a lot of the same “not getting it” phenomenon you described while interacting with much younger people who did coding with weaker coding assistants (eg late 2024 era Cursor agents). People learned to rely on Sonnet 3.7 to generate code; once they ran into bugs that Sonnet couldn’t fix (often because of poor decisions made by Sonnet a few hours ago), they were stuck.
I see the same issue these days with ML research and Claude Opus/GPT-5.5: the models allow people to think they’ve thoroughly investigated the hypotheses under consideration without once looking at the data or code base with their own two eyes. Predictably, this leads to a lot of slop going through.
The main similarity between these coding examples and your go/math stories is that there’s a feeling of flinching away, of denial, of not wanting to recognize one’s own lack of understanding. Learning requires doing things that are challenging and noticing where you don’t understand. Any CS novice is far below the level of even 2024-era coding agents, so any suitable challenge will require writing code much less efficiently for a possibly long period of time. LLMs also are notoriously good at generating bullshit that looks legitimate, and sycophantically praising users for shallow understanding, which means noticing confusion is harder as well.
The main disanalogy is that these coding failures happen because the AI models weren’t good enough to hand off full control to, rather than an exogenous removal of the go engine or change-of-domain that invalidates heuristics. Currently, there’s a practical reason to understand your codebase, at least for complicated research code. As AI gets better, someone who can only vibe code will catch up to someone who understands their code base on a deep level, for larger and larger code bases. (Though, the situation seems more analogous for the case of go prep in professional games?)
A second, but perhaps more important disanalogy is that you can get the AI to explain things to you, and help you, if you remain sufficiently vigilant and skilled at noticing your confusion. Go and Chess engines cannot explain their reasoning in English, and interpretability is incredibly far from extracting useful insights. But noticing confusion often requires actually manually inspecting your data/code, doing the math yourself by hand (perhaps heuristically), or carefully scrutinizing research outputs, which will slow you down. And as often as not, the confusion will result from your misunderstanding or errors, as opposed to mistakes the model has made, which is understandably frustrating.
A question I have is, have the styles of memorized computer moves in the early game changed over time, as engines got better? In chess, this has arguably happened; weaker engines preferred conservative positions with equal material and simple strategies, while to today’s stronger engines, almost any opening is a draw. Prep has become less about finding an objectively good line than finding a line where the drawing line for black is very hard to calculate (eg if it requires dynamic aggressive counterplay that humans have difficulty calculating on the spot), or (on the other side) finding a slightly suboptimal move where black is disadvantaged according to an engine but which takes white outside of their prep.
Perhaps a more important question is, do you plan on writing more history of X posts?
Yes! I was familiar with PDLT as well, and I do think it’s a similar-in-spirit approach to MFT (if not a continuation of the signal-propagation MFT work). Thanks for the pointer.
Yeah, good question. I think the word “data-dependent” has different connotations (even if it is standard terminology).
Using the sketch definition
With high probability over possible training sets S, for all h in the hypothesis class, we have |expected test error of hypothesis h − empirical error of h on S| ≤ (Some bound involving the size of the training data and high level properties of h).[2]
You’re right that properties of h are, in general, different from properties of the data. The “data-dependent” part enters this inequality when the right hand side depends on properties of the learned hypothesis.
(Of course, a sufficiently nitpicky person might argue that the data-dependent bounds are uniform-convergence bounds over an implicit, S-indexed sub-class — “all h’ with ‖W’‖_spec ≤ ‖W(S)‖_spec”. But given this sub-class is S-indexed, I think it’s still fair to call the bound data-dependent.)
I think this is a reasonable confusion, and I’ll expand the footnote to clarify.
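For concreteness, here is one way the sketch definition and the nitpick above could be written in symbols. This is only a sketch under assumed notation that does not appear in the original discussion: R(h) for expected test error, \hat{R}_S(h) for empirical error on S, n for the training set size, δ for the failure probability, ε for the unspecified bound, and C(h) for the "high level properties of h", which I take to be the spectral norm of the weights purely for illustration.

```latex
% Sketch only: R, \hat{R}_S, \varepsilon, n, \delta, C, and \mathcal{H}_S are assumed notation.
% Uniform-convergence form of the sketch definition:
\Pr_{S \sim \mathcal{D}^n}\!\left[\, \forall h \in \mathcal{H}:\;
  \bigl| R(h) - \hat{R}_S(h) \bigr| \le \varepsilon\bigl(n, \delta, C(h)\bigr) \right] \ge 1 - \delta

% Data-dependent reading: evaluate the right-hand side at the learned hypothesis h_S with weights W(S)
% (assuming C(h) is the spectral norm of the weights):
\bigl| R(h_S) - \hat{R}_S(h_S) \bigr| \le \varepsilon\bigl(n, \delta, \lVert W(S) \rVert_{\mathrm{spec}}\bigr)

% The implicit S-indexed sub-class from the nitpick:
\mathcal{H}_S = \bigl\{\, h' \in \mathcal{H} : \lVert W' \rVert_{\mathrm{spec}} \le \lVert W(S) \rVert_{\mathrm{spec}} \,\bigr\}
```

The second line is just the first instantiated at h = h_S; the right hand side now varies with S through W(S), which is the sense of "data-dependent" I have in mind.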
Which card(s) must you turn over in order to test that if a card shows an even number on one face, then its opposite face is blue?
Er, maybe she edited in later, but this was in GradientDissenter’s wording too, no?
For future reference, after speaking more with Dmitry + reading more of the papers linked in the Simon et al “Scientific Theory of Deep Learning” paper, I’ve become (slightly) more positive on deep learning theory, at least the mean-field line of work he mentioned in his comments.
I wrote up a bit of the history of that line of work, and why I’ve become more optimistic on deep learning theory in the past few days: https://www.lesswrong.com/posts/6SRq7mZ97Dwuavwb6/maybe-i-was-too-harsh-on-deep-learning-theory-three-days-ago
Wild. It’s sad if this is the case.