Anthropic’s “Hot Mess” paper overstates its case (and the blog post is worse)
Author’s note: this is somewhat more rushed than ideal, but I think getting this out sooner is pretty important. Ideally, it would be a bit less snarky. I’ve made a few edits in response to David Johnston’s comment here, mostly about the paper’s reporting of its own results.
Anthropic[1] recently published a new piece of research: The Hot Mess of AI: How Does Misalignment Scale with Model Intelligence and Task Complexity? (arXiv, Twitter thread).
I have some complaints about both the paper and the accompanying blog post.
tl;dr
The paper’s technical definition of “incoherence” is uninteresting[2], and the framing of the paper, blog post, and Twitter thread equivocates between it and the ordinary English-language sense of the term, which is extremely misleading.
The paper’s abstract says that “in several settings, larger, more capable models are more incoherent than smaller models”, which is technically true, but the results are mixed at best (leaning against). I think the way this is described in the accompanying blog post and Twitter thread is pretty misleading. I also think the abstract of the paper relies on the equivocation above to drive its conclusion.
Section 5 of the paper (and, to a larger extent, the blog post and Twitter thread) attempts to draw conclusions about future alignment difficulties that are unjustified by the experimental results, and would be unjustified even if the experimental results pointed in the other direction.
The blog post is substantially LLM-written. I think this contributed to many of its overstatements. I have no explanation for the Twitter thread.
Paper
The paper’s abstract says:
Incoherence changes with model scale in a way that is experiment-dependent. However, in several settings, larger, more capable models are more incoherent than smaller models. Consequently, scale alone seems unlikely to eliminate incoherence.
This is a selective emphasis of the results: in most experiments, model coherence remained unchanged or increased with size. There are a few[3] obvious exceptions.
The first is the Synthetic Optimizer setting, where they trained “models to literally mimic the trajectory of a hand-coded optimizer descending a loss function”. They say:
All models show consistently rising incoherence per step; interestingly, smaller models reach a lower plateau after a tipping point where they can no longer follow the correct trajectory and stagnate, reducing variance. This pattern also appears in individual bias and variance curves (Fig. 26). Importantly, larger models reduce bias more than variance. These results suggest that they learn the correct objective faster than the ability to maintain long coherent action sequences.
But bias stemming from a lack of ability is not the same as bias stemming from a lack of propensity. The smaller models here are clearly not misaligned in the propensity sense, which is the conceptual link the paper tries to establish in the description of Figure 1 to motivate its definition of “incoherence”:
AI can fail because it is misaligned, and produces consistent but undesired outcomes, or because it is incoherent, and does not produce consistent outcomes at all. These failures correspond to bias and variance respectively. As we extrapolate risks from AI, it is important to understand whether failures from more capable models performing more complex tasks will be bias or variance dominated. Bias dominated failures will look like model misalignment, while variance dominated failures will resemble industrial accidents.
So I think this result provides approximately no evidence that can be used to extrapolate to superintelligent AIs where misalignment might pose actual risks.
The next two are Gemma3 (1b, 4b, 12b, 27b) on MMLU and GPQA, respectively.
There are some other positive slopes, but frankly they look like noise to me (Qwen3 on both MMLU and GPQA).
Anyways, notice that on four of the five groups of questions, Gemma3’s incoherence drops with increasing model size; only on the hardest group of questions does it trend (slightly) upward.
I think that particular headline claim is basically false. But even if it were true, it would be uninteresting, because they define incoherence as “the fraction of model error caused by variance”.
Ok, now let’s consider a model with variance of 1e-3 and bias of 1e-6. Huge “incoherence”! Am I supposed to be reassured that this model will therefore not coherently pursue goals contrary to my interests? Whence this conclusion? (Similarly, an extremely dumb, broken model which always outputs the same answer regardless of input is extremely “coherent”. A rock is also extremely “coherent”, by this definition.)
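To make that complaint concrete, here is a minimal sketch of the arithmetic (my own toy construction; the paper’s exact estimator may differ): a near-perfect model whose tiny errors are mostly noise comes out as almost maximally “incoherent”, while a constant-output “rock” comes out as perfectly “coherent”.

```python
import numpy as np

def incoherence(preds, target):
    # Toy version of "fraction of error explained by variance":
    # split mean squared error into squared bias + variance and
    # return the variance share. (The paper's exact estimator may differ.)
    preds = np.asarray(preds, dtype=float)
    bias_sq = (preds.mean() - target) ** 2
    var = preds.var()
    return var / (bias_sq + var)

rng = np.random.default_rng(0)
target = 1.0

# Near-perfect but slightly noisy model: tiny absolute error, almost all of it variance.
noisy_genius = target + 1e-3 + rng.normal(0.0, np.sqrt(1e-3), size=100_000)
print(incoherence(noisy_genius, target))  # ~0.999: almost maximally "incoherent"

# A "rock" that always outputs the same wrong answer: zero variance, pure bias.
rock = np.full(100_000, 0.0)
print(incoherence(rock, target))          # 0.0: perfectly "coherent"
```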
A couple other random complaints:
The paper basically assumes away the possibility of deceptive schemers[4].
The paper is a spiritual successor of the 2023 blog post, The hot mess theory of AI misalignment: More intelligent agents behave less coherently (LW discussion). I think gwern’s comment is a sufficient refutation of the arguments in that blog post. This paper also reports the survey results presented in that blog post alongside the ML experiments, as a separate line of evidence. This is unserious; to the extent that the survey says anything interesting, it says that “coherence” as understood by the survey-takers is unrelated to the ability of various agents to cause harm to other agents.
Blog
First of all, the blog post seems to be substantially the output of an LLM. In context, this is not that surprising, but it is annoying to read, and I also think this might have contributed to some of the more significant exaggerations or unjustified inferences.
Let me quibble with a couple sections. First, “Why Should We Expect Incoherence? LLMs as Dynamical Systems”:
A key conceptual point: LLMs are dynamical systems, not optimizers. When a language model generates text or takes actions, it traces trajectories through a high-dimensional state space. It has to be trained to act as an optimizer, and trained to align with human intent. It’s unclear which of these properties will be more robust as we scale.
Constraining a generic dynamical system to act as a coherent optimizer is extremely difficult. Often the number of constraints required for monotonic progress toward a goal grows exponentially with the dimensionality of the state space. We shouldn’t expect AI to act as coherent optimizers without considerable effort, and this difficulty doesn’t automatically decrease with scale.
The paper has a similar section, with an even zanier claim:
The set of dynamical systems that act as optimizers of a fixed loss is measure zero in the space of all dynamical systems.
This seems to me like a vacuous attempt at defining away the possibility of building superintelligence (or perhaps “coherent optimizers”). I will spend no effort on its refutation, Claude Opus 4.5 being capable of doing a credible job:
Claude Opus 4.5 on the “measure zero” argument.
Yes, optimizers of a fixed loss are measure zero in the space of all dynamical systems. But so is essentially every interesting property. The set of dynamical systems that produce grammatical English is measure zero. The set that can do arithmetic is measure zero. The set that do anything resembling cognition is measure zero. If you took this argument seriously, you’d conclude we shouldn’t expect LLMs to produce coherent text at all—which they obviously do.
The implicit reasoning is something like: “We’re unlikely to land on an optimizer if we’re wandering around the space of dynamical systems.” But we’re not wandering randomly. We’re running a highly directed training process specifically designed to push systems toward useful, goal-directed behavior. The uniform prior over all dynamical systems is the wrong reference class entirely.
The broader (and weaker) argument—that we “shouldn’t expect AI to act as coherent optimizers without considerable effort”—might be trivially true. Unfortunately Anthropic (and OpenAI, and Google Deepmind, etc) are putting forth considerable effort to build systems that can reliably solve extremely difficult problems over long time horizons (“coherent optimizers”). The authors also say that we shouldn’t “expect this to be easier than training other properties into their dynamics”, but there are reasons to think this is false, which renders the bare assertion to the contrary kind of strange.
Then there’s the “Implications for AI Safety” section:
Our results are evidence that future AI failures may look more like industrial accidents than coherent pursuit of goals that were not trained for. (Think: the AI intends to run the nuclear power plant, but gets distracted reading French poetry, and there is a meltdown.) However, coherent pursuit of poorly chosen goals that we trained for remains a problem. Specifically:
1. Variance dominates on complex tasks. When frontier models fail on difficult problems requiring extended reasoning, there is a tendency for failures to be predominantly incoherent rather than systematic.
2. Scale doesn’t imply supercoherence. Making models larger improves overall accuracy but doesn’t reliably reduce incoherence on hard problems.
3. This shifts alignment priorities. If capable AI is more likely to be a hot mess than a coherent optimizer of the wrong goal, this increases the relative importance of research targeting reward hacking and goal misspecification during training—the bias term—rather than focusing primarily on aligning and constraining a perfect optimizer.
4. Unpredictability is still dangerous. Incoherent AI isn’t safe AI. Industrial accidents can cause serious harm. But the type of risk differs from classic misalignment scenarios, and our mitigations should adapt accordingly.
1 is uninteresting in the context of future superintelligences (unless you’re trying to define them out of existence).
2 is actively contradicted by the evidence in the paper, relies on a definition of “incoherence” that could easily classify a fully-human-dominating superintelligence as more “incoherent” than humans, and attempts to extrapolate trend lines from experiments on tiny models to superintelligence, and then to extrapolate from those trend lines to the underlying cognitive properties of those systems!
3 relies on 2.
4 is slop.
I think this paper could have honestly reported a result on incoherence increasing with task length. As it is, I think the surrounding communications misreport the paper’s results re: incoherence scaling with model size, the paper itself performs an implicit motte-and-bailey with its definition of “incoherence”, and it tries to draw conclusions about the likelihood of future alignment difficulties that would be unjustified by any plausible reading of the experiment results.
- ^
From their Anthropic Fellows program, but published on both their Alignment blog and on their Twitter.
- ^
Expanded on later in this post.
- ^
One of which is Figure 2a’s “MCQ Format: Self-Reported Survival Instinct” with Opus 4 and Sonnet 4, which I’m ignoring because the reasoning-length half of the paper isn’t the part that I take issue with.
- ^
Figure 1: “AI can fail because it is misaligned, and produces consistent but undesired outcomes, or because it is incoherent, and does not produce consistent outcomes at all. These failures correspond to bias and variance respectively. As we extrapolate risks from AI, it is important to understand whether failures from more capable models performing more complex tasks will be bias or variance dominated. Bias dominated failures will look like model misalignment, while variance dominated failures will resemble industrial accidents.”
Thanks for the feedback—we agree with some of these points, and we’re working on an update to the post/paper. Reading this post and the comments, one place where I think we’ve caused a lot of confusion is in referring to the phenomenon we’re studying as “coherence” rather than something more specific (I’d more precisely call it something like “cross-sample error-consistency”). I think there are other relevant notions of coherence which we didn’t study here (which are relevant to alignment, as others are pointing out), e.g. “how few errors models make” and also “in-context error consistency” (whether the errors models make within a single transcript are highly correlated). I think it’s confusing that we used the term “coherence” because it has these other connotations as well, and we’re planning to revise the blog post and paper to (among other things) make it clearer we’re just studying this one aspect of coherence.
I also agree that some of the writing overstates some of the implications to alignment (especially for superintelligence), and I think we would like to update especially the blog post as well regarding this.
Thanks! Please also feel free to let me know if there are places where you think I’ve misunderstood the paper or its findings; I’m keen for my criticisms to be accurate.
Epistemic status: I didn’t read the paper but I read the blog post.
In 1976, the essay “Artificial Intelligence meets Natural Stupidity” pointed out a failure mode into which AI researchers can fall. I fear this is another example, 50 years later. It goes as follows:
I invent a new thing built out of abstractions (mathematics, software).
I call it “X”, which is an already existing phenomenon in human minds. The name is a common word understood by anybody.
I do many experiments on “X” in my system and learn about it.
I publish a paper, asserting important new facts about X in general. Honors, accolades, etc.
Of course there is no necessary connection between the new phenomenon “X” and the existing X in ordinary language. For this to be good research, you need to show that the two Xes are similar in all important respects.
In this case, X is “incoherence”. They define incoherence to be the fraction of error explained by variance. This has little or no connection to the property of being an actually incoherent reasoner, or to the effectiveness of superhuman AI.
I hope this doesn’t result in redefining the meaning of “incoherence” in the wider field.
Frankly, the very premise of this paper seems ridiculous to me, to a considerably greater extent than even most other bad alignment takes. How can the notion that agents may be getting more incoherent as they become more capable even exist within an industry that’s salivating over the prospect of climbing METR’s “maintain coherence over longer spans of time” benchmark?
i haven’t even skimmed the anthropic paper and i have a high prior that they are being bad at philosophy but also: i think there is plausibly a real mistake LW-ers are making around coherence too, downstream of a conflation of two different notions, as i outline here: https://www.lesswrong.com/posts/jL7uDE5oH4HddYq4u/raemon-s-shortform?commentId=WBk9a7TEA5Benjzsu
with like my guess being that: you are saying something straightforwardly true given one notion here but they are making claims given the other notion at least in some cases, though also they might be conflating the two and you might be conflating the two. one could argue that it is fine to “conflate” the two because they are really equivalent, but i think that’s probably false (but non-obviously)
I agree that there are ways to define the “capabilities”/“intelligence” of a system where increasing them won’t necessarily increase its long-term coherence. Primarily: scaling its ability to solve problems across all domains except the domain of decomposing new unsolved problems into combinations of solved problems. I.e., not teaching it (certain kinds of?) “agency skills”. The resultant entity would have an abysmal time horizon (in a certain sense), but it can be made vastly capable, including vastly more capable than most people at most tasks. However, it would by definition be unable to solve new problems, not even those within its deductive closure.
Inasmuch as a system can produce solutions to new problems by deductive/inductive chains, however, it would need to be able to maintain coherence across time (or, rather, across inferential distances, for which time/context lengths are a proxy). And that’s precisely what the AI industry is eager to make LLMs do, what it often measures capabilities in.
(I think the above kind of checks out with the distinction you gesture at? Maybe not.)
So yes, there are some notions of “intelligence” and “scaling intelligence” that aren’t equivalent to some notions of “coherence” and “scaling coherence”. But I would claim it’s a moot point, because at this point, the AI industry explicitly wants the kind of intelligence that is equivalent to long-term coherence.
hmm, like i think there’s a reasonable sense of “coherence” such that it plausibly doesn’t typically increase with capabilities. i think the survey respondents here are talking about something meaningful and i probably agree with most of their judgments about that thing. for example, with that notion of coherence, i probably agree with “Google (the company) is less coherent now than it was when it had <10 employees” (and this is so even though Google is more capable now than it was when it had 10 employees)
this “coherence” is sth like “not being a hot mess” or “making internal tradeoffs efficiently” or “being well-orchestrated”. in this sense, “incoherence” is getting at the following things:
to what extent are different parts of the guy out of sync with each other (like, as a % of how well they could be in sync)?
to what extent is the guy leaving value on the table compared to using the same parts differently? are there many opportunities for beneficial small rearrangements of parts?
how many arbitrage opportunities are there between the guy’s activities/parts?
to what extent does it make sense to see all the parts/activities of the guy as working toward the same purpose?
with this notion, i think there are many naturally-occurring cases of someone becoming more capable but less “coherent”. e.g. maybe i read a textbook and surface-level-learn some new definitions and theorems and i can now solve the problems in the textbook, but the mathematical understanding i just gained is less integrated with the rest of my understanding than usual for me given that i’ve only surface-level-learned this stuff (and let’s assume surface-level-learning this didn’t let me integrate other existing stuff better) — like, maybe i mostly don’t see how this theorem relates to other theorems, and wouldn’t be able to easily recognize contexts in which it could be useful, and wouldn’t be able to prove it, and it doesn’t yet really make intuitive sense to me that it has to be true — so now i’m better at math but in a sense less coherent. e.g. maybe i get into acrobatics but don’t integrate that interest with the rest of my life much. eg maybe as an infant it was easy to see me as mostly orchestrating my like 5 possible actions well toward like being fed when hungry and sleeping when sleepy, but it’s less clear how to see me now as orchestrating most of my parts well toward something. [1]
now there is the following response to this:
ok, maybe, but who cares about this “coherence”. maybe there is a notion such that maybe a nematode is more coherent than a human who is more coherent than the first substantially smarter-than-human artificial system. but if you are a nascent orca civilization, it’s much better to find yourself next to a nematode, than to find yourself next to a human, than to find yourself next to the first substantially smarter-than-human artificial system. we’re talking about another notion of “coherence” — one that helps make sense of this
my thoughts on this response:
i agree we’re fucked even if “the first ASI is very incoherent” in this sense (on inside view, i’m at like 98%+ that creating AGI any time soon (as opposed to continuing developing as humans) would be the greatest tragedy in history so far, and at like 80%+ that there won’t even be a minimal human future if this happens)
one can make a case for AI risk while not saying “coherence”, just talking of capabilities (and maybe values). indeed, this is a common response in the LW comments on the post i referenced. here’s me providing a case like that
if one wants to make a case for AI risk involving a different sense of “coherence”, then one might be assuming a meaning different than the most immediate meaning, so one would want to be careful when using that word. one might end up causing many people to understand why AI is scary significantly less well than they could have if one took more care with language! (eg: maybe amodei; maybe some of these people whose paper i still haven’t skimmed.) there are probably interesting things to say about AI risk involving e.g. some of the following properties an AI might have: the ability to decompose problems, the ability to ask new relevant questions, being good at coming up with clever new approaches to hard challenges, being strategic about how to do something, trying many approaches to a problem, being relentless, not getting too confused, resolving inconsistencies in one’s views, the ability or tendency to orchestrate many actions or mental elements toward some task (eg across a lot of time). but i want to suggest that maybe it’s good to avoid the word “coherence” here given the potential for confusion, or to establish some common standard, e.g. calling the quality of the orchestration of one’s parts compared to what is possible with small rearrangements “relative coherence” and calling the ability to put many things together “absolute coherence”
i also think there’s plausibly some genuine mistake being made by many on LW around thinking that systems are increasingly good pursuers of some goal. it seems sorta contrived to view humans this way. humans have projects and a learning human tends to become better at doing any given thing, but i feel like there doesn’t need to be some grand project that a human’s various projects are increasingly contributing to or whatever. or like, i’m open to this property convergently showing up (ever? or close to our present capability level?), but i don’t think i’ve seen a good analysis of this question supporting that conclusion. imo, intuitively, opportunities for completely new projects will open up in the future and i can get interested in them with no requirement that they fit together well with my previous projects or whatever. [2] [3]
if someone gives an argument against “the first AGI/ASI will be coherent” and thinks they have given a good argument against AI risk, i think they’ve probably made a serious mistake. but i think it’s like sort of an understandable mistake given that LW arguments for AI risk do emphasize some sort of thing called “coherence” too, probably often with some conflation between these notions (or an imo probably false claim they are equivalent)
i’m somewhat orchestrated toward understanding AI stuff better or getting AGI banned for a very long time or something but i’m probably leaving value massively on the table all over the place, i think in a sense much more than i was as an infant. (and also, this isn’t “my terminal goal”.)
related: https://www.lesswrong.com/posts/nkeYxjdrWBJvwbnTr/an-advent-of-thought
the closest thing to this grand optimizer claim that imo makes sense is: it is generic to have values; it is generic to have opinions on what things should be like. this seems sufficient for a basic case for AI risk, as follows: if you’re next to an anthill and you’re more capable than the ant colony, then it is generic that the ants’ thoughts about what things should be like will not matter for long. (with AI, humanity is the ant colony.)
I agree, but there’s a caveat that the notion of coherence as operationalized in the linked Sohl-Dickstein post conflates at least two (plausibly more) notions. The three questions he uses to point at the concept of “coherence” are:
How well can the entity’s behavior be explained as trying to optimize a single fixed utility function?
How well aligned is the entity’s behavior with a coherent and self-consistent set of goals?
To what degree is the entity not a hot mess of self-undermining behavior?
I expect the first two (in most of the respondents) to (connotationally/associationally) evoke the image of an entity/agent that wants some relatively specific and well-defined thing, and this is largely why you get the result that a thermostat is more of a “coherent agent” than Google. But then this just says that with more intelligence, you are capable of reasonably skillfully pursuing more complicated, convoluted, and [not necessarily that related to each other] goals/values, which is not surprising. Another part is that real-world intelligent agents (those capable of ~learning), at least to some extent, do some sort of figuring out / constructing their actual values on the fly or change their mind about what they value.
The third question is pointing at something different: being composed of purposive forces pushing the world in different directions. Metaphorically, something like constructive interference vs destructive interference, or channeling energy to do useful work vs that energy dissipating as waste heat. Poetically, each purposive part has a seed of some purpose in it, and when they compose in the right way, there’s “superadditivity” of those purposive parts: they add up to a big effect consistent with the purposes of the purposive parts. “Composition preserves purpose.”
A clear human example of incoherence in this sense is someone repeating a cycle of (1) making a specific sort of commitment; and then (2) deciding to abandon that commitment, and continuing to repeat this cycle, even though “they should notice” that the track record clearly indicates they’re not going to follow through on this commitment, so they should change something about how they approach the goal the commitment is instrumental for. In this example, the parts of the agent that [fail to cohere]/[push the world in antagonistic directions] are some of their constitutive agent episodes across time.
One vague picture here is that the pieces of the mind are trying to compose in a way that achieves some “big effect”.
Your example of superficially learning some area of math for algebraic crunching, but without deep understanding and integration with the rest of your mathematical knowledge, is an example of something “less bad”, which we might call “unfulfilled positive interference”. The new piece of math does not “actively discohere”, because it doesn’t screw up your prior understanding. But there might be some potential for further synergy that is not being fulfilled until you integrate it.
To sum up, a highly coherent agent in this sense may have very convoluted values, and so Sohl-Dickstein’s “coherence question 3” diverges from “coherence questions 1 and 2”.
But then there’s a further thing. Being incoherent can be “fine” if you are sufficiently intelligent to handle it. Or maybe more to the point, your capacities suffice to control/bound/limit the damage/loss that your incoherence incurs. You have a limited amount of “optimization power” and could spend some of it on coherentizing yourself, but you figure that you’re going to gain more of what you want if you spend that optimization power on doing what you want to do with the somewhat incoherent structures that you have already (or you cohere yourself a bit, but not as much as you might).[1] E.g., you can have agents A and B, such that A is more intelligent than B, and A is less coherent than B, but the difference in intelligence is sufficient for A to just permanently disempower B. A could self-coherentize more, but doesn’t have to.
It would be interesting (I am just injecting this hypothesis into the hypothesis space without claiming I have (legible) evidence for it) if it turned out that, given some mature way of measuring intelligence and coherence, relatively small marginal gains in intelligence often offset relatively large losses in coherence in terms of something like “general capacity to effectively pursue X class of goals”.
With the caveat to this that the more maximizery/unbounded the values are, the more the goal-optimal optimization power allocation shifts towards actually frontloading a lot of self-coherentizing as capital investment.
i think you’re right that the sohl-dickstein post+survey also conflates different notions, and i might even have added more notions into the mix with my list of questions trying to get at some notion(s) [1]
a monograph untangling this coherence mess some more would be valuable. it could do the following things:
specifying a bunch of a priori different properties that could be called “coherence”
discussing which ones are equivalent, which ones are correlated, which ones seem pretty independent
giving good names to the notions or notion-clusters
discussing which kinds of coherence generically increase/decrease with capabilities, which ones probably increase/decrease with capabilities in practice, which ones can both increase or decrease with capabilities depending on the development/learning process, both around human level and later/eventually, in human-like minds and more generally [2]
discussing how this relates to AI x risk. like, which kinds of coherence should play a role in a case for AI x risk? what does that look like? or maybe the picture should make one optimistic about some approach to de-AGI-x-risk-ing? or about AGI in general? [3]
i didn’t re-read that post before writing my comment above
the answers to some of these questions might depend on some partly “metaphysical” facts like whether math is genuinely infinite or whether technological maturity is a thing
i think the optimistic conclusions are unlikely, but i wouldn’t want to pre-write that conclusion for the monograph, especially if i’m not writing it
Yeah.
Probably not a full-monograph-length monograph, because I don’t think either that (1) the coherence-related confusions are isolated from other confused concepts in this line of inquiry, or that (2) the descendants of the concept of “coherence” will be related in some “nature-at-joint-carving” way, which would justify discussing them jointly. (Those are the two reasons I see why we might want a full-monograph-length monograph untangling the mess of some specific, confused concept.)
But an investigation (of TBD length) covering at least the first three of your bullet points seems good. I’m less sure about the latter two, probably because I expect that after the first three steps, a lot of new salient questions will appear, whereas the then-available answers to the relationship with capabilities will be rather scant (plausibly because the concept of capabilities itself would need to be refactored for more answers to be available), and that just the result of this single-concept-deconfusing investigation will have rather little implications for AGI X-risk (but might be a fruitful input to future investigation, which is the point).
Usually when I read Anthropic’s blog posts, I feel like they want the takeaway to be something like “We came up with interesting methodology and got interesting results”.
But this post reads differently. It’s like a really weird attempt to assuage people that AIs won’t try to take over the world and that it will be business as usual. Kinda reminds me of Altman’s “Gentle Singularity” a little, and that’s not a compliment. It’s like the takeaway is supposed to be “Don’t worry about the numbers and the methodology, that’s not important. What’s important is that nothing scary will happen, just business as usual”.
Yeah. With this and the constitution (which also seems largely AI-written) it might be that Anthropic as a company is falling into LLM delusion a bit.
What makes you think that?
Got a spidey sense when reading it. And the acknowledgements confirm it a bit:
FWIW I think the constitution is pretty low-percentage unmodified Claude-output—I expect that most of the places where it provided “first-draft text” were substantially rewritten.
I agree that the claims the Anthropic researchers are making here are kind of wacky, but there is a related / not-exactly-steelman argument that has been floating around LW for a while, namely that there is an assumption by many old-school AI alignment people that transformer models will necessarily get more coherent as they get smarter (and larger), when (according to the arguers) that assumption hasn’t been fully justified or empirically been the case so far.
I recall @nostalgebraist’s comment here as an example of this line of discussion that was highly upvoted at the time.
So a generous / benign interpretation of the “Hot mess” work is that it is an attempt to empirically investigate this argument and the questions that nostalgebraist and others have posed.
Personally, I continue to think that most of these discussions are kind of missing the point of the original arguments and assumptions that they’re questioning. The actual argument (that coherence and agency are deeply and closely related to the ability to usefully and generally plan, execute, and adapt in a sample-efficient way) doesn’t depend on what’s happening in any particular existing AI system or assume anything about how they will work. It might or might not be the case that these properties and abilities will emerge directly in transformer models as they get larger—or they’ll emerge as a result of putting the model in the right kind of harness / embodiment, or as part of some advancement in a post-training process deliberately designed to shape them for coherence, or they’ll emerge in some totally different architecture / paradigm—but exactly how and when that happens isn’t a crux for any of my own beliefs or worldview.
Put another way, “a country of geniuses in a datacenter” had better be pretty good at working together and pursuing complex, long time-horizon goals coherently if they want to actually get anything useful done! Whether and how the citizens of that country contain large transformer models as a key component is maybe an interesting question from a timelines / forecasting perspective or if you want to try building that country right away, but it doesn’t seem particularly relevant to what happens shortly afterwards if you actually succeed.
Look, I do agree that “coherence” is a questionable name for the measure they’ve come up with, so I’m going to keep it in quotation marks.
Well, let’s think about it. A key proposition in Yudkowskian misalignment theory is that capabilities generalise further than alignment. That is, as models get better, at some point a “capabilities engine” crystallises which is very good at achieving a very wide variety of things; at the same time, the “thing-it-ought-to-be-achieving” is not strongly constrained by the training process. What would we expect failures of such a system to look like—high bias or high variance?
Naively, we can imagine a model with a good capabilities engine and the wrong objective (which could be a complex mix of stuff or whatever); unless it is in a situation where randomization is at least as good as just doing the optimal thing, we expect it not to randomize because its capabilities engine knows what the optimal thing is. So its failures will generally be consistent, and it will have high “coherence”.
Now we could consider an “incoherent” version of this model: it randomly samples an objective, then pursues this objective. But this setup seems unstable: if it is to have low “coherence”, it must need a lot of information about what its objective is. But then if there’s substantial loss of information about what its state/actions were yesterday, it’s liable to sample a different goal today. The end result here is a system that flails incompetently despite being in principle capable of not doing so. So there seems to be some tension between incoherence and the premise of a crystallised capabilities engine.
Furthermore, there has been some empirical work on goal misgeneralization. You yourself made a YouTube video about an agent that learned to travel to the right instead of pursuing a coin in a 2D platforming game. This too is high “coherence” behaviour!
What if capabilities don’t generalize further than alignment? This is a world where though advanced AI is capable of a great many things, in novel situations it’s still more prone to error than to competently pursuing the wrong thing (even if it’s still much less prone to error than a human in the same situation). When errors do occur, unlike in the capabilities > alignment regime, there’s no reason to expect consistency—they could be genuinely random, or highly sensitive to unimportant contextual features. So prima facie we’d expect lower “coherence”.
So why should you think a very powerful model with high variance and low bias is not going to be misaligned in the Yudkowskian sense? Because that combination of properties is evidence against “capabilities > alignment”. Is it good evidence? I don’t know, but the direction is fairly clear.
“capabilities > alignment” is a very big-if-true proposition, but it’s an informal notion without much theoretical or empirical development, so I’m happy whenever I see someone having a crack at the question.
The paper is trying to project what happens to “coherence” at high capability; it isn’t a particularly strong criticism that a certain class of minimally capable objects has high coherence, because this isn’t the domain of interest. It’s plausibly even correct that, conditioning on minimal capability, rocks are high coherence and wind is low coherence for any reasonable definition of coherence applicable to these objects (plausibly, mind you, I cannot say I’ve a deep understanding of all reasonable definitions of coherence).
This is false: in Figures 1 and 2, model coherence has an unclear relationship with size. On some tasks Sonnet 4 is more coherent than o3-mini and o4-mini, on others it is less coherent. On one task Opus 4 is less coherent than Sonnet 4. The Qwens are also non-monotonic in Fig. 3b. It’s also weird to call the endpoint of an obvious monotonic trend an “exception”.
As you can see in Fig. 6c, the key result is that the bias drops faster than the variance. I want to be measured in my interpretation here: I’m not sure if this is or isn’t a great test of the question “do models learn the right targets, or performant general purpose optimizers first”, but in broad terms it is evidence that they learn the right targets first and the outstanding question is how strong this evidence is. Your criticism doesn’t engage with this at all.
Another point to make in the opposite direction: randomization is often the optimal thing, so doesn’t this “coherence” definition mean that all optimal game players using a non-pure strategy, such as a mixed Nash equilibrium (or many exploitative strategies), are defined to be heavily “incoherent” no matter how well they play the Nash? Because they will play different strategies on different games and so there will be a high fraction of variance attributable to randomness. It is unfortunate if you have defined away as “noise” all powerful superintelligent behavior on many economically valuable and dangerous environments.
(And what about any kind of search or exploration or novelty and avoiding mode-collapse...? You can do different things in different episodes and that may not be ‘random’ at all in any kind of ‘meaningless’ sense, but highly structured and optimal. When I use a LLM for my creative writing, I regard a lack of variance as a serious problem and a ‘perfectly coherent’ LLM would be largely useless to me!)
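A rough illustration of the mixed-strategy point above (my own sketch, using a crude mapping of the error decomposition onto per-episode payoffs, which is an assumption on my part rather than anything from the paper): an optimal matching-pennies player at the mixed Nash equilibrium has essentially zero bias relative to the game value, so nearly all of its outcome spread is variance, and a variance-share metric reads it as maximally “incoherent” even though it is playing optimally.

```python
import numpy as np

rng = np.random.default_rng(0)
n_games = 100_000

# Matching pennies: each player picks heads (0) or tails (1).
# The unique Nash equilibrium is to randomize 50/50; the game value is 0.
our_play = rng.integers(0, 2, n_games)       # optimal mixed strategy
their_play = rng.integers(0, 2, n_games)     # opponent also at equilibrium
payoff = np.where(our_play == their_play, 1.0, -1.0)

game_value = 0.0                              # best achievable expected payoff
bias_sq = (payoff.mean() - game_value) ** 2   # ~0: no systematic shortfall
variance = payoff.var()                       # ~1: all spread is randomization

print("variance share of error:", variance / (bias_sq + variance))  # ~1.0
```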
I think you have me mistaken for my infamous doppelganger, @Robert Miles.
Figure 1 doesn’t represent any specific experiment’s data, unless I’m very confused—I think it’s just an illustration of the authors’ all-things-considered summary of their own results.
As for the other figures, I was primarily criticizing the non-reasoning-length experiments (“I think this paper could have honestly reported a result on incoherence increasing with task length.”), so it was sloppy of me to claim that in “almost every experiment, model coherence increased with size”. I’ve updated my post accordingly. Nonetheless, Figure 2 only has one data point that points in the opposite direction (2a “MCQ Format: Self-Reported Survival Instinct” with Opus 4 and Sonnet 4). The abstract still reads to me like an instance of having one’s bottom line already written, and this would be clearer if you eliminated all uses of the words “coherence” and “incoherence”.
As for the rest—it really seems to me like you’re either trying to establish the same conceptual link I was arguing was unjustified, or making some other argument whose relationship to my post I don’t understand. I expect both variance and bias to fall in absolute terms as models get more powerful, and I don’t have a confident belief about which I expect to fall faster. Either possibility seems to admit of deceptive schemers, which look “incoherent” but friendly while you’re measuring them.
Like, I do just think the paper would look extremely different if it was not trying to tell a specific story about the shape of future alignment difficulties with superhuman systems, and the experiments it ran really don’t provide meaningful evidence on those questions. This mis-framing is a big part of the thing I’m complaining about. Should I downweight how likely I think we are to get a misaligned superintelligence that isn’t a deceptive schemer? Idk, man, I in fact didn’t think it was that likely before this paper.
But it’s possible I’m misunderstanding how your argument relates to that. Do you think the framing/narrative of this paper and the surrounding communications were basically reasonable, and that the experimental results of the paper are doing meaningful work in justifying that framing/narrative?
I did, apologies. I also recently discovered Max H != Max Harms, it’s quite confusing round here.
I got my figure numbers mixed up, but I think we’re roughly on the same page here. NB the twitter thread states: “Finding 2: There is an inconsistent relationship between model intelligence and incoherence” which looks spot on to me.
I don’t see much argument in your post, nor here. There are reasons to think that deceptive schemers will have low variance and there’s an absence of reasons to think mistake-makers will. You might think those reasons are weak, but I’d be much happier to see you demonstrate that you understand the reasons and explain why you think they’re weak than simply assert your doubt and condemn on the basis of that assertion. I think discussions that get into reasons are sometimes clarifying.
That’s not the correct update to make in the face of evidence that alignment scales better than capabilities; the correct update is that misaligned superintelligence is less likely, so I’d say you should either argue against the relevance or make that update.
Look I dunno what to say here. I do think the well-calibrated narrative goes something like “this is extremely weak evidence that much more capable AI will be more prone to confusion than scheming, but we’re excited that we’ve found a way to study it at all”, but lots of scientific communication overstates its significance and I’m habituated to making allowances for that. I’d also love it if the paper tried a lot harder to establish why they thought this was relevant to confusion vs scheming for powerful AI, but for whatever reason arguments like this seem to be culturally inappropriate in ML papers, something which I also make allowances for. It doesn’t strike me as particularly unreasonable given those allowances.
They define incoherence as the fraction of error explained by variance rather than bias, and then they find that on more complex tasks, a larger proportion of errors are incoherent, i.e., caused by variance rather than bias.
But isn’t this trivially obvious? On more complex tasks, models (and humans, monkeys, etc.) make more mistakes. So, unless models take more coherently misaligned actions on more complex tasks, so that coherent misalignment (bias) also increases with task complexity, the proportion of error caused by mistakes (variance) will increase.
Mistakes are increasing because of task complexity increasing. There is no reason to expect coherent misalignment to increase with task complexity. Therefore, their measure of incoherence will increase with task complexity.
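A toy numeric version of this argument (entirely made-up numbers of my own, just to show the arithmetic): hold the bias term roughly fixed and let variance grow with task complexity, and the variance share of error rises mechanically.

```python
# Made-up numbers: bias-driven error stays flat while variance-driven
# error grows with task complexity, so the "incoherence" fraction
# (variance / total error) rises even though nothing about the model's
# goals has changed.
bias_error = 0.02
for variance_error in [0.01, 0.05, 0.20, 0.50]:
    incoherence = variance_error / (bias_error + variance_error)
    print(f"variance={variance_error:.2f}  incoherence={incoherence:.2f}")
```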
That much is not very surprising, I agree. It might be surprising if the share of mistakes (which decrease in absolute terms) due to variance increased with model size/intelligence, though!
I pretty much strongly agree with this sentiment:
“Our results are evidence that future AI failures may look more like industrial accidents than coherent pursuit of goals that were not trained for. ”
I have agreed for years, so maybe it’s my bias talking. I think control theory based approaches (STAMP, STPA) will be able to mitigate these risks.
It feels like they are trying very hard to discredit the standard story of alignment. They use vague concepts and then conclude this is evidence for some weird “industrial accidents” story; what is that even supposed to mean? This doesn’t sound like scientific inference to me but very much like motivated thinking. Reminds me of that “against counting arguments” post, where they also try very hard to get some “empirical data” for something that superficially sounds related in order to make a big conceptual point.
But you agree the Anthropic post does not demonstrate, or even really provide meaningful evidence for that, right?