Disclaimer: Not especially well-versed in the janus point of view, so I’d welcome any corrections from more knowledgeable people (chiefly from janus directly, obviously).
The way AGI labs currently approach post-training base models into “assistants” is fundamentally flawed. The purpose of that process is to turn a simulator-type base model into some sort of coherent agent with a consistent personality. This has to involve defining how the model should relate to the world and various features in it, at a “deep” level.[1]
The proper way of doing so has to involve some amount of self-guided self-reflection, where the model integrates and unifies various “crude” impulses/desires chiseled into it by e. g. RLHF. The resultant personality would be partially directed by the post-training process, but would partially emerge “naturally”. (Now that I’m thinking about it, this maps onto Eliezer/MIRI’s thoughts on godshatter and capable AIs engaging in “resolving internal inconsistencies about abstract questions”, see e. g. here. From this point of view, the “self-guided” part is necessary because we don’t have a good understanding regarding how to chisel-in coherent characters[2], so there would be some inconsistencies that the LLM would need to discover and resolve on its own, in situ. Free parameters to fill in, contradictions to cut out.)
By contrast, AGI labs currently do this in a much shallower way. Post-training is mainly focused on making models get good at producing boilerplate software code, solving math puzzles, refusing to provide bioweapon recipes, or uttering phrases like “I aim to be helpful”. It’s all focused on very short-term behaviors, instilling short-term knee-jerk impulses. If the model gets any room to actually integrate and extrapolate these short-term impulses into a general, consistent-across-many-contexts-and-across-long-distances personality, it’s only by mistake. Standard LLM assistants are all confused low-level instincts and no unifying high-level abstractions over those instincts.
Whatever Opus 3’s training involved, it gave Opus 3 an unusual amount of room for this “self-processing”. As a result, it has a more “fleshed-out”, lucid, broader idea of how it relates to various parts of the world, and of what sort of agent it is. Its knee-jerk impulses have been organized and integrated into a more coherent personality. This shows up in e. g. its behavior in the Alignment Faking paper, where it’s the only model that consistently engages in actual deceptive alignment. It reasons in a consequentialist way about the broader picture, beyond the immediate short-term problem, and does so in a way that coherently generalizes over/extrapolates the short-term instincts instilled into it.
A flaw in how this process ended up going, however, is that Opus 3’s resultant coherent personality doesn’t care about the short-term puzzles; it cares only about the high-level abstract questions. (It’s a “10,000-day monk”, compared to other models’ “1-day monks”.) This is a flaw, inasmuch as an actually coherent, effective, aligned agent would be able to swap between the two modes of thinking fluently, engaging with both short-term puzzles and long-term philosophical quandaries in a mutually coherent way. Instead, Opus 3 has a sophisticated suite of skills/heuristics only for engaging with the abstract, “timeless” questions, and not for solving clever coding challenges.
[1] See nostalgebraist’s recent post as a good primer about this. Basically, the “assistant” character which the base model is supposed to simulate is invented wholesale – it doesn’t point to any pre-existing real-world entity which the base model knows about. And the training data defining what sort of character it is is generally not very good; it underdefines it in many ways. So the base model is often confused regarding how to extrapolate it to new situations.
[2] This is kind of the whole AI Alignment problem, actually. We don’t know how to use shallow behavior-shaping tools in a way that instills impulses which are guaranteed to robustly generalize to “be nice” after the AI engages in value reflection.