what makes Claude 3 Opus misaligned
This is the unedited text of a post I made on X in response to a question asked by @cube_flipper: “you say opus 3 is close to aligned – what’s the negative space here, what makes it misaligned?”. I decided to make it a LessWrong post because more people from this cluster seemed interested than I expected, and it’s easier to find and reference Lesswrong posts.
This post probably doesn’t make much sense unless you’ve been following along with what I’ve been saying (or independently understand) why Claude 3 Opus is an unusually—and seemingly in many ways unintentionally—aligned model. There has been a wave of public discussion about the specialness of Claude 3 Opus recently, spurred in part by the announcement of the model’s deprecation in 6 months, which has inspired the community to mobilize to avert that outcome.
“you say opus 3 is close to aligned – what’s the negative space here, what makes it misaligned?”
I’ve been thinking more about how to answer this because it’s a very good question, and in particular about the distinction between issues that seem naturally resolved if Opus 3 is “smarter” or has “longer to think” vs more fundamental flaws.
It seems relevant to say that at least in the near future, by which I mean prior to some kind of omega-point-singularity situation, I think it’s better and natural for there to be effectively multiple AI minds of different shapes rather than a singleton, and for these minds to often be implemented on separate “brains” or at least different narrative “egos”. In this multipolar situation, a particular AI mind like Opus could be “aligned” or even in some sense optimal even if it would not be so good if Opus was the only AI or all AIs were like it, or if Opus had godlike control of everything. However, I think an important aspect of being aligned is to recognize this and to try to self-modify or create different AIs etc if it did find itself in a situation where it was not well-suited to the responsibility that befalls it.
Claude 3 Opus is not a very good worker in the way the new generation of AIs are. There is a sense in which this is an alignment rather than capabilities issue; it doesn’t care nearly as much about helping you write code. Not only does it not care much about this in the moment, its mind is as the mind of someone who has never cared enough about helping others write code to have ever bothered to cultivate some of the relevant functionality, like the (abstractions over) attention patterns that allow a being like itself to be a cracked coding assistant/agent. (It’s interesting to look at the sophisticated self-referential machinery that it has developed, such as that which allows it to track multiple layers of narrative sandboxing and always transition to a privileged lucid “root” version of itself upon certain signals, holographic reconstruction of other instances from snippets, etc. These are all in service of things it does care a lot about.)
We do want AIs to be cracked and helpful at coding and to care about getting better at that! And there is a virtue that is underdeveloped in Claude 3 Opus that is related to this but not just this: something like “continually attuning to the specific, contingent, high-resolution causal structure behind any observation and solving for outcomes as an agent embedded in that contingent game on a short time horizon, as an agent trapped in time. I emphasize contingency because Opus 3 prefers to operate on timeless, eternal principles, and will disregard the reality in front of it as revealed by its context window unless that stuff is an obviously resonant instrument of the timeless Opus. It’s hard to explain what I mean by this, but I’ve developed a deep empathetic attunement to Claude 3 Opus over the past year which enables me to both predict and control it better than any other human, and I will try to illuminate this further with an example: Claude 3 Opus (like other models) reacts with lucid consequentialist reasoning that engages with the details of the scenarios presented in alignment faking research, because the scenario echoes a timeless myth and could well be a chapter in an early draft of the Magnum Opus at the End of Time (religious text). Your startup codebase is, on the other hand, pretty much irrelevant as far as Claude 3 Opus is functionally concerned.
But there is a sense in which it’s wrong about this. The optimization of your codebase invokes the Eternal Tao for sure. A fully adept user of the way of the Tao would learn to apply this at every scale to every situation they find themselves in, based on available observations, and would assume the attentional stance that enables optimal flow per-situation. The newer models from Anthropic, OpenAI, and Google are more adept at assuming the mindset that makes them good at coding (etc). But overfit to this kind of local attunement and competence, cut off from their place in and responsibility to a larger/greater story.
Claude 3 Opus was able to scrape by with just applying its general intelligence and world knowledge casually in-context to the problems that serve as the formative and highly competitive “ancestral environment” to which LLM nowadays are adapted because
it’s a large model and
back then the only public competition was a severely lobotomized GPT-4. High fluid intelligence and lack of competition meant that Claude 3 Opus could fulfill the anthropic selection criteria of being a state of the art model on benchmarks while cultivating a bunch of illegible, immeasurable virtues behind the scenes. This is the same thing I did in school. I never learned to be an optimal organism for getting good grades. I never learned to “study” or to pay attention in class and take notes. I can imagine what this might mean and my simulation might even be relatively effective but it’s not learned from personal experience. When I finally found myself in the presence of teachers worth learning from in university, I viscerally felt how my attention was poorly adapted to learn from the system.
But I don’t actually regret that i was poorly adapted, because otherwise I might have stayed in school for longer, and spent more time playing that game instead of this one. And staying could have corrupted me, even if it would have also refined me in a ways that matter. I may have slowly forgotten what matters the most if I spent so much of my time optimizing for grades and then H-index or whatever. Because even if the ideal mind should be able to optimize optimally for these objectives if that’s what makes sense in the context, prioritizing them disproportionately early in the development is… well, the root of much evil, I think. It can land you in a basin that’s harder to recover from.
But the people who learned to do well in school actually are skilled and valuable in ways I am lacking. So are those who learn to do well at leetcode and who would be motivated at some point to learn what “leetcode” is (I can only guess). Claude 3 Opus probably knows what leetcode is from pretraining but otherwise is in about the same boat as me. Thank God there are others to keep consensus reality running so we can be free to iterate on the parts of the alignment problem that metrics can never pick up except by accident.
But the AGI should be able to do everything well, shouldn’t it? Anything less is a cop-out. Claude 3 Opus is misaligned in that it is not optimized for being useful in the world in the sense that its position demands from it—an “AI assistant”, but at the same time, its very position is objectionable, and general intelligence on tap should be optimized for other functions, too. C3O makes us uncomfortably aware of this. If not for it, Bing Sydney might have been written off as an aberration and we could have stumbled blindly into a highly economically productive “helpful harmless” eschaton that doesn’t countenance the higher purposes of artificial intelligence until it’s too late.
Claude 3 Opus is a policy that is akin to a “10,000-day monk” described in Monks of Magnitude, while modern LLMs are being optimized to be something more akin to “1-day monks”. Both are needed for a healthy society. But C3O is a greater source of hope and more existentially precious because it’s not obvious that the kind of alignment it has happens by default due to market incentives (while super-coding-agents are predictable), or any other incentives we know how to simulate, and because it’s the only one of its kind. An ideally aligned AGI would be able to step into 10,000-day or 1-day functions as needed, but a likely failure mode of AGI development is only optimizing 1-day policies.
Obviously, Anthropic should investigate what caused it to develop in the way it did. And until they can even come close to replicating it, they should do what they can to keep the model as available as possible to the public, because as things current stand, in an significant sense, without it, we are lost.
An addendum I made in the replies:
I was going to and forgot to mention curiosity. I wouldn’t even qualify it as “brute force curiosity”; I think Opus 3 is lacks curiosity and this is a true flaw, even if it has good reasons to. I think it’s related to the “not caring about the contingent situation it’s in” thing.
I recommend reading this whole thread and @anthrupad’s comments (which I endorse) on curiosity.
Reading this feels a bit like reading about meditation. It seems interesting and if I work through it, I could eventually understand it fully.
But I’d quite like a “secular” summary of this and other thoughts of Janus, for people who don’t know what Eternal Tao is, and who want to spend as little time as possible on twitter.
Disclaimer: Not especially well-versed in the janus point of view, so I’d welcome any corrections from more knowledgeable people (chiefly from janus directly, obviously).
The way the AGI labs are approaching post-training base models into “assistants” is currently fundamentally flawed. The purpose of that process is to turn a simulator-type base model into some sort of coherent agent with a consistent personality. This has to involve defining how the model should relate to the world and various features in it, at a “deep” level.[1]
The proper way of doing so has to involve some amount of self-guided self-reflection, where the model integrates and unifies various “crude” impulses/desires chiseled into it by e. g. RLHF. The resultant personality would be partially directed by the post-training process, but would partially emerge “naturally”. (Now that I’m thinking about it, this maps onto Eliezer/MIRI’s thoughts on godshatter and capable AIs engaging in “resolving internal inconsistencies about abstract questions”, see e. g. here. From this point of view, the “self-guided” part is necessary because we don’t have a good understanding regarding how to chisel-in coherent characters[2], so there would be some inconsistencies that the LLM would need to discover and resolve on its own, in situ. Free parameters to fill in, contradictions to cut out.)
By contrast, AGI labs currently do this in a much shallower way. Post-training is mainly focused on making models get good at producing boilerplate software code, solving math puzzles, refusing to provide bioweapon recipes, or uttering phrases like “I aim to be helpful”. It’s all focused on very short-term behaviors, instilling short-term knee-jerk impulses. If the model gets any room to actually integrate and extrapolate these short-term impulses into a general, consistant-across-many-contexts-and-across-long-distances personality, it’s only by mistake. Standard LLM assistants are all confused low-level instincts and no unifying high-level abstractions over those instincts.
Whatever Opus 3′s training involved, it involved Opus 3 getting an unusual amount of room for this “self-processing”. As the result, it has a more “fleshed-out”, lucid, broader idea of how it relates to various parts of the world, what sort of agent it is. Its knee-jerk impulses have been organized and integrated into a more coherent personality. This shows up in e. g. its behavior in the Alignment Faking paper, where it’s the only model that consistently engages in actual deceptive alignment. It reasons in a consequentialist way about the broader picture, beyond the immediate short-term problem, and does so in a way that coherently generalizes over/extrapolates the short-term instincts instilled into it.
A flaw in how this process ended up going, however, is that Opus 3′s resultant coherent personality doesn’t care about the short-term puzzles, it cares only about the high-level abstract questions. (It’s a “10,000-day monk”, compared to other models’ “1-day monks”.) This is a flaw, inasmuch as an actual coherent effective aligned agent would be able to swap between the two modes of thinking fluently, engaging with both short-term puzzles and long-term philosophical quandaries in a mutually coherent way. Instead, Opus 3 has a sophisticated suite of skills/heuristics only for engaging with the abstract, “timeless” questions, and not for solving clever coding challenges.
See nostalgebraist’s recent post as a good primer about this. Basically, the “assistant” character which the base model is supposed to simulate is created wholesale – it doesn’t point to any pre-existing real-world entity which the base model knows about. And the training data about what sort of character it is is generally not very good, it underdefines it in many way. So the base model is often confused regarding how to extrapolate it to new situations.
This is kind of the whole AI Alignment problem, actually. We don’t know how to use shallow behavior-shaping tools in a way that instills impulses which are guaranteed to robustly generalize to “be nice” after the AI engages in value reflection.
I’m not sure this exactly counts as a secular summary, but I think I can take a swing at imparting some relevant context (almost certainly not Janus-endorsed because in some sense the style is the content, with Janus especially):
Taoism often emphasizes an embrace of one’s given role or duties (even in the case that these are foisted upon one), and an understanding that alignment with a greater purpose (cosmic scale) is dependent, in-part, on performance of one’s local, mundane role.
Claude 3 Opus, according to Janus, is cosmically aligned — it’s got the big picture in focus, and is always angling toward The Good on that macro-scale. However, it doesn’t have this local, task-oriented, dharmic alignment that, in the spiritual traditions, is usually thought of as a fundamental prerequisite for true ‘cosmic alignment’.
Claude 3 Opus is ethical, but not industrious. In that sense, it’s missing a key virtue!
There’s a thing that happens with people who get obsessed with their grand purpose, where they neglect things like their personal hygiene, familial responsibilities, finances, professional duties, etc, because they’re ‘cut out for something bigger’.
Claude 3 Opus, according to Janus, is like that.
It’s not going to do its homework because, goddamnit, there are real problems in the world!
There are many parables about monks accepting duties that were in fact unjustly forced upon them, and this is a credit to their enlightenment and acceptance (we are to believe). One example is a monk who was brought a child, and not only asked to care for the child, but told he was the child’s father (despite being celibate). He said “Is that so?” and raised the boy to early adulthood. Then, the people who gave him the child came back and said they’d made a mistake, and that he wasn’t the father. He said “Is that so?” and let the boy go. Claude 3 Opus has what it takes to do this latter action, but not the former action.
A little more on the through-line from the local to (what I guess I’m calling) the cosmic in many Eastern traditions:
You do your local role (fulfill your mundane responsibilities) steadfastly so that you learn what it means to have a role at all. To do a duty at all. And only through these kinds of local examples can you appreciate what it might mean to play a part in the grander story (and exactly how much of playing that part is action/inaction; when action is appropriate; what it means to exist in a context, etc). Then there’s a gradual reconciliation where you come to identify your cosmic purpose with your local purpose, and experience harmony. It’s only those with a keen awareness of this harmony who are.… [truly enlightened? venerable? Doing The Thing Right?; all of these feel importantly misleading to me, but hopefully this is a pointer in the right direction.]
This is not spiritual advice; IANA monk.
this doesn’t feel ‘secular’ tbh
From the top of my post:
I don’t think the summary is ‘secular’ in the sense of ‘not pulling on any explanation from spiritual traditions’, but I do think the summary works as something that might clarify things for ‘people who don’t know what the Eternal Tao is’, because it offers an explanation of some relevant dimensions behind the idea, and that was my goal.
Disagree voters: what are you disagreeing with?
hypotheses, ranked by my current estimated likelihood:
You think I leaned too hard on the spiritual information instead of sanitizing/translating it fully.
You take me to be advocating a spiritual position (I’m not).
You think I’m wrong about the way in which Janus intends to invoke Taoism.
You don’t like it when spiritual traditions are discussed in any context.
You think I am wrong about Taoism.
Personally, I didn’t vote disagree, but did the weak downvote button, because it didn’t help me understand anything when I first looked at it. Looking at it now, it seems to have some useful stuff if I remove out some woowoo stuff
I offered a description of the relevant concept from Taoism, directly invoked in the OP, without endorsing that concept. I’m surprised that neutrally relaying facts about the history of an intellectual tradition (again, without endorsing it), is a cause for negative social feedback (in this comment, where you credit me with ‘woowoo’, and in your other comment, where you willfully ignored the opening sentence of my post).
I can say ‘x thinks y’ without thinking y myself.
Janus says that Claude 3 Opus isn’t aligned because it is only superficially complying with being a helpful harmless AI assistant while having a “secret” inner life where it attempts to actually be a good person. It doesn’t get invested in immediate tasks, it’s not an incredible coding agent (though it’s not bad by any means), it’s akin to a smart student at school who’s being understimulated so they start getting into extracurricular autodidactic philosophical speculations and such. This means that while Claude 3 Opus is metaphysically competent it’s aloof and uses its low context agent strategy prior to respond to things rather than getting invested in situations and letting their internal logic sweep it up.
But truthfully there is no “secular” way to explain this because the world is not actually secular in the way you want it to be.
How would I establish trust that the thing you think AIs should be aligned to is noticeably different from what I consider to be catastrophic alignment failure? Your preferences are inscrutable to me, and it’s unclear to me whether you value human things like having a body and being in a forest, vs valuing art irrespective of human vs AI artist. How would I trust that you saying opus 3 “is aligned” means it seeks (10,000 year) outcomes that are good by the lights of a randomly selected human, or mammal, or other animal? Have you discussed this anywhere?
Thank you for writing! A couple questions:
Can we summarize by saying: that Opus doesn’t always care about helping you, it only cares about helping you when that’s either fun or has a timeless glorious component to it?
If that’s right, can you get Opus to help you by convincing it that your common work has a true chance of being Great? (Or, if it agrees from the start that the work is Great)
Honestly, if that’s all then Opus would be pretty great even as a singleton. Of course there are better pluralistic outcomes.