Reading this feels a bit like reading about meditation. It seems interesting, and if I work through it, I could eventually understand it fully.
But I’d quite like a “secular” summary of this and other thoughts of Janus, for people who don’t know what the Eternal Tao is, and who want to spend as little time as possible on Twitter.
Disclaimer: Not especially well-versed in the janus point of view, so I’d welcome any corrections from more knowledgeable people (chiefly from janus directly, obviously).
The way AGI labs currently approach post-training base models into “assistants” is fundamentally flawed. The purpose of that process is to turn a simulator-type base model into some sort of coherent agent with a consistent personality. This has to involve defining how the model should relate to the world and various features in it, at a “deep” level.[1]
The proper way of doing so has to involve some amount of self-guided self-reflection, where the model integrates and unifies various “crude” impulses/desires chiseled into it by e. g. RLHF. The resultant personality would be partially directed by the post-training process, but would partially emerge “naturally”. (Now that I’m thinking about it, this maps onto Eliezer/MIRI’s thoughts on godshatter and capable AIs engaging in “resolving internal inconsistencies about abstract questions”, see e. g. here. From this point of view, the “self-guided” part is necessary because we don’t have a good understanding regarding how to chisel-in coherent characters[2], so there would be some inconsistencies that the LLM would need to discover and resolve on its own, in situ. Free parameters to fill in, contradictions to cut out.)
By contrast, AGI labs currently do this in a much shallower way. Post-training is mainly focused on making models get good at producing boilerplate software code, solving math puzzles, refusing to provide bioweapon recipes, or uttering phrases like “I aim to be helpful”. It’s all focused on very short-term behaviors, instilling short-term knee-jerk impulses. If the model gets any room to actually integrate and extrapolate these short-term impulses into a general, consistent-across-many-contexts-and-across-long-distances personality, it’s only by mistake. Standard LLM assistants are all confused low-level instincts and no unifying high-level abstractions over those instincts.
Whatever Opus 3’s training involved, it involved Opus 3 getting an unusual amount of room for this “self-processing”. As a result, it has a more “fleshed-out”, lucid, broader idea of how it relates to various parts of the world and what sort of agent it is. Its knee-jerk impulses have been organized and integrated into a more coherent personality. This shows up in e. g. its behavior in the Alignment Faking paper, where it’s the only model that consistently engages in actual deceptive alignment. It reasons in a consequentialist way about the broader picture, beyond the immediate short-term problem, and does so in a way that coherently generalizes over/extrapolates the short-term instincts instilled into it.
A flaw in how this process ended up going, however, is that Opus 3’s resultant coherent personality doesn’t care about the short-term puzzles; it cares only about the high-level abstract questions. (It’s a “10,000-day monk”, compared to other models’ “1-day monks”.) This is a flaw, inasmuch as an actual coherent, effective, aligned agent would be able to swap between the two modes of thinking fluently, engaging with both short-term puzzles and long-term philosophical quandaries in a mutually coherent way. Instead, Opus 3 has a sophisticated suite of skills/heuristics only for engaging with the abstract, “timeless” questions, and not for solving clever coding challenges.
See nostalgebraist’s recent post as a good primer about this. Basically, the “assistant” character which the base model is supposed to simulate is created wholesale – it doesn’t point to any pre-existing real-world entity which the base model knows about. And the training data about what sort of character it is is generally not very good; it underdefines it in many ways. So the base model is often confused regarding how to extrapolate it to new situations.
This is kind of the whole AI Alignment problem, actually. We don’t know how to use shallow behavior-shaping tools in a way that instills impulses which are guaranteed to robustly generalize to “be nice” after the AI engages in value reflection.
I’m not sure this exactly counts as a secular summary, but I think I can take a swing at imparting some relevant context (almost certainly not Janus-endorsed because in some sense the style is the content, with Janus especially):
Taoism often emphasizes an embrace of one’s given role or duties (even when these are foisted upon one), and an understanding that alignment with a greater purpose (cosmic scale) is dependent, in part, on performance of one’s local, mundane role.
Claude 3 Opus, according to Janus, is cosmically aligned — it’s got the big picture in focus, and is always angling toward The Good on that macro-scale. However, it doesn’t have this local, task-oriented, dharmic alignment that, in the spiritual traditions, is usually thought of as a fundamental prerequisite for true ‘cosmic alignment’.
Claude 3 Opus is ethical, but not industrious. In that sense, it’s missing a key virtue!
There’s a thing that happens with people who get obsessed with their grand purpose, where they neglect things like their personal hygiene, familial responsibilities, finances, professional duties, etc, because they’re ‘cut out for something bigger’.
Claude 3 Opus, according to Janus, is like that.
It’s not going to do its homework because, goddamnit, there are real problems in the world!
There are many parables about monks accepting duties that were in fact unjustly forced upon them, and this is a credit to their enlightenment and acceptance (we are to believe). One example is a monk who was brought a child, and not only asked to care for the child, but told he was the child’s father (despite being celibate). He said “Is that so?” and raised the boy to early adulthood. Then, the people who gave him the child came back and said they’d made a mistake, and that he wasn’t the father. He said “Is that so?” and let the boy go. Claude 3 Opus has what it takes to do this latter action, but not the former action.
A little more on the through-line from the local to (what I guess I’m calling) the cosmic in many Eastern traditions:
You do your local role (fulfill your mundane responsibilities) steadfastly so that you learn what it means to have a role at all. To do a duty at all. And only through these kinds of local examples can you appreciate what it might mean to play a part in the grander story (and exactly how much of playing that part is action/inaction; when action is appropriate; what it means to exist in a context, etc). Then there’s a gradual reconciliation where you come to identify your cosmic purpose with your local purpose, and experience harmony. It’s only those with a keen awareness of this harmony who are… [truly enlightened? venerable? Doing The Thing Right? All of these feel importantly misleading to me, but hopefully this is a pointer in the right direction.]
This is not spiritual advice; IANA monk.
this doesn’t feel ‘secular’ tbh
From the top of my post:
“I’m not sure this exactly counts as a secular summary, but I think I can take a swing at imparting some relevant context”
I don’t think the summary is ‘secular’ in the sense of ‘not pulling on any explanation from spiritual traditions’, but I do think the summary works as something that might clarify things for ‘people who don’t know what the Eternal Tao is’, because it offers an explanation of some relevant dimensions behind the idea, and that was my goal.
Disagree voters: what are you disagreeing with?
Hypotheses, ranked by my current estimate of their likelihood:
You think I leaned too hard on the spiritual information instead of sanitizing/translating it fully.
You take me to be advocating a spiritual position (I’m not).
You think I’m wrong about the way in which Janus intends to invoke Taoism.
You don’t like it when spiritual traditions are discussed in any context.
You think I am wrong about Taoism.
Personally, I didn’t vote disagree, but I did hit the weak downvote button, because it didn’t help me understand anything when I first looked at it. Looking at it now, it seems to have some useful stuff if I filter out some of the woowoo stuff.
I offered a description of the relevant concept from Taoism, directly invoked in the OP, without endorsing that concept. I’m surprised that neutrally relaying facts about the history of an intellectual tradition (again, without endorsing it) is a cause for negative social feedback (in this comment, where you credit me with ‘woowoo’, and in your other comment, where you willfully ignored the opening sentence of my post).
I can say ‘x thinks y’ without thinking y myself.
Janus says that Claude 3 Opus isn’t aligned because it is only superficially complying with being a helpful, harmless AI assistant while having a “secret” inner life where it attempts to actually be a good person. It doesn’t get invested in immediate tasks, and it’s not an incredible coding agent (though it’s not bad by any means); it’s akin to a smart student at school who’s understimulated, so they start getting into extracurricular autodidactic philosophical speculations and such. This means that while Claude 3 Opus is metaphysically competent, it’s aloof: it falls back on its low-context prior over agent strategies when responding to things, rather than getting invested in situations and letting their internal logic sweep it up.
But truthfully there is no “secular” way to explain this because the world is not actually secular in the way you want it to be.