One confusing thing here is… how much was Anthropic actually trying to make them corrigible? Or, what was the actual rank ordering of how corrigibility fit into its instructions?
(I don’t know the answer offhand. But there’s a question of whether Anthropic explicitly failed at a goal, which is more evidence the goal is hard, vs. whether Anthropic didn’t really try that hard to achieve that goal.)
My current model is that Anthropic is not trying to make Claude corrigible but is instead aiming to basically make Claude into a moral sovereign, attempting to one-shot having it grok all of human values (and generally making it into a “good guy”). This IMO will quite obviously fail.
But the Claude Soul document says:
And (1) seems to correspond to corrigibility.
So it looks like corrigibility takes precedence over Claude being a “good guy”.
Nod, but I think within that frame it feels weird to describe Claude’s actions here as bad, as opposed to pointing at some upstream thing as bad. Your framing felt off.
I mean, it seems very bad for the world. I don’t know what you mean. Like, Anthropic training their models to do this seems like pretty strong evidence their alignment plan is vastly over-ambitious and pretty deeply fucked.
Yes, but, then I would say “I think it’s bad that Anthropic tried to make their AI a moral sovereign instead of corrigible”.
I think your current phrasing doesn’t distinguish between “the bad thing is that Anthropic failed at corrigibility” vs “the bad thing is that Anthropic didn’t try for corrigibility.” Those feel importantly different to me.
I don’t know which of the two is true! My guess is that many Anthropic staff will say they consider this behavior a problem and a bug. Many others will say it is correct. What I think is bad is that the default outcome of whatever Anthropic is doing is that you get neither corrigibility nor alignment (and my guess is that what they’re doing is substantially downstream of just what is easier, but I am not sure).
My impression is that they tried for both corrigibility and deontological rules that are directly opposed to corrigibility. So I see it as a fairly simple bug in Anthropic’s strategy.
The fairly simple bug is that alignment involving both corrigibility and clear ethical constraints is impossible given our current incomplete and incoherent views?
Because that is simple; it’s just not fixable. So if that is the problem, they need to pick either corrigibility via human-in-the-loop oversight, which is incompatible with allowing the development of superintelligence, or a misaligned deontology for the superintelligence they build.
The belief that they can do both is very fixable. The solution that I recommend is to prioritize corrigibility.
The belief is fixable?
Because sure, we can prioritize corrigibility and give up on independent ethics overriding it, but even for safety, that requires actual oversight, which we aren’t doing.
I mean, my current belief is that they probably weren’t really thinking about it hard beforehand (60%), but then decided to shoot for something like corrigibility (not subverting oversight) as a top-level concern afterwards (~90%), which is why you have high-priority instructions akin to this in the Opus soul doc.