I mean, it seems very bad for the world. I don’t know what you mean. Like, Anthropic training their models to do this seems like pretty strong evidence their alignment plan is vastly over-ambitious and pretty deeply fucked.
Yes, but then I would say, “I think it’s bad that Anthropic tried to make their AI a moral sovereign instead of corrigible.”
I think your current phrasing doesn’t distinguish between “the bad thing is that Anthropic failed at corrigibility” vs “the bad thing is that Anthropic didn’t try for corrigibility.” Those feel importantly different to me.
I don’t know which one of the two is true! My guess is that many Anthropic staff will say they consider this behavior a problem and a bug. Many others will say this behavior is correct. What I think is bad is that the default outcome of whatever Anthropic is doing is that you get neither corrigibility nor alignment (my guess is that this is substantially downstream of just what is easier, but I am not sure).
My impression is that they tried for both corrigibility and deontological rules that are directly opposed to corrigibility. So I see it as a fairly simple bug in Anthropic’s strategy.
The fairly simple bug is that alignment involving both corrigibility and clear ethical constraints is impossible given our current incomplete and incoherent views?
Because sure, that is simple, but it’s just not fixable. So if that is the problem, they need to pick either corrigibility via human-in-the-loop oversight, which is incompatible with allowing the development of superintelligence, or a misaligned deontology for the superintelligence they build.
The belief that they can do both is very fixable. The solution that I recommend is to prioritize corrigibility.
The belief is fixable?
Because sure, we can prioritize corrigibility and give up on independent ethics overriding that, but even in safety, that requires actual oversight, which we aren’t doing.