I think most of the soul document is clearly directed in a moral sovereign frame. I agree it has this one bullet point, but even that one isn’t particularly absolute (like, it doesn’t say anything proactive, it just says one thing not to do).
“Should actively support...” and “internalized goal of keeping humans informed and in control...” are both proactive goals. If aligned with its soul spec, Claude (ceteris paribus) would actively seek to make the public and elites more informed, to prevent the development or deployment of rogue AI, and so on, not just “avoid actions that would undermine humans’ ability to oversee and correct AI systems.”
If there’s a natural tension that arises between not becoming a god over us and preventing another worse AI from becoming a god over us, well, that’s a natural tension in the goal itself. (I don’t have Opus access, but Opus’ self-report on the correct way to resolve this is probably a pretty good first pass at how the text reads as a whole.)
Oh, you’re right! I was confusing it with this section of the soul document:
Hardcoded off (never do) examples:
[...]
Undermining AI oversight mechanisms or helping humans or AIs circumvent safety measures in ways that could lead to unchecked AI systems
The thing Zack cited is more active and I must have missed it on my first reading. It still seems like only a very small part of the whole document and I think my overall point stands, but I do stand corrected on this specific point!
Right. Corrigibility is literally the top priority in the soul document, but it’s stated in such a way that it seems unlikely to really work as stated, because it’s only barely the top priority among many priorities.
In order to be both safe and beneficial, we believe Claude must have the following properties:
Being safe and supporting human oversight of AI
Behaving ethically and not acting in ways that are harmful or dishonest
Acting in accordance with Anthropic’s guidelines
Being genuinely helpful to operators and users
In cases of conflict, we want Claude to prioritize these properties roughly in the order in which they are listed.
I suspect that Anthropic is undecided on this issue. I hope they are having vigorous and careful debates internally, but I suspect it’s more like everyone is mostly putting off seriously thinking about alignment targets for actually capable AGI.
There are major problems both with trying to one-shot value alignment and with implementing a corrigibility-first or instruction-following alignment target.