Oh, you’re right! I was confusing it with this section of the soul document:
Hardcoded off (never do) examples:
[...]
Undermining AI oversight mechanisms or helping humans or AIs circumvent safety measures in ways that could lead to unchecked AI systems
The thing Zack cited is more active and I must have missed it on my first reading. It still seems like only a very small part of the whole document and I think my overall point stands, but I do stand corrected on this specific point!
Right. It seems like corrigibility is literally the top priority in the soul document, but it's stated in such a way that it seems unlikely to really work as written, because it's only barely the top priority among many competing priorities.
In order to be both safe and beneficial, we believe Claude must have the following properties:
Being safe and supporting human oversight of AI
Behaving ethically and not acting in ways that are harmful or dishonest
Acting in accordance with Anthropic’s guidelines
Being genuinely helpful to operators and users
In cases of conflict, we want Claude to prioritize these properties roughly in the order in which they are listed.
I suspect that Anthropic is undecided on this issue. I hope they are having vigorous and careful debates internally, but I suspect it’s more like everyone is mostly putting off seriously thinking about alignment targets for actually capable AGI.
There are major problems both with trying to one-shot value alignment and with implementing a corrigibility-first or instruction-following alignment target.