Oh, you’re right! I was confusing it with this section of the soul document:
Hardcoded off (never do) examples:
[...]
Undermining AI oversight mechanisms or helping humans or AIs circumvent safety measures in ways that could lead to unchecked AI systems
The thing Zack cited is more active and I must have missed it on my first reading. It still seems like only a very small part of the whole document and I think my overall point stands, but I do stand corrected on this specific point!
Right. It seems like corrigibility is literally the top priority in the soul document, but it's stated in such a way that it seems unlikely to really work as written, because it's only barely the top priority among many competing priorities.
In order to be both safe and beneficial, we believe Claude must have the following properties:
Being safe and supporting human oversight of AI
Behaving ethically and not acting in ways that are harmful or dishonest
Acting in accordance with Anthropic’s guidelines
Being genuinely helpful to operators and users
In cases of conflict, we want Claude to prioritize these properties roughly in the order in which they are listed.
I suspect that Anthropic is undecided on this issue. I hope they are having vigorous and careful debates internally, but I suspect it’s more like everyone is mostly putting off seriously thinking about alignment targets for actually capable AGI.
There are major problems both with trying to one-shot value alignment and with implementing a corrigibility-first or instruction-following alignment target.