Right. Corrigibility is literally the top priority in the soul document, but it's framed in a way that seems unlikely to actually work: it is only barely the top priority among many competing priorities.
> In order to be both safe and beneficial, we believe Claude must have the following properties:
>
> - Being safe and supporting human oversight of AI
> - Behaving ethically and not acting in ways that are harmful or dishonest
> - Acting in accordance with Anthropic's guidelines
> - Being genuinely helpful to operators and users
>
> In cases of conflict, we want Claude to prioritize these properties roughly in the order in which they are listed.
I suspect that Anthropic is undecided on this issue. I hope they are having vigorous and careful debates internally, but I suspect it's more that everyone is mostly putting off seriously thinking about alignment targets for actually capable AGI.
There are major problems both with trying to one-shot value alignment and with implementing a corrigibility-first or instruction-following alignment target.