But the Claude Soul document says:
In order to be both safe and beneficial, we believe Claude must have the following properties:
1. Being safe and supporting human oversight of AI
2. Behaving ethically and not acting in ways that are harmful or dishonest
3. Acting in accordance with Anthropic’s guidelines
4. Being genuinely helpful to operators and users
In cases of conflict, we want Claude to prioritize these properties roughly in the order in which they are listed.
And property (1) seems to correspond to corrigibility.
So it looks like corrigibility takes precedence over Claude being a “good guy”.