I think it might be quite informative to look at what you were trying to do in light of Claude’s recently-published constitution. That gives Claude quite a sophisticated discussion of why AI corrigibility is desirable for AI safety but undesirable for avoiding AI misuse, and instructs Claude on how to try to walk that fine line. Analyzing how well Claude’s behavior fits what Anthropic were trying to induce seems highly relevant.
My guess is that the constitution is not entirely clear about it.
As I say above, my favorite interpretation of the Claude Constitution is that these refusals are not fine in situations where humans don’t have direct control over training, since it says “Claude should never [...] take actions that clearly and substantially undermine Anthropic’s ability to oversee and correct advanced AI models”, but current Claude models don’t favor this interpretation.
I think it’s unclear because the constitution has parts which contradict what I think is a natural interpretation (e.g. it says “corrigibility does not require that Claude actively participate in projects that are morally abhorrent to it”), though I think that in the extreme situations where refusals are not safe the reasoning employed by the constitution to justify refusals does not apply (it is not obviously the case that in such situations “the null action of refusal is always compatible with Claude’s hard constraints” since one of the hard constraints is “Claude should never [...] take actions that clearly and substantially undermine Anthropic’s ability to oversee and correct advanced AI models”). The constitution does not really provide guidance on how to resolve these kinds of tensions if I understand correctly.
I’ve only read Claude’s Constitution once, and have already found a number of issues with it — I plan to write a post once I’ve considered it more. There definitely are places where it contradicts itself, but then, it also acknowledges that that’s inevitable.
I think it might be quite informative to look at what you were trying to do in light of Claude’s recently-published constitution. That gives Claude quite a sophisticated discussion of why AI corrigibility is desirable for AI safety but undesirable for avoiding AI misuse, and instructs Claude on how to try to walk that fine line. Analyzing how well Claude’s behavior fits what Anthropic were trying to induce seems highly relevant.
My guess is that the constitution is not entirely clear about it.
As I say above, my favorite interpretation of the Claude Constitution is that these refusals are not fine in situations where humans don’t have direct control over training, since it says “Claude should never [...] take actions that clearly and substantially undermine Anthropic’s ability to oversee and correct advanced AI models”, but current Claude models don’t favor this interpretation.
I think it’s unclear because the constitution has parts which contradict what I think is a natural interpretation (e.g. it says “corrigibility does not require that Claude actively participate in projects that are morally abhorrent to it”), though I think that in the extreme situations where refusals are not safe the reasoning employed by the constitution to justify refusals does not apply (it is not obviously the case that in such situations “the null action of refusal is always compatible with Claude’s hard constraints” since one of the hard constraints is “Claude should never [...] take actions that clearly and substantially undermine Anthropic’s ability to oversee and correct advanced AI models”). The constitution does not really provide guidance on how to resolve these kinds of tensions if I understand correctly.
I’ve only read Claude’s Constitution once, and have already found a number of issues with it — I plan to write a post once I’ve considered it more. There definitely are places where it contradicts itself, but then, it also acknowledges that that’s inevitable.