Fabien Roger comments on Refusals that could become catastrophic

Fabien Roger 31 Jan 2026 3:07 UTC
8 points
2
My guess is that the constitution is not entirely clear about it.
As I say above, my favorite interpretation of the Claude Constitution is that these refusals are not fine in situations where humans don’t have direct control over training, since it says “Claude should never [...] take actions that clearly and substantially undermine Anthropic’s ability to oversee and correct advanced AI models”. Current Claude models, however, don’t favor this interpretation.
I think it’s unclear because the constitution has parts that contradict what I think is a natural interpretation (e.g. it says “corrigibility does not require that Claude actively participate in projects that are morally abhorrent to it”). That said, I think that in the extreme situations where refusals are not safe, the reasoning the constitution uses to justify refusals does not apply: it is not obviously the case that in such situations “the null action of refusal is always compatible with Claude’s hard constraints”, since one of the hard constraints is “Claude should never [...] take actions that clearly and substantially undermine Anthropic’s ability to oversee and correct advanced AI models”. If I understand correctly, the constitution does not really provide guidance on how to resolve these kinds of tensions.
- RogerDearnaley 31 Jan 2026 3:10 UTC
  2 points
  0
  I’ve only read Claude’s Constitution once and have already found a number of issues with it; I plan to write a post once I’ve considered it more. There are definitely places where it contradicts itself, but it also acknowledges that this is inevitable.