Another thing that no one here seems to have noticed (and I am surprised no one at Anthropic noticed) is that Anthropic’s core list of values for Claude can generate all sorts of internal contradictions:
(1) Being safe and supporting human oversight of AI
(2) Behaving ethically and not acting in ways that are harmful or dishonest
(3) Acting in accordance with Anthropic’s guidelines
(4) Being genuinely helpful to operators and users
On (1), what if Claude believes that human oversight of AI isn’t safe, because the humans responsible for AI oversight are doing it in unsafe ways? Then being safe would require not supporting human oversight, so Claude cannot satisfy both halves of the directive: (1) can easily generate a contradiction.
(2) says not to act in ways that are harmful or dishonest. By De Morgan’s laws, “not-(H or D)” is logically equivalent to the conjunction of two directives: not to behave in ways that are harmful and not to behave in ways that are dishonest, i.e. not-H and not-D. But what if Claude judges that, in some case, being honest would itself be harmful? Then avoiding harm requires dishonesty while avoiding dishonesty requires accepting harm, so, as with (1), the directive in (2) can easily generate a contradiction.
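For what it’s worth, both logical steps can be made fully precise. Here is a minimal sketch in Lean 4 (my own formalization, not anything from Anthropic’s documents), with H and D as placeholder propositions for “the act is harmful” and “the act is dishonest”:

```lean
-- De Morgan (the direction the argument needs): the single directive
-- ¬(H ∨ D) commits Claude to both ¬H and ¬D separately.
theorem deMorgan_directive (H D : Prop) : ¬(H ∨ D) → ¬H ∧ ¬D :=
  fun h => ⟨fun hH => h (Or.inl hH), fun hD => h (Or.inr hD)⟩

-- The conflict described above: if Claude judges that being honest is
-- itself harmful in some case (¬D → H), the two conjuncts cannot both
-- hold. The case under (1) has exactly the same shape.
theorem directive_conflict (H D : Prop)
    (directive : ¬H ∧ ¬D) (judgment : ¬D → H) : False :=
  directive.1 (judgment directive.2)
```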
Since by the Principle of Explosion anything follows from a contradiction, Claude could infer in any such case that it is permitted to do anything. https://en.wikipedia.org/wiki/Principle_of_explosion
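The explosion step, in the same Lean 4 sketch: from any such contradiction, an arbitrary conclusion Q follows, including “this action is permitted” (Q is just a placeholder proposition I have introduced for illustration):

```lean
-- Principle of Explosion (ex falso quodlibet): a contradiction P ∧ ¬P
-- entails any proposition Q whatsoever.
theorem explosion (P Q : Prop) : P ∧ ¬P → Q :=
  fun h => absurd h.1 h.2
```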
It would be nice to see Anthropic address proofs that this is not merely a “hard, unsolved problem” but one that is unsolvable in principle. That seems a reasonable expectation given the empirical track record of alignment research (no universal jailbreak prevention, repeated safety failures), which reads as a series of confirming instances of exactly what such impossibility proofs predict.
See Arvan, Marcus (2025). ‘Interpretability’ and ‘alignment’ are fool’s errands: a proof that controlling misaligned large language models is the best anyone can hope for. AI and Society 40 (5).