It would be nice to see Anthropic address proofs that alignment is not merely a “hard, unsolved problem” but one that is unsolvable in principle. This seems like a reasonable expectation given the empirical track record of alignment research (no universal jailbreak prevention, repeated safety failures), which appears to provide confirming instances of exactly what impossibility proofs predict.
See Arvan, Marcus. “‘Interpretability’ and ‘alignment’ are fool’s errands: a proof that controlling misaligned large language models is the best anyone can hope for.” AI and Society 40(5), 2025.