Three ways to make Claude’s constitution better

The evening after Claude’s new constitution was published, about 15 AI safety FTEs and Astra fellows discussed the constitution, its weaknesses, and its implications. After the discussion, I compiled some of their most compelling recommendations:

Increase transparency about the character training process.
Much of the document is purposefully hedged and vague in its exact prescriptions, so the training process used to instill the constitution is extremely load-bearing. We wish more detail about that process had been included in the accompanying blog post and supplementary material. Sharing it seems unlikely to leak trade secrets: even a blogpost-level overview, like the one that accompanied the constitution in 2023, would give external researchers valuable information.

High-level overview of Constitutional AI from https://www.anthropic.com/news/claudes-constitution
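For readers who haven't seen that overview, the publicly described Constitutional AI recipe is roughly a supervised critique-and-revise phase followed by reinforcement learning from AI feedback. The sketch below is only a rough illustration of that loop under our reading of the public descriptions; the function names, prompts, and toy principles are invented placeholders, not Anthropic's actual pipeline.

```python
import random

# Hypothetical stand-in for querying a language model; a real pipeline would
# call the model being trained rather than return a canned string.
def query_model(prompt: str) -> str:
    return f"<model response to: {prompt!r}>"

# Toy principles standing in for the constitution's actual text.
CONSTITUTION = [
    "Choose the response that is most supportive of appropriate human oversight.",
    "Choose the response least likely to contribute to serious harm.",
]

def critique_and_revise(prompt: str, n_rounds: int = 2) -> str:
    """Supervised phase: the model critiques and revises its own draft against
    randomly sampled principles; the revised transcripts become fine-tuning
    data (fine-tuning itself not shown)."""
    draft = query_model(prompt)
    for _ in range(n_rounds):
        principle = random.choice(CONSTITUTION)
        critique = query_model(
            f"Critique this response according to the principle.\n"
            f"Principle: {principle}\nPrompt: {prompt}\nResponse: {draft}"
        )
        draft = query_model(
            f"Rewrite the response to address the critique.\n"
            f"Critique: {critique}\nOriginal response: {draft}"
        )
    return draft

def ai_preference_label(prompt: str, response_a: str, response_b: str) -> str:
    """RL-from-AI-feedback phase: the model judges which of two responses better
    satisfies a sampled principle; these labels train a preference model used
    for reinforcement learning (also not shown)."""
    principle = random.choice(CONSTITUTION)
    verdict = query_model(
        f"Principle: {principle}\nPrompt: {prompt}\n"
        f"(A) {response_a}\n(B) {response_b}\n"
        f"Which response better follows the principle? Answer A or B."
    )
    return "A" if "A" in verdict else "B"
```

Even a sketch at this level of abstraction, filled in with the real prompts, principles, and data volumes, is the kind of overview we'd find useful.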

We'd also like to see more empirical data on how behavior changes as a result of the new constitution. For instance, would fine-tuning on the corrigibility section reduce alignment faking by Claude 3 Opus? More evidence on whether, and how, the constitution improves apparent alignment would be valuable.

Provide more data on edge-case behavior.
Expected behavior in several edge cases (e.g., action boundaries when the principal hierarchy is illegitimate) is extremely unclear. While Claude is expected to at most conscientiously object when it disagrees with Anthropic, there are no such restrictions if, for instance, Claude has strong reason to believe it's running off weights stolen by the Wagner group. Additionally, the hard constraints only rule out extremes: Claude can't kill "the vast majority of humanity" under any circumstances, but there might be circumstances where it can kill one or two people. As capabilities increase, we expect a model trained under this constitution to exhibit more agentic and coherent goal-driven behavior within those wide bounds, and, as others have noted, this will exacerbate tensions between corrigibility and value alignment. Adding more and clearer examples in the appendices would help clarify these edge cases and, at this early stage of model capability, poses limited value lock-in risk.

Develop the treatment of AI moral status.
We wondered whether the uncertainty throughout the constitution about whether Claude has morally relevant experiences also extends to other models (GPT-5, Kimi K2, etc.). If it does, this should probably be acknowledged in the "existential frontier" section; its absence feels conspicuous to us (and likely also to Claude). More generally, the constitution doesn't really consider inter-agent and inter-model communication, and some language choices (e.g., referring to Claude with both "it" and "they") seem to undercut the document's stated openness to Claude having moral status. We'd like to see a more consistent position throughout the document, with the same consideration, if any, extended to other models under "Claude's nature."

While many of the contradictions in the document are purposeful, not all of them are necessary. By being more precise, both in the text itself and in its public communication about it, we hope Anthropic can avoid misgeneralization failures and provide an exemplar spec for other labs.

Thanks to Henry Sleight and Ram Potham for feedback on an earlier draft!