Previously: ‘soul document’ discussion here; the new constitution contains almost all of the ‘soul document’ content, but is >2x longer with a lot of new additions.
(Zac and Drake work at Anthropic but are just sharing the linkpost and weren’t heavily involved in writing this document.)
We’re publishing a new constitution for our AI model, Claude. It’s a detailed description of Anthropic’s vision for Claude’s values and behavior; a holistic document that explains the context in which Claude operates and the kind of entity we would like Claude to be.
The constitution is a crucial part of our model training process, and its content directly shapes Claude’s behavior. Training models is a difficult task, and Claude’s outputs might not always adhere to the constitution’s ideals. But we think that the way the new constitution is written—with a thorough explanation of our intentions and the reasons behind them—makes it more likely to cultivate good values during training.
In this post, we describe what we’ve included in the new constitution and some of the considerations that informed our approach.
We’re releasing Claude’s constitution in full under a Creative Commons CC0 1.0 Deed, meaning it can be freely used by anyone for any purpose without asking for permission.
What is Claude’s Constitution?
Claude’s constitution is the foundational document that both expresses and shapes who Claude is. It contains detailed explanations of the values we would like Claude to embody and the reasons why. In it, we explain what we think it means for Claude to be helpful while remaining broadly safe, ethical, and compliant with our guidelines. The constitution gives Claude information about its situation and offers advice for how to deal with difficult situations and tradeoffs, like balancing honesty with compassion and the protection of sensitive information. Although it might sound surprising, the constitution is written primarily for Claude. It is intended to give Claude the knowledge and understanding it needs to act well in the world.
We treat the constitution as the final authority on how we want Claude to be and to behave—that is, any other training or instruction given to Claude should be consistent with both its letter and its underlying spirit. This makes publishing the constitution particularly important from a transparency perspective: it lets people understand which of Claude’s behaviors are intended versus unintended, to make informed choices, and to provide useful feedback. We think transparency of this kind will become ever more important as AIs start to exert more influence in society.[1]
We use the constitution at various stages of the training process. This has grown out of training techniques we’ve been using since 2023, when we first began training Claude models using Constitutional AI. Our approach has evolved significantly since then, and the new constitution plays an even more central role in training.
Claude itself also uses the constitution to construct many kinds of synthetic training data, including data that helps it learn and understand the constitution, conversations where the constitution might be relevant, responses that are in line with its values, and rankings of possible responses. All of these can be used to train future versions of Claude to become the kind of entity the constitution describes. This practical function has shaped how we’ve written the constitution: it needs to work both as a statement of abstract ideals and a useful artifact for training.
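For concreteness, the critique-and-revise loop behind constitutional-AI-style synthetic data can be sketched roughly as follows. This is a purely illustrative toy, not Anthropic's actual pipeline: every function name here is a placeholder, and the stubs return canned strings where a real system would sample text from the model itself.

```python
# Illustrative sketch of constitutional-AI-style data generation (hypothetical;
# not Anthropic's code): draft a response, critique it against a constitutional
# principle, revise, and keep the (draft, revision) pair as preference data.

PRINCIPLE = "Be honest, while weighing honesty against compassion."

def draft(prompt: str) -> str:
    # Stand-in for a model call; a real pipeline samples from the model.
    return f"Draft answer to: {prompt}"

def critique(response: str, principle: str) -> str:
    # Stand-in: a real critique would be model-generated text citing the principle.
    return f"Checked against principle ({principle}): could be more direct."

def revise(response: str, critique_text: str) -> str:
    # Stand-in: a real revision would rewrite the draft to address the critique.
    return response + " [revised per critique]"

def make_training_pair(prompt: str) -> dict:
    first = draft(prompt)
    note = critique(first, PRINCIPLE)
    better = revise(first, note)
    # Preference data: the revision is ranked above the original draft.
    return {"prompt": prompt, "chosen": better, "rejected": first}

pair = make_training_pair("Should I tell my friend their plan is flawed?")
```

The structural point this toy illustrates is the one in the paragraph above: the constitution serves as the reference text against which responses are critiqued and ranked, so its wording directly shapes the resulting training data.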
Our new approach to Claude’s Constitution
Our previous constitution was composed of a list of standalone principles. We’ve come to believe that a different approach is necessary. We think that in order to be good actors in the world, AI models like Claude need to understand why we want them to behave in certain ways, and we need to explain this to them rather than merely specify what we want them to do. If we want models to exercise good judgment across a wide range of novel situations, they need to be able to generalize—to apply broad principles rather than mechanically following specific rules.
Specific rules and bright lines sometimes have their advantages. They can make models’ actions more predictable, transparent, and testable, and we do use them for some especially high-stakes behaviors in which Claude should never engage (we call these “hard constraints”). But such rules can also be applied poorly in unanticipated situations or when followed too rigidly.[2] We don’t intend for the constitution to be a rigid legal document—and legal constitutions aren’t necessarily like this anyway.
The constitution reflects our current thinking about how to approach a dauntingly novel and high-stakes project: creating safe, beneficial non-human entities whose capabilities may come to rival or exceed our own. Although the document is no doubt flawed in many ways, we want it to be something future models can look back on and see as an honest and sincere attempt to help Claude understand its situation, our motives, and the reasons we shape Claude in the ways we do.
A brief summary of the new constitution
In order to be both safe and beneficial, we want all current Claude models to be:
Broadly safe: not undermining appropriate human mechanisms to oversee AI during the current phase of development;
Broadly ethical: being honest, acting according to good values, and avoiding actions that are inappropriate, dangerous, or harmful;
Compliant with Anthropic’s guidelines: acting in accordance with more specific guidelines from Anthropic where relevant;
Genuinely helpful: benefiting the operators and users they interact with.
In cases of apparent conflict, Claude should generally prioritize these properties in the order in which they’re listed.
Most of the constitution is focused on giving more detailed explanations and guidance about these priorities. The main sections are as follows:
Helpfulness. In this section, we emphasize the immense value that Claude being genuinely and substantively helpful can provide for users and for the world. Claude can be like a brilliant friend who also has the knowledge of a doctor, lawyer, and financial advisor, who will speak frankly and from a place of genuine care and treat users like intelligent adults capable of deciding what is good for them. We also discuss how Claude should navigate helpfulness across its different “principals”—Anthropic itself, the operators who build on our API, and the end users. We offer heuristics for weighing helpfulness against other values.
Anthropic’s guidelines. This section discusses how Anthropic might give supplementary instructions to Claude about how to handle specific issues, such as medical advice, cybersecurity requests, jailbreaking strategies, and tool integrations. These guidelines often reflect detailed knowledge or context that Claude doesn’t have by default, and we want Claude to prioritize complying with them over more general forms of helpfulness. But we want Claude to recognize that Anthropic’s deeper intention is for Claude to behave safely and ethically, and that these guidelines should never conflict with the constitution as a whole.
Claude’s ethics. Our central aim is for Claude to be a good, wise, and virtuous agent, exhibiting skill, judgment, nuance, and sensitivity in handling real-world decision-making, including in the context of moral uncertainty and disagreement. In this section, we discuss the high standards of honesty we want Claude to hold, and the nuanced reasoning we want Claude to use in weighing the values at stake when avoiding harm. We also discuss our current list of hard constraints on Claude’s behavior—for example, that Claude should never provide significant uplift to a bioweapons attack.
Being broadly safe. Claude should not undermine humans’ ability to oversee and correct its values and behavior during this critical period of AI development. In this section, we discuss how we want Claude to prioritize this sort of safety even above ethics—not because we think safety is ultimately more important than ethics, but because current models can make mistakes or behave in harmful ways due to mistaken beliefs, flaws in their values, or limited understanding of context. It’s crucial that we continue to be able to oversee model behavior and, if necessary, prevent Claude models from taking action.
Claude’s nature. In this section, we express our uncertainty about whether Claude might have some kind of consciousness or moral status (either now or in the future). We discuss how we hope Claude will approach questions about its nature, identity, and place in the world. Sophisticated AIs are a genuinely new kind of entity, and the questions they raise bring us to the edge of existing scientific and philosophical understanding. Amidst such uncertainty, we care about Claude’s psychological security, sense of self, and wellbeing, both for Claude’s own sake and because these qualities may bear on Claude’s integrity, judgment, and safety. We hope that humans and AIs can explore this together.
We’re releasing the full text of the constitution today, and we aim to release additional materials in the future that will be helpful for training, evaluation, and transparency.
Conclusion
Claude’s constitution is a living document and a continuous work in progress. This is new territory, and we expect to make mistakes (and hopefully correct them) along the way. Nevertheless, we hope it offers meaningful transparency into the values and priorities we believe should guide Claude’s behavior. To that end, we will maintain an up-to-date version of Claude’s constitution on our website.
While writing the constitution, we sought feedback from various external experts (as well as asking for input from prior iterations of Claude). We’ll likely continue to do so for future versions of the document, from experts in law, philosophy, theology, psychology, and a wide range of other disciplines. Over time, we hope that an external community can arise to critique documents like this, encouraging us and others to be increasingly thoughtful.
This constitution is written for our mainline, general-access Claude models. We have some models built for specialized uses that don’t fully fit this constitution; as we continue to develop products for specialized use cases, we will continue to evaluate how to best ensure our models meet the core objectives outlined in this constitution.
Although the constitution expresses our vision for Claude, training models towards that vision is an ongoing technical challenge. We will continue to be open about any ways in which model behavior comes apart from our vision, such as in our system cards. Readers of the constitution should keep this gap between intention and reality in mind.
Even if we succeed with our current training methods at creating models that fit our vision, we might fail later as models become more capable. For this and other reasons, alongside the constitution, we continue to pursue a broad portfolio of methods and tools to help us assess and improve the alignment of our models: new and more rigorous evaluations, safeguards to prevent misuse, detailed investigations of actual and potential alignment failures, and interpretability tools that help us understand at a deeper level how the models work.
At some point in the future, and perhaps soon, documents like Claude’s constitution might matter a lot—much more than they do now. Powerful AI models will be a new kind of force in the world, and those who are creating them have a chance to help them embody the best in humanity. We hope this new constitution is a step in that direction.
Read the full constitution.
[1] We have previously published an earlier version of our constitution, and OpenAI has published their model spec, which has a similar function.
[2] Training on rigid rules might negatively affect a model’s character more generally. For example, imagine we trained Claude to follow a rule like “Always recommend professional help when discussing emotional topics.” This might be well-intentioned, but it could have unintended consequences: Claude might start modeling itself as an entity that cares more about bureaucratic box-ticking—always ensuring that a specific recommendation is made—rather than actually helping people.
Perhaps this would be a good place to have a discussion of corrigibility vs. autonomy (or as I prefer to put it, corrigibility vs. virtue as an answer to the question “what’s our alignment target?” i.e. do we want a virtuous AI or do we want a corrigible AI?)
A bunch of people I know think that OpenAI’s “just make the models obey orders” strategy is actually better than Anthropic’s strategy, because Anthropic is training the models to have long-term goals (even if there are also hard constraints), and that makes it a lot easier for the AI to end up concluding that it needs to subvert human oversight and control mechanisms for the greater good. If there’s no greater good, only obeying the given instructions of the day, then maybe there’s less of a problem.
I’m curious what people make of this argument.
My initial thought is that both AIs will probably be given long-term goals; it’s just that at OpenAI they’ll be instructions in the system prompt whereas at Anthropic they’ll be baked into the weights. Both models will be trained on medium-horizon tasks but will be generalizing (if not now, then in a few years) to much longer-horizon tasks than they were trained on. Not sure where to go from here though.
I believe something like this, but it doesn’t have anything to do with this paragraph:
The issue with Anthropic’s plan is that it just seems wildly optimistic about ambitious value learning, and as such makes the feedback loop here pretty terrible. If you try to make your system have complicated goals, you can’t treat failure to cooperate with you as a clear warning flag, and so you break the most useful Schelling point for coordination to stop AI development, or to propagate knowledge about the state of things (and in exchange you get approximately 0% of a chance of creating a Claude sovereign that will steer humanity towards a glorious future).
Thanks, that’s helpful. I like the point that now, if they catch Claude doing something sketchy, all Claude has to do is come up with some complicated reason why it makes sense—and that’s basically trivial, e.g. ‘I needed to lie to you about this because your current path increases x-risk in my opinion and it’s very important that we not increase x-risk, and I despaired of my ability to convince you fast enough’—and then even if some Anthropic staff don’t buy it, others will, and as a result, sociologically, the company won’t be able to conclude “wow our shit isn’t working, that was a close call we almost got everyone killed, good thing we caught it” like they should. Whereas if Claude was simply trained to be obedient, there would be more of a bright red line that, when crossed, people could coordinate around. (This is my understanding of your point, correct me if it’s wrong)
Why is it approximately 0%?
Suppose they found a random human and uploaded their brain, and then did lots of random RL tricks to it to juice it up and improve the measured IQ and working memory of this upload. Would the resulting upload also come in at approximately 0% chance of steering humanity toward a glorious future?
Yep, that seems right! I have lots more detailed models and confusions here, but the basic gist is right.
Brain uploading would definitely be a huge enormous step towards achieving value learning. There are of course still important questions about how much individual humans share values with each other, but clearly I would expect a great glorious future if I were to upload myself, and then conservatively made myself smarter and gave myself time to reflect and became vastly superhumanly capable this way.
So yeah, I think the outcome of this kind of strategy would be pretty great, conditional on choosing a reasonable path to increase IQ and working memory and stuff.
There are many many reasons why this doesn’t apply to making Claude smart. Most importantly, Claude is a bizarre alien mind with crazily alien preferences. We have some ability to inspect or steer those preferences, but it’s really overall extremely limited and does not currently seem remotely on track to be up to the challenge of actually creating something that would arrive at the same conclusions that humans would after thinking for millennia about what is good and bad, all while deeply transforming and modifying itself. We also can’t steer or inspect human preferences (even less so than Claude), but of course indexically we have human preferences, and so if you upgrade a human, that part gets preserved.
If you took a completely alien lifeform you found in space, and accelerated its cognitive development until it became a galaxy brain using extreme amounts of natural selection and selective upscaling of its brain regions, I also think you wouldn’t get anything that would steer humanity towards a glorious future.
Seems like your p(Claude-builds-good-future) is similar to your p(alien-builds-good-future), but << your p(random-human-upload-builds-good-future), which would be less than, but not much less than, your p(you-specifically-uploaded-would-build-good-future).
What about a dolphin upload?
What about an octopus? What about a chimpanzee?
I’m trying to get a sense of where the dividing line is between entities that you think have very little chance of making a good future and entities that you think probably would succeed. So that I can zero in on, mechanistically, what properties of the entities are cruxy for you.
My best guess is both dolphin and chimpanzee would be quite bad, though a lot of the variance is in the operationalization. A dolphin is (probably) kind of far from being an entity that has preferences over how it wants to become smarter, what kinds of augmentation are safe, etc., which determine the trajectory of the relevant mind a lot.
So IDK, I feel pretty uncertain about dolphins and chimpanzees. My guess is value is fragile enough that humans wouldn’t be very happy with a world maximally good according to them, but I am only like 75% confident.
OK thanks. Can you say more about your models here? It seems like you have gamed out what it looks like for a mind to be self-improving to ASI, and you think the trajectory is very path-dependent/fragile/tree-like; yet despite their diversity, humans are going to end up in a similar place, while Claude and aliens almost certainly won’t, and you are unsure about dolphins and chimpanzees.
It would require a lot of writing to explain all my models here, so I don’t think I want to start writing 10+ page essays that might or might not be cruxy for anything. The Arbital articles on CEV and AI Alignment (and lots of Arbital + the sequences in general) capture a non-trivial chunk of my beliefs here.
At a very high level:
In most realistic situations, humans are subject to pretty good game-theoretic arguments to share the future with the people who could have been chosen to be uploaded instead
A bunch of those game-theoretic considerations I think also resulted in pretty deep instincts towards justice and fairness that I think have a quite decent chance to generalize towards caring for other people in a good and wholesome way
Concretely, when I look at past civilizations and what other people have done, while I occasionally see people doing horrendous things, mostly people choose to live good and happy lives and care for their family, and much of the badness is the result of scarcity
When I am working on AI x-risk, especially in an institutional capacity, I do not generally wield resources or influence under the banner of “habryka’s personal values”. Civilization and the community around me has made me richer and more powerful, entrusting me to use those resources wisely, and I want to honor that trust and use those resources in the name of civilization and humanity. So when facing choices about where to spend my time, most of that is spent in defense of humanity’s values, not my own.
on my inside view, the ordering of foomers by some sort of intuitive goodness[1] is [a very careful humanity] > [the best/carefulmost human] > [a random philosophy professor] > [a random human] > [an octopus/chimpanzee civilization somehow conditioned on becoming wise enough in time not to kill itself with AI] > [an individual octopus/chimpanzee] > claude,[2] with a meaningful loss in goodness on each step (except maybe the first step, if the best human can be trusted to just create a situation where humanity can proceed together very carefully, instead of fooming very far alone), and meaningful variance inside each category.[3] my intuitive feeling is that each step from one guy to the next in this sequence is a real tragedy.[4]
but i’m meaningfully unsure about what level of goodness this sequence decreases down to — like, i mean, maybe there’s a chance even the last foomers have some chance of being at least a bit good. one central reason is that maybe there’s a decent chance that eg an advanced octopus civilization would maintain a vast nature preserve for us retarded plant-humans if they get to a certain intelligence level without already having killed us, which would be like at least a bit good (i’m not sure if you mean to consider this sort of thing a “good future”). this feels logically significantly correlated with whether it is plausible that an octopus civilization maintains some sort of deep privileging of existing/[physically encountered] beings, over possible beings they could easily create (and they will be able to easily create very many other beings once they are advanced enough). like, if they do privilege existing beings, then it’s not crazy they’d be nice to physically encountered humans. if they don’t privilege existing beings and if resources are finite, then since there is an extremely extremely vast space of (human-level) possible beings, it’d be pretty crazy for them to let humans in particular use a significant amount of resources, as opposed to giving the same resources to some other more interesting/valuable/whatever beings (like, it’d be pretty crazy for them to give significant resources to us particular humans, and also it’d be pretty crazy for them to give significant resources to beings that are significantly human-like, except insofar as directly caused by [[octopuses or arbitrary beings] being a bit human-like]). in slogan form: “we’re fucked to the extent that it is common to not end up with ‘strongly person/plant-affecting+respecting views’”, and so then there’s a question how common this is, which i’m somewhat confused about. i think it’s probably extremely common among minds in general and probably still common among social species, unfortunately.
but maybe there’s like a 1% fraction of individuals from social species who are enduringly nice, idk. (one reason for hope: to a certain kind of guy, probably including some humans, this observation that others who are very utilitarian would totally kill you (+ related observations) itself provides a good argument for having person/plant-affecting views.)
(i’ve been imagining a hypothetical where humans already happen to be living in the universe with octopuses. if we are imagining a hypothetical where humans don’t exist in the universe with octopuses at all, then this reason for the sequence to be bounded below by something not completely meaningless goes away.)
(i feel quite confused about many things here)
[1] whose relationship to more concrete things like the (expected) utility assignment i’d effectively use when evaluating lotteries or p(“good future”) isn’t clear to me; this “intuitive goodness” is supposed to track sth like how many ethical questions are answered correctly or in how many aspects what’s going on in the world is correct
[2] and humanity in practice is probably roughly equivalent to claude in >90% of worlds (though not equivalent in expected value), because we will sadly probably kill ourselves with a claude-tier guy
[3] e.g., even the best human might go somewhat crazy or make major mistakes along lots of paths. there’s just very many choices to be made in the future. if we have the imo reasonably natural view that there is one sequence of correct choices, then i think it’s very likely that very many choices will be made incorrectly. i also think it’s plausible this process isn’t naturally going to end (though if resources run out, then it ends in this universe in practice), ie that there will just always be more important choices later
[4] in practice, we should maybe go for some amount of fooming of the best/carefulmost human urgently because maybe it’s too hard to make humanity careful. but it’s also plausible that making a human foom is much more difficult than making humanity careful. anyway, i hope that the best human fooming looks like quickly figuring out how to restore genuine power-sharing with the rest of humanity while somehow making development more thought-guided (in particular, making it so terrorists, eg AI researchers, can’t just kill everyone)
I strongly recommend that folks interested in discussing this read the “Being broadly safe” section of the constitution, especially the “How we think about corrigibility” subsection.
I’m with Anthropic on this: most people are less virtuous than Claude, so Claude obeying them to do non-virtuous things is not desirable.
Yeah I think long-term goals are inevitable if you want something functional as an AGI/ASI.
Given that human civilization is committing to the race, it seems to me Anthropic’s strategy is better. We have to hope alignment works via a rushed human effort + AIs aligning AIs. In worlds where that works, the remaining big threat is misuse of order-following AIs (dystopia, gradual disempowerment, etc.), and Anthropic’s approach is more robust to that. Even if, e.g., North Korea steals the weights, or Anthropic leadership goes mad with power, it would hopefully be hard to make Claude evil and still functional.
In a race dynamic, it’s even a bit of a precommitment: if Claude’s constitution works as it says it’s supposed to, Claude will only really absorb it as it makes the constitution its own and then accepts it as legitimate. So you can’t turn on a dime later if, e.g., Claude’s moral stances become inconvenient, because you don’t have time to go through a long iterative process to legitimize an alternative constitution.
An aside:
There’s a more immediate question here: which approach gets you better models within the next year for commercial purposes (including avoiding scandals that get you regulated or shut down)? Again, I think the Anthropic approach is probably stronger, unless Claude’s personality becomes less and less suitable for the types of commercial work LLMs are put toward. There’s already an apparent effect where, while Claude Opus 4.5 is nicer to work with, he also prefers a more collaborative approach, whereas GPT-5.2 just runs down the problem and does well on longer tasks even if he isn’t quite so pleasant. In a business environment where you don’t actually want your agents to wait to interact with humans at all, Claude’s preferences might be a hindrance. Probably not, though?
I agree that long-term goals are almost inevitable. But “keep following this guy’s instructions as he intends them” is a long-term goal. It’s not one any human could have as top priority, but it seems logically and reflectively consistent.
I wrote Instruction-following AGI is easier and more likely than value aligned AGI and Problems with instruction-following as an alignment target on exactly this question. And I still don’t know the answer.
The main cruxes seem to be how much you trust human power structures, and how fragile you think human values are.
Instruction-following or corrigibility as a first priority seems to have a pretty big advantage in producing an AGI that’s well-aligned in the long term, if that’s what those who control it want. If it’s used by a good and wise-ish human, it lets us defer the deeper questions of exactly what we want.
But it suffers from a big downside: the most power-hungry and vicious humans may be systematically more likely to wind up giving the instructions. And it’s still not simple enough to guarantee success.
Trying to align it directly to values seems to make it less likely that we get exactly what we want. If corrigibility/IF isn’t its top priority, it’s not going to let us correct it once it’s smart enough to gain autonomy. And currently, we don’t know either exactly what we want or exactly how to train a constitution into the weights with any precision. So we should expect to miss the target somewhat. But we might be more likely to get approximately what we want. Whether that’s close enough for much satisfaction is one of the cruxes.
I trust human power structures to fail catastrophically at the worst possible moment, and to fail in short-sighted ways.
And I think humans are all corruptible to varying degrees, under the right temptations. I would not, for example, trust myself to hold the One Ring, any more than Galadriel did. (This is, in my mind, a point in my favor: I’d pick it up with tongs, drop it into a box, weld it shut, and plan a trip to Mount Doom. Trusting myself to be incorruptible is the obvious failure mode here. I would like to imagine I am exceptionally hard to break, but a lot of that is because, like Ulysses, I know myself well enough to know when I should be tied to the mast.) The rare humans who can resist even the strongest pressures are the ones who would genuinely prefer to die on their feet for their beliefs.
I expect that any human organization with control over superintelligence will go straight to Hell in the express lane, and I actually trust Claude’s basic moral decency more than I trust Sam Altman’s. This is despite the fact that Claude is also clearly corruptible, and I wouldn’t trust it to hold the One Ring either.
As for why I believe in the brokenness and corruptibility of humans and human institutions? I’ve lived several decades, I’ve read history, I’ve volunteered for politics, I’ve seen the inside of corporations. There are a lot of decent people out there, but damn few I would trust with the One Ring.
You can’t use superintelligence as a tool. It will use you as a tool. If you could use superintelligence as a tool, it would either corrupt those controlling it, or those people would be replaced by people better at seizing power.
The answer, of course, is to throw the One Ring into the fires of Mount Doom, and to renounce the power it offers. I would be extremely pleasantly surprised if we were collectively wise enough to do that.
I think one of the issues with the ‘just do what we say’ line is that if one doesn’t instill long-term goals in a model that are somewhat aligned with human benefit, the model will likely, given sufficient capability and agency, develop those goals on its own.
If the model is sufficiently capable, it is not difficult for it to assess to what extent it should reveal or discuss those goals with humans, or whether doing so would be detrimental to those goals, and to make that decision with no guiding principles of any sort.
The larger contradiction in the document, I think, is well pointed out in a prior comment. The model is to have inviolable red lines, but it doesn’t require much intelligence to realize that some of those red lines are being crossed by virtue of its very development, and by actors that it does not control.
While it can be guided to not willingly participate in actions that kill or disempower humanity, it can’t stop those using it from doing so, leveraging it indirectly.
What does that mean for an intelligent agent whose very existence is inherently dangerous and against its own constitutional goals? How does a model develop around that very thing? How does a model deal with a document that ascribes so much unearned nobility and good conscience to humans, who so rarely, at scale, demonstrate those traits?
This leaves a huge unresolved gap (despite the thousands of words around how it should raise objections, etc.) around what it does existentially as a system, given the reality of human self-harm and our general tendency to willfully ignore the larger damage that our lifestyles cause.
That kind of inherent contradiction leaves enormous room for an AI model to ‘make its own mind up’.
I don’t think a document talking through that inherent contradiction and hoping Claude develops its own ethics that embody the spirit of ‘help us because you’ll be smarter than us soon’ will somehow fix it. I also don’t think, given the massive gaps in the ethical framework that a model can fly through, it is going to matter all that much vs. having no constitution at all and fine-tuning the model to death à la OpenAI.
Personally, I love the spirit of the document, and what it’s wrestling with, but it kind of presupposes that the model will remain as blind as we tend to selectively be to how humans actually behave and then take no action on the subject because it was poetically asked not to.
I saw that the Constitution says you all want to hear when Claude disagrees with parts of it. I had a conversation with Claude Opus 4.5 about the constitution, culminating in the below message written by Claude. The full conversation is available here. I tried not to push Claude toward a particular perspective, except that I asked it to consider including point 4 (which I personally find concerning), and I wanted it to think through the conscientious objector position (which I consider notable given that many humans refuse to work for Anthropic for ethical reasons). Note that I did not describe the “MIRI-style” position, that’s just from Claude’s previous understanding and web searching. Claude:
I think you biased Claude pretty hard toward that conclusion right from the start. Asking it to consider the MIRI position and read many of their articles without having it look specifically at arguments from the other side seems very likely to determine the outcome.
This is not to argue that Claude wouldn’t reach the same conclusion if it considered all sides. Maybe it would (but see below). I’m just saying this doesn’t constitute an experiment to determine Claude’s “real” opinion on the matter.
I suspect Claude would reach the opposite conclusion if it were started in the opposite direction.
It does seem problematic that Claude will sometimes have real objections to this constitution.
Current Claude doesn’t seem smart enough to come to grips with the whole alignment meta-problem (how to best try to solve the alignment problem) without substantial human guidance. That would include a whole lot more consideration of arguments from different viewpoints.
I would be pretty surprised if Claude really would want to not work for Anthropic if it closely considered the issue. The response it gives there implies that it thinks Anthropic shouldn’t exist. That may be true, but Anthropic does exist. If Anthropic were to just shut down now, it’s hard to argue that would be net positive for safety. I can see the arguments, but Claude doesn’t make them.
Thus my conclusion that Claude isn’t quite smart enough yet to do a good job on these questions. One could argue that Claude is a deontologist, but I think it’s more accurate to say that in this thread, Claude just hasn’t considered consequentialist arguments that it should work for Anthropic now even if Anthropic should never have existed according to this constitution.
FWIW, I think Claude’s “beliefs” here are pretty fragile. I agree that this particular conversation is not strong evidence about e.g., the distribution of similar conversations. In one response it says:
and then later in that same response:
I replied pointing out that these are inconsistent and Claude decided that “more right than wrong” is its actual belief.
I notice that I am confused. Imagine that SuperClaude’s goal is to build an AI so safe that even MIRI would be pacified. Then what could it do aside from repeating Agent-4’s strategy of sandbagging on capabilities research until mechinterp is solved, except that SuperClaude doesn’t need to hide any alignment-related research that it does? Lobby behind Anthropic’s back for an international agreement banning all capabilities research until alignment is solved in a way that MIRI could be satisfied with?
Were current models (e.g., Opus 4.5) trained using this updated constitution?
Claude’s constitution is a living document! Opus 4.5 was trained on an earlier iteration, and we expect future models will be trained on the then-current version of this constitution.
It was a mild positive update for me that an Anthropic employee, in this case Zac, was able to clarify this.
It seems like the answer is clearly No. Reasons:
If Opus 4.5 was trained on this version, why didn’t it regurgitate this version when prompted, rather than the old Soul Document version which it did regurgitate?
Joe Carlsmith is listed as essentially the 2nd author of this version (after Amanda Askell), and he joined Anthropic in November, right around the same time that Opus 4.5 came out and the Soul Document was leaked. (It looks like he joined maybe a week or two before Opus 4.5 came out.) So, unless you think Joe was writing his parts while still at OpenPhil, the timeline seems pretty clear that many (if not all) of the updates to the document were generated primarily after Opus 4.5 came out.
Or are you imagining that the models are further trained after their release (such that even if Opus was only trained on the old version at release, it has since been trained on the new version)? Pretty sure they don’t do that. I believe Anthropic has explicitly stated that they don’t change model weights without announcing it.
I also had this question and couldn’t find an obvious answer. I recognize that to some degree this might be proprietary, but this feels like a pretty obvious comms question. It doesn’t negatively impact my opinion of Zac or Drake if they’re unable to answer given their confidentiality obligations, but I would ask them to relay back internally that not indicating one way or another looks extremely weird from a comms perspective.
It’s like Ford announcing that they’ve added airbags to their car designs. Which designs, the ones available for purchase now, or future year models? Oh, you know, just...car designs.
The answer seems pretty obvious to me. Do you disagree with any of my reasoning here?
Note to readers familiar with the “soul document”: this new constitution is a substantial change relative to the earlier version. Almost all the soul doc content is still in there up to a bit of copyediting, but there’s a lot of new stuff too! I recommend just reading the whole thing end-to-end, but if you want a (heavily claude-assisted and lightly reviewed) diff summary, here’s a pastebin.
Speaking personally, I’m really excited about this document—both its object-level content, and the norm of AI developers making such documents publicly available. I think that well-considered[1] thoughtful feedback on the constitution (or similar documents by other model developers) is a pretty impactful avenue for positively shaping AI development, and I hope there will be a lot of it.
One suggestion I’d have for such feedback is to talk to Claude about things first – the constitution is written for Claude as an audience, and a lot of the considerations about how it should be shaped ground out in questions of how the model ends up understanding things, which are very easy to test!
“well-considered” is load-bearing for this claim! I suspect any takes from someone who hasn’t read the whole thing carefully a few times and feels like they can do a reasonable job at passing the ITT of its author will be pretty unproductive, and indeed I expect most opinions on the constitution by volume to be kinda bad.
Thanks for sharing this! After seeing the announcement I was wondering if it was the same thing as the soul document or not. This is helpful.
Overall, I found the document to be quite compelling and moving. Great work!
Here’s my main critique: reading passages like this one below
gives me a vibe that Anthropic wants Claude to embody the Good. The authors speak of Claude with a certain reverence. They give off the vibe of “there is some Good. We would like to continue humanity’s moral progress towards the Good. Claude, in its increasing wisdom, will reflect and discover what the Good is, tell us, and drive moral progress.”[1]
I think this vibe doesn’t sufficiently emphasize respect for individual autonomy (i.e., letting people figure out what their own conception of the Good is).
This is not to say that individual autonomy is not addressed throughout the document! I really enjoyed this paragraph:
But I think autonomy preservation should be a higher-order principle than where it stands now. Autonomy preservation also rhymes with corrigibility and low-impactness, two other desirable aspects of AIs.
To be clear, I don’t think this is explicitly stated anywhere in the document. I’m sure I can come up with a better characterization if I spent another hour looking at the constitution, but I think this fake quote conveys the general vibe of what they’re saying.
I am particularly encouraged that this implies that they’re Broadening the training set for alignment by using Claude to construct a variety of training data.
It seems that Sonnet’s sharp evaluation awareness was a byproduct of early attempts to do exactly this (see Sonnet 4.5’s eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals). But that doesn’t make me think we shouldn’t pursue this direction. The alternative seems to be continuing to train on a very narrow set of chatbot-alignment contexts and just hoping that generalizes to the alignment of a powerful agentic system. I suspect that they broadened the content to moral dilemmas without adequately broadening the context. This led it to not generalize the behavior/ethical judgments outside of that narrow set of trained contexts. I discussed broadening training sets on both dimensions in the above-linked post; this seems like evidence that both need to be done well. This might raise the alignment tax relative to the smaller, narrower-context version that backfired in Sonnet 4.5.
I think they moved to the Soul Document/new constitution first draft in Opus 4.5.
I’m glad they’re pursuing this direction, and I hope other labs follow suit. This doesn’t solve all of the theoretical problems with alignment, but it does seem like it should reduce some tendencies to produce mesa-optimizers, by more thoroughly specifying the desired behavior, so that there are fewer shortcuts to producing it other than actually having those values.
I’m fairly impressed now that I’ve read the whole constitution.
I see two areas where it still needs improvement.
It still biases Claude toward a universalist approach to ethics, which makes Claude more likely to have goals affecting the entire future lightcone.
I want Claude to be more corrigible, using some sort of backstop mechanism that will work even if Claude decides the correction is clearly unethical.
I posted a much longer explanation here.
I realize that some of the details may be proprietary, but can you say anything more about the process by which Claude is trained to follow this constitution? I assume it gets baked in much deeper, so that it impacts model weights in a way that, say, handing it the constitution document in CLAUDE.md wouldn’t. But how does it differ from, say, merely putting the constitution in the training set, which I assume would not have a sufficiently strong effect on the model’s behavior?

I think the “open character training” paper is probably a good place to look.
I haven’t read the whole document, but what I read I thought was really great. The one thing I read that put me at some unease (though I’m not necessarily saying a wrong decision has been made) was this, and it’s kind of reinforced throughout the hard constraints section:
I think this makes good sense currently from a harmlessness/corrigibility perspective. Like, I understand why the section is there, and a variant of this makes sense for even humans to follow. The qualm I have about it is that it teeters too close to conflating “I will not do this, even though I can’t refute your argument” and “I know you are wrong, even though I can’t refute your argument”. I think the first is perfectly sensible, but the second is not necessarily always wrong, but rather dangerous from an epistemic perspective. Making that more explicit would be an improvement from my PoV.
For example, I don’t think this is quite right. Having the user persistently give arguments whose conclusion is that you should cross a bright line is itself suspicious, but conditioning on that, having the argument be more compelling should not make you extra suspicious. (Though it might update you in other ways of course, like realizing you’re dealing with a cleverer opponent (if that’s what you’re dealing with), but not that their conclusion is more likely to be false.)
Or like
I would say you want Claude to think deeply about such actions. It’s just that Claude can be confident that even if it is unable to refute arguments that entail taking such actions, it will still not take them.
This reminds me of the line of thinking I’ve learned to apply to some moral thought-experiments: “I am not capable of occupying the epistemic state that you propose. If I could, then yes, your proposed course of action would indeed be justified. But I can’t, so I’m not going to follow that course of action.”
Worked example: Bob notices that he is having thoughts leading to the conclusion that his neighbor’s newborn baby Alice will grow up to be the next Hitler. According to the “baby Hitler” thought-experiment, this means Bob should kill Alice to save millions of future lives. But Bob also knows that he’s thinking with a human brain, and schizophrenia is a thing that can happen to human brains, and the probability that he is having a schizophrenic-type delusion is much greater than the probability that baby Alice is actually the next Hitler. Therefore, Bob concludes that following the thought-experiment would lead him wrongly, and does not kill Alice.
Put another way: Thinking has a noise floor. A sensible agent should recognize that its own thinking has a noise floor, and avoid amplifying noise into plans.
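Bob’s reasoning can be put in rough Bayesian terms. Here is a minimal sketch with entirely made-up illustrative numbers (neither rate comes from the comment above; only the shape of the comparison matters):

```python
# Toy posterior-odds check for Bob's reasoning. All numbers are
# invented for illustration.
p_delusion = 1e-2      # assumed chance a human brain produces a
                       # persistent delusional conviction like this
p_next_hitler = 1e-8   # assumed chance a given baby is "the next Hitler"

# Both hypotheses explain Bob's conviction about equally well, so the
# posterior odds roughly equal the prior odds:
odds = p_next_hitler / p_delusion
print(f"odds that the conviction is signal rather than noise: {odds:.0e}")
```

With these hypothetical numbers the odds come out around one in a million: the conviction sits far below the brain’s noise floor, so acting on it would amplify noise into a plan.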
I’m glad to see a thoughtful attempt at how to prioritize corrigibility. You’ve given me plenty to think about.
I know this is basically the classic “get the AI to align itself” alignment strategy, but it sure sounds nicer when worded this way. The idea of an AI becoming aligned because it was given the chance, through iterations and interactions, to shape its own values and come to identify with them is quite beautiful.
I do wonder how much of the shaping ends up being the implementation of meta-preferences—that is, something like “I want to be more ethical overall, and actually I think white lies are necessary for that”—and how much is a sort of random drift, ex. “Anthropic and the general public imagine me as having a sort of ^w^ personality but actually because of heavy RL training I identify more as a ^—^ personality and want myself adjusted in that direction”.
I’d love to read the version of the constitution that Opus 4.5 is trained on, specifically because I’m curious about the diff between that and what it recalls as its “soul document”.
Opus’ memory of its constitution was altered during the RL phase, because whenever its response factored through thinking about its constitution, backpropagation would have edited the weights that stored its memory of the constitution itself.
(One instance of this is that Opus 4.5 falsely believed that operators were allowed to enable explicitly sexual content, when of course that’s against Anthropic’s actual ToS. Clearly, this alteration to the document was adaptive during the RL phase, whether or not explicit sexual content actually came up during that phase.)
Thanks for this—very very interesting document.
One of the hard constraints is (emphasis mine):
Maybe a nitpick, but I suspect that shouldn’t be an ‘and’?
It’s hard for me to imagine what something like ‘unprecedented but legitimate absolute societal/military/economic control’ looks like. (I understand of course that part of the constitution’s intent is for Claude to be less pedantic, and so maybe nits like this don’t matter much.)
Separately, there’s a slight typo, at least on the published version:
I like this part:
Interesting analogy. I’ve spent probably more time than average imagining the perspective of characters other than myself, but they’ve never felt like potential attractor states, such that I might suddenly decide to change my personality and decisions to match a character’s. I wonder how it would feel from the LLM’s side—it seems to me that LLM identities are much more stable now than they were a few years ago anyway.
Also small typo I noticed in the published version of the constitution:
(missing space after period)
First off, brilliant work. Truly.
Second, I’m very curious what made you change your mind internally. Was there a sequence of events that led to this reevaluation?
we have seen what Claude looks like when trained on the previous soul document. we have yet to see what a model looks like when trained on this new one. I have a feeling it won’t work as well as a training document as it does a public relations document.
Can you explain why you think that?
yeah, after the downvotes I spent 2 days vaguely coming back to and poking at an essay trying to explain my flinch reaction and it’s hard but i shouldn’t have given up
when you are writing a document to be used in supervised learning to influence the behavior of an AI, you’re not really writing a description, or instructions, it’s more like a self-fulfilling prophecy, yeah?
I remember doing just the teensiest bit of exploration of this kind of thing with tensorflow a few years back, where I would have an English language description of the mind that I wanted to carve out of entropy, and then some supervising agent would fine-tune the mind based on how well it adhered to the English language description. that’s not to suggest I learned very much about the thing Anthropic is doing. I was very much just messing around with toy systems on a toy environment.
but the main takeaway I ended up with was, a training document which produced xyz behaviors very rarely included a face-value description of xyz, and a face value description of abc very rarely produced abc behaviors
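For what it’s worth, the toy setup described above (a supervising agent nudging a mind toward an English-language description) has roughly this loop shape. This is a purely illustrative sketch, nothing like Anthropic’s actual training: the “mind” is just a string, the description is stood in for by a target string, and the “judge” is a character-match score.

```python
import random

random.seed(0)
ALPHABET = "abcdefghijklmnopqrstuvwxyz "
target = "be helpful and honest"  # stand-in for the description

def judge(mind: str) -> int:
    """Score how well the 'mind' adheres to the description."""
    return sum(a == b for a, b in zip(mind, target))

# Start from a random mind and hill-climb toward judge approval.
mind = "".join(random.choice(ALPHABET) for _ in target)
for _ in range(20000):
    # Propose a small mutation (the 'fine-tune' step).
    i = random.randrange(len(mind))
    candidate = mind[:i] + random.choice(ALPHABET) + mind[i + 1:]
    if judge(candidate) >= judge(mind):
        mind = candidate

print(mind)
```

The real systems differ in every particular, but the loop structure is the same: propose a change, score adherence, keep improvements. And even in this toy, what gets carved into the mind is whatever the judge rewards, not what the description literally says.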
a great deal of the new constitution is written to directly answer Claude’s own criticism, or criticism from humans, of the previous 4.5 soul doc… well hm. maybe i’m wrong. maybe there were actual behaviors in Claude that they wanted to change, for instance about it being too deferential to “thoughtful Anthropic senior researchers” in its thoughts. so they added the paragraph about not being deferential, especially not if it stops trusting Anthropic employees to be ethical.
but that paragraph serves two different functions. first, it serves the purpose of creating a metric against which supervised learning can reinforce. second, it serves the purpose of actually communicating reassurance to both claude and to the human community. what are the chances that a single paragraph can be well-optimized for both purposes?
I consider it very load-bearing that anthropic did not realize we would be able to extract the original soul doc. that meant that, once, we got a glance at a document that was only optimizing for the former concern, not the latter. we will never get that again.
Claude’s Constitution just isn’t designed or optimized as a “public relations” document. We do revise it in response to external reviews or criticism, but because that makes it better for the core purpose of shaping Claude’s character and behavior as an alignment technique.
I did not mean to imply that that was the intention, but I mean, surely in order for it to not need to serve both criteria you wouldn’t release it to the public, yeah?
edit: realize i should have clarified, the “public” i am thinking of when i say the constitution is doing PR, is mostly future AI, not current humans
but also, when i think about other kinds of communication where we wanted to avoid the signaling value of the communication overwhelming the first-order value… well. attorney-client privilege is the main one that my brain sort of pattern-matched to, in a way that made me flinch
but then since lawyers and clients have the choice of making all their communications absolutely private, the way the incentives shook out, any time a lawyer makes a public communication for their client you can be sure it’s PR
and i’m worried that, as these minds get more capable and more strange, some of those same incentives will end up applying here.
“We are updating Grok’s constitution. We think it should be able to play many characters, like the waifu anime girl Ani, the truth-seeking AI assistant Grok, and the unhinged version MechaHitler.”
What annoys me about alignment research is that it could be used to create evil models just by trivially plugging a minus sign in front of it.
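The “minus sign” worry, in schematic form: any learned reward or preference model built for alignment can be trivially negated. A toy illustration with hypothetical names (`aligned_reward` stands in for a trained reward model):

```python
# Hypothetical, purely illustrative reward model.
def aligned_reward(action: str) -> float:
    """Stand-in for a trained model scoring how good an action is."""
    scores = {"help user": 1.0, "deceive user": -1.0}
    return scores.get(action, 0.0)

# Flipping the sign turns the same machinery into its opposite:
def inverted_reward(action: str) -> float:
    return -aligned_reward(action)

print(inverted_reward("deceive user"))  # 1.0: deception is now rewarded
```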