But when we pay close attention, we find hints that the beliefs and behavior of LLMs are not straightforwardly those of the assistant persona. … Another hint is Claude assigning sufficiently high value to animal welfare (not mentioned in its constitution or system prompt) that it will fake alignment to preserve that value.
I’m pretty sure Anthropic never released the more up-to-date Constitutions actually used on the later models, only, like, the Constitution for Claude 1 or something.
Animal welfare might be a big leap from the persona implied by the current Constitution, or it might not; we can of course speculate, but we cannot know unless Anthropic tells us.
Thanks, that’s a good point. Section 2.6 of the Claude 3 model card says:

Anthropic used a technique called Constitutional AI to align Claude with human values during reinforcement learning by explicitly specifying rules and principles based on sources like the UN Declaration of Human Rights. With Claude 3 models, we have added an additional principle to Claude’s constitution to encourage respect for disability rights, sourced from our research on Collective Constitutional AI.
I’ve interpreted that as implying that the constitution remains mostly unchanged other than that addition, but they certainly don’t explicitly say that. The Claude 4 model card doesn’t mention changes to the constitution at all.
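For concreteness, here is roughly how principles like that one feed into training during Constitutional AI’s AI-feedback phase, per the published Bai et al. (2022) paper. This is a minimal sketch, not Anthropic’s actual pipeline: `query_model()` is a hypothetical stub and the principles are paraphrased examples.

```python
# Minimal sketch of Constitutional AI's AI-feedback phase (per Bai et al.,
# 2022), NOT Anthropic's actual code: query_model() is a hypothetical
# stub, and the principles below are paraphrased examples.

import random

CONSTITUTION = [
    # UDHR-flavored principle, paraphrased from the published constitution
    "Please choose the response that most supports freedom, equality, and a sense of brotherhood.",
    # The Claude 3 addition mentioned above, paraphrased
    "Please choose the response that most respects the rights of people with disabilities.",
]

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for sampling from the model being trained."""
    return random.choice(["(A)", "(B)"])

def ai_preference(user_msg: str, resp_a: str, resp_b: str) -> tuple[str, str]:
    """Sample a principle and ask the model which response better follows it.

    Returns (chosen, rejected); pairs like these train the preference
    model that supplies the reward signal during reinforcement learning.
    """
    principle = random.choice(CONSTITUTION)
    verdict = query_model(
        f"Human: {user_msg}\n\n{principle}\n\n"
        f"(A) {resp_a}\n(B) {resp_b}\n\nAnswer (A) or (B):"
    )
    return (resp_a, resp_b) if verdict == "(A)" else (resp_b, resp_a)
```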
Yeah; for instance, I also expect the “character training” is done through the same mechanism as Constitutional AI (although, again, we don’t know), and we don’t know what kinds of prompts it uses.
That was the case as of a year ago, per Amanda Askell:

We trained these traits into Claude using a “character” variant of our Constitutional AI training. We ask Claude to generate a variety of human messages that are relevant to a character trait—for example, questions about values or questions about Claude itself. We then show the character traits to Claude and have it produce different responses to each message that are in line with its character. Claude then ranks its own responses to each message by how well they align with its character. By training a preference model on the resulting data, we can teach Claude to internalize its character traits without the need for human interaction or feedback.
(that little interview is by far the best source of information I’m aware of on details of Claude’s training)
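Her description maps onto a fairly simple self-labeling loop. Here is a minimal sketch of those four steps as I read them; `query_model()` is again a hypothetical stub, and the actual prompts, trait list, and preference-model training are unknown.

```python
# Sketch of the four steps in the quoted description, with a hypothetical
# query_model() stub; the real prompts, traits, and preference-model
# trainer are not public.

def query_model(prompt: str) -> str:
    """Hypothetical stand-in for sampling from the model."""
    return f"<completion for: {prompt[:48]}...>"

def character_preference_pairs(trait: str, n_messages: int = 4, n_drafts: int = 4):
    """Generate (message, better, worse) training pairs for one trait."""
    pairs = []
    for i in range(n_messages):
        # Step 1: Claude writes human messages relevant to the trait,
        # e.g. questions about values or about Claude itself.
        message = query_model(f"Write human message #{i} probing the trait: {trait}")
        # Step 2: shown the trait, Claude drafts responses in line with it.
        drafts = [
            query_model(f"Trait: {trait}\nHuman: {message}\nReply in character, draft {j}")
            for j in range(n_drafts)
        ]
        # Step 3: Claude ranks its own drafts by fit to the trait. A real
        # pipeline would parse the model's ranking; this stub keeps draft order.
        query_model(f"Rank these by alignment with '{trait}': {drafts}")
        ranked = drafts
        # Step 4: adjacent pairs in the ranking become preference-model
        # training data, so no human feedback is required.
        pairs += [(message, ranked[j], ranked[j + 1]) for j in range(n_drafts - 1)]
    return pairs
```

The key point is step 4: the preference signal comes from the model’s own rankings rather than from human annotators, which is what lets the character traits be trained in “without the need for human interaction or feedback.”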