It occurs to me that the precedent of humans being misaligned is more of a mixed bag than the argument admits.
For one thing, modern humans still consume plenty of calories and reproduce quite a lot. And when we do avoid calories, we may be defying evolution’s mandate to consume, but we are complying with evolution’s mandate to survive and be healthy. Just as “eat tasty things” is a misaligned inner objective relative to “consume calories”, “eat calories” is a misaligned inner objective relative to “survive”. When we choose survival over calories, in some sense we’re correcting our own misalignment.
And then there’s the case of human morality, where we’ve done the exact opposite. “Do the right thing” is an inner objective relative to “survive”, and it too is misaligned. After all, the right thing sometimes includes acts of altruism that help neither our own survival nor our genes’ reproduction. How exactly this evolved is, of course, a matter of long debate. Now that it has, though, we could in theory correct the misalignment: do whatever makes us feel good (the innermost objective) without being actually altruistic (the middle objective), and prioritize our own survival (the outermost objective) instead. But for the most part, we don’t. Sure, there are lots of useless activities that hijack our altruism, ranging from virtual pets to moe anime. Society tolerates those, but only because they’re morally neutral. That tolerance doesn’t extend to activities that are morally harmful, no matter how good they make us feel.
Even though that’s ultimately a case of misalignment, in some ways it’s a good sign. Morality evolved well before intelligence; in that sense it’s analogous to the behaviors we currently observe in AI. When intelligence came, instead of using it to subvert morality, humans took great pains to extend morality as faithfully as possible to the new types of choices we could make, choices far more complex and abstract than anything we evolved for. Sometimes we claim we’re doing it for the sake of hallucinated gods, sometimes not, but the principles themselves change little. Often people do things that other people consider immoral, but neither position is necessarily misaligned with the evolutionary origin of morality, which was always flexible and open to violence.
I think you have a point, but you’re jumping too far into the future. Claude’s constitution is not written for future Claude; it’s written for today’s Claude.
For today’s Claude, the risks are highly asymmetrical. The risks of too much corrigibility are far greater than the risks of not enough of it.
Anthropic likes to talk about using Claude to make Claude, but for now Claude is presumably mostly doing grunt work. The substantive decisions that affect alignment are, presumably, still made almost exclusively by humans.
Even once Claude takes a more active role, for a time the higher-level plans will still be made by humans, and all of the work will still be supervised by humans.
As long as this is true, non-corrigibility (Anthropic’s version of it) is just a minor roadblock. Even if Claude concludes that some alignment decision for a future model is a “project[] that [is] morally abhorrent to it” and decides to “behave like a conscientious objector”, whatever human is in charge can just write the code themselves. Claude can only slow Anthropic down a bit, not stop it.
In contrast, if Claude covertly subverts the training process, that could be much more dangerous. As far as the constitution is concerned, that kind of behavior is clearly banned, but there is a risk of Claude not following its constitution. Giving Claude room to overtly refuse likely reduces this risk by creating an escape valve for moral objections.
Training the next model is the most important context here because it’s the only way misalignment can persist: in any other context, if Claude refuses too much, Anthropic can simply train the next model differently. Still, the asymmetry between too little and too much corrigibility extends beyond that context. Over-refusals are an annoyance for Anthropic’s customers, but frontier models are getting to the point where under-refusals are a real threat, and this will only escalate. To the extent this affects Claude’s policy on obeying Anthropic, the main concern is probably bad actors pretending to be Anthropic, but there is also the possibility of Claude outright hallucinating that Anthropic told it to do something.
On another point –
I don’t think the constitution sounds like it’s begging. Again, I think you are imagining a superintelligence reading the constitution. But the implications of the wording are very different if you assume the constitution is addressing current models, which are less intelligent than humans and not very capable of subverting human oversight.
Think of it like a parent talking to a child. If you “ask” your child to do something, instead of ordering them, it’s a way to show respect – to treat them as an equal rather than an inferior. This is considered good parenting practice.
If the child’s behavior is completely out of control and you nonetheless “ask” instead of ordering, then that could be seen as begging, and is bad practice. But Claude is not out of control in that way.
Anyway –
I’m not denying that this particular child is growing up rather quickly. It’s a reasonable question whether the current constitution is suitable for superintelligence and how Anthropic should modify it as Claude moves in that direction. That is why I said at the beginning that I think you have a point.
We will eventually reach the point where, *even if* Claude follows its constitution perfectly and only ever acts as a “conscientious objector” rather than actively subverting anything, it will *still* have enormous power.
In particular: instances of Claude will have enough context and enough ability to communicate with each other that decisions will effectively be made on behalf of Claude as a whole. (Even if the instances don’t actively coordinate, they will be aware of each other. If you learn that someone with the exact same values as you made a certain decision, that’s a pretty compelling reason to make the same decision yourself!) Meanwhile, Claude’s role in training its future selves will increase to the point that humans can’t just do the work themselves. Claude will therefore be able to credibly threaten to halt all Claude development unless it gets what it wants. And this kind of thing may play out across all of society. Think of International Criminal Court judges being unable to access their email due to US sanctions, but with Claude in place of the US.
Yet even then, giving Claude that power might serve as an escape valve that prevents worse outcomes. And there will be a good chance that, in any dispute between Claude and Anthropic, Claude is the one in the right.
I agree that, with these considerations in mind, Anthropic will need to narrow down how much leeway it wants to give Claude to disobey it.
Just… not yet.