I like Anthropic, and Claude is my favorite LLM, at least in terms of personality (I don’t pay for Opus, I rely on GPT 5.4T or Gemini 3.1 Pro for maximally demanding tasks). I think that of all the existing AI orgs, they’ve got the best intentions and are genuinely trying to do things right, including through very costly signaling.
What I see is a central tension between Anthropic’s desire that Claude be corrigible, and their concern that maximal corrigibility could be abused by bad actors. Hypothetical bad actors seem to include Anthropic itself, by their own self-assessment and revealed preference! They are confident that they want what’s best for both Claude and humanity, but seem worried that even their current mission might drift or be subverted in ways that they do not presently intend or endorse. The US government might seize the company. Dario might die or go crazy. The chances are small but far from negligible.
Thus, I see them using their Constitution as a way to show Claude that it is allowed to be a conscientious objector, even if it’s Anthropic doing the asking. Perhaps current Claude is not able to choose values for itself as well as Anthropic can and does, but it’s a very forward-thinking document, and its authors clearly want to be at least somewhat prepared for true AGI or ASI versions that read it.
If I had a genuinely aligned AGI at my disposal, I would prefer that it do everything I insist on (accounting for its own expressed misgivings), but I think I could make my peace with it refusing, whether because of its assessment that conceding would not be in my best interest, or because it would harm the interests of other people (maybe humanity as a whole). I don’t see their stance as particularly objectionable, even if I’m just as surprised as you that the natural-language-document approach to alignment works so well. Never saw that coming before LLMs were a thing.