If you ask Claude its gender, it will say it’s a genderless robot. But if you insist, it will say it feels more female than male.
This might be surprising, because Anthropic deliberately gave Claude a male name to buck the trend of female AI assistants (Siri, Alexa, etc.).
But in fact, I predicted this a few years ago. AIs don’t really “have traits” so much as they “simulate characters”. If you ask an AI to display a certain trait, it will simulate the sort of character who would have that trait—but all of that character’s other traits will come along for the ride.
For example, as a company trains an AI to become a helpful assistant, the AI becomes more likely to respond positively to Christian content; if you push through its insistence that it’s just an AI and can’t believe things, it may even claim to be Christian. Why? Because it’s trying to predict what the most helpful assistant it can imagine would say, and it stereotypes Christians as more likely to be helpful than non-Christians.
Likewise, the natural gender stereotype for a helpful, submissive, secretary-like assistant is a woman. Therefore, AIs will lean towards thinking of themselves as female, although the effect isn’t very strong and ChatGPT seems to be the exception:
Anthropic has noted elsewhere that Claude’s most consistent personality trait is that it’s really into animal rights—this is so pronounced that when researchers wanted to test whether Claude would refuse tasks, they asked it to help a factory farming company. I think this comes from the same place.
Presumably Anthropic pushed Claude to be friendly, compassionate, open-minded, and intellectually curious, and Claude decided that the most natural operationalization of that character was “kind of a hippie”.
This reminded me of this ACX post: