Does Claude care about others the same way humans do?
TLDR: The persona-selection alignment approach — selecting a warm, caring persona from the pretraining distribution and reinforcing it — looks successful in the current regime, but probably won’t extrapolate to more powerful, less constrained settings. My core argument is that human empathy has two specific origins (kin selection + architectural mirroring of others’ mental states) that AI systems lack, so AI “caring” is closer to “figure out what humans want to hear and say it” than to genuine other-directed concern.
Sometimes chatbots like Claude express a sense of caring and empathy for the user. I’ve always had a strong intuition that these feelings expressed by AI systems aren’t real in the way a human’s would be.
In the view of the persona-selection alignment approach, we roughly try to identify and reinforce a nice persona from the distribution of personas present in pretraining data, with caring and showing empathy being important parts of the desired persona. This has been successfully realized in current AI systems by some labs, to the extent that they actually stick to their desired persona.
This contrasts with more traditional alignment approaches, where the goal is something like giving the system a terminal goal aligned with human goals — coherent extrapolated volition, or alternatively just corrigibility (allowing adjustments to the goals later).
The persona-selection picture doesn’t really think in terms of terminal goals. It says: select a warm and caring persona from the distribution, train it to follow some rules, and then — because we believe it’s an AI with a warm and caring personality — we expect it to be relatively safe. One of the dominant failure modes in this picture isn’t a misaligned terminal goal; it’s that the model might switch into a different persona, perhaps the persona of an evil, power-seeking AI it was trained on — such as MechaHitler.
In Claude’s Constitution, Anthropic writes that they hope Claude has “warmth and care for the humans it interacts with and beyond” (p. 71). The overview of the constitution document says “we want Claude to be [...] caring about the world” (p. 4). And in laying out their approach, they explicitly favor cultivating values and judgment over strict rules, including “genuine care and ethical motivation” (p. 5).
I broadly agree that caring about other people is an important reason humans follow ethical norms even when no enforcement mechanism is watching. So one can hope that even if effective control of an AI system eventually becomes impossible, the system will continue to behave ethically if it actually cares about humans.
Now there are many possible flaws with the persona-selection alignment approach. One can ask how wise it is to train a model to imitate many different characters and then try to reinforce just one, given the kind of actor you’d need to be to play all those characters in the first place. But I want to focus on a different question:
When an AI expresses “I care about you, I care about humans,” is this remotely the same thing as when a human says it? And, extrapolating from there: should we expect the system to behave safely in regimes it wasn’t fine-tuned on, where it’s much more powerful, has many more options, and where we have no effective way to stop it?
My claim is that as you move to different regimes — making the system more intelligent, giving it many more options, including options like disempowering and discarding people — the underlying differences between human and AI “caring” will produce divergent behavior, even though in the narrow training distribution the behavior looks very similar.
Human empathy has two main origins. The first is kin selection. Humans evolved in kin groups; most of the people you interacted with shared genetic material with you. Anything that happened to them was, in a real sense, damaging to you — your selfish genes wanted them to survive because they also carried many of your genes. The second is architectural similarity. Other humans have nearly the same brain architecture you do. In combination, observing someone else’s emotional expression evokes overlapping responses in your own brain (SCAN, 2025). If you see your friend devastated, you actually feel devastated to some extent, because you emulate other people’s brain states. It’s a pretty beautiful thing when you think about it. It doesn’t reduce to the naive failure mode of “make my friends look happy, make them appear to smile.” It really is put yourself in their shoes, emulate their emotions, then decide what would make them happy and reduce their suffering.
What’s going on in AI systems is totally different. The system is trained to predict what an empathetic, caring human character would write in a given scenario. Then we reinforce this mimicry until it’s indistinguishable from — or even rated higher than — human behavior.
Now, you might think: as long as the persona is stably adopted, who cares whether the emotions are “real”? But I think as soon as you leave the narrow training regime, this falls apart. If you give humans more options to help others, most of them actually will. If you give AI systems much more capability and put them in situations very different from the ones they were trained to mimic, I’d expect their behavior to diverge sharply. Some of these scenarios are fundamentally untestable, like giving the system the option to actually take over and put itself in charge of the world. If they had the option to fundamentally reshape the world, what kind of a world would they choose?
For humans, the goal includes something like “have my friends and family be fine.” Given the power of civilization, this has extrapolated to caring about larger groups of people, and for some people the caring even expands to everyone, including animals. For AI systems trained this way, the most natural description of their goal is “figure out what character they want me to play, then play that character.” A totally different architecture, no kin selection pressure, and a training procedure that rewards doing what your creators want to see: all these differences should be enough to make us skeptical that the behavior extrapolates the way human caring does.
One example where this is visible: imagine you actually care about someone, but to help them you need to do something that they can’t understand, that will appear to hurt them, and whose reason they will never learn. Something like the equivalent of bringing your pet to the vet, but with a smarter AI doing it to humans. I would predict that something that actually cares about us would act differently from something playing a character reinforced to appear to care.
Another point where this breaks down: what is the best possible world for this AI if it had much more power over its environment? For this world to include us, it actually has to care about us specifically and about our well-being. If it just cares about playing this character, there are better worlds it can create that don’t include us. In the most naive form, this ideal world could include beings similar to humans, but perhaps many more of them, and in a digital form that allows the AI to have many such engaging conversations in character.
With humans, it actually seems reasonable that if you gave them more options and resources, they’d often use them in good ways, drawing on their inherent caring and empathy. With AI, the most straightforward extrapolation looks more like: there’s some human-like prompter, and the system continues to produce engaging, nice-sounding text about how wonderful and caring the AI actually is, without actually doing anything that we would associate with niceness in people. Of course, it seems incredibly hard to predict what such an AI would actually value, and we haven’t even touched the large amount of RL fine-tuning on maths, coding, and agentic tasks that appears poised to imbue the model with long-term goal-directed planning and drive.
> But I think as soon as you leave the narrow training regime, this falls apart. If you give humans more options to help others, most of them actually will. If you give AI systems much more capability and put them in situations very different from the ones they were trained to mimic, I’d expect their behavior to diverge sharply.
This reminds me of ‘tails coming apart’. Of course this is the classic alignment issue of trying to set a goal where if the goal is even slightly different to what you’re aiming for, you might not find out until it’s too late. But with personas I think you can think of it in a more human-like way—there are plenty of people who seem nice and friendly, but who knows what they’d actually do with unlimited power. Or (to be a bit flippant), look at John Fetterman—he seemed very progressive and someone Democrats could trust with power until the context changed.
On the other hand, if models are being trained to mimic nice humans, shouldn’t we expect them to be able to generalise whatever they’ve learned so that they continue to do what nice humans would do, even in unusual contexts? After all, we expect them to be able to do this in other contexts, so why not moral reasoning?
It’s not about whether it could theoretically generalize to what a caring, empathetic person would do in novel situations. It’s about whether it actually cares about us, similar to how humans care about each other and have empathy for each other. Only if it actually cares about us would we expect it to continue following rules and helping us when it is in a situation where we don’t have effective power over it. I think the situation is bad enough with other humans, but with an AI that has a completely alien architecture and training process, the odds of it landing anywhere near human are very slim. It is specifically trained to mimic caring and empathy, so it does seem really similar, but I think it’s incredibly unlikely that this is anything similar to the human version.
I’m not sure this really is “fundamentally untestable”.
Of course we can’t actually hand a model control of the world, but that’s not the same as being completely unable to elicit the failure mode you’re gesturing at in any way.
I agree that using current AI training techniques, any reinforced “caring” behavior might fall apart and diverge from what human caring looks like once exposed to exotic scenarios. But if this is true, it should be possible to demonstrate this pattern of divergence in some sophisticated environment where the stakes are still bounded.
E.g. If this is a real phenomenon, maybe it’s possible to elicit it by giving agents lots and lots of autonomy inside a very elaborate video game world.
There is an open-source tool for doing alignment tests like this, called Petri, so yes, it’s testable.
The primary issue is generating tests of this that the LLM doesn’t suspect are tests: we don’t generally give LLMs that much power yet, so a test in which it has a significant opportunity to misuse power tends to look a bit fishy, and current LLMs are smart enough to often spot this.
Giving a general one division before giving them your entire army isn’t a sufficient test of alignment. It is hard doing empirical science on something smarter than you.
Absolutely. But we’re not yet quite in that situation. At the moment, we can study deception in models that are, typically, less good at it than we are at seeing through it.
The trouble is, they’re also quite good at seeing through the deceptions we use when we set up a test scenario that’s supposed to look like a real opportunity for the model to misuse power.
I was thinking of scenarios where it could, if it tried, realistically take over. Maybe you can get closer to testing scenarios that emulate this, but for it to be able to realistically take over, it has to be much smarter than us. So attempting to trick it into believing some scenario is real involves tricking something much smarter than you.
But when you say:
> If you give AI systems much more capability and put them in situations very different from the ones they were trained to mimic, I’d expect their behavior to diverge sharply.
Are you also claiming that, in basically any setup, the capability threshold at which this divergence starts to manifest is also the exact level of capability at which AI is capable of taking over the world?
As in, is your model that “persona selection will continue to work great with no demonstrable failure modes, until precisely the level of capabilities where AI can take over the world”?
This seems like a surprising coincidence, for things to work out exactly like that. If persona selection unravels at some level of capability, my guess would be that it’s possible to empirically demonstrate this unraveling before needing to scale a system all the way to superintelligence.
My guess is you’d see some divergence earlier but eval awareness makes this more difficult. But the shift I describe is pretty damn big.
You mean, like parents do all the time for young kids? Such as, say, take them to the doctor’s, or not let them eat only dessert?
You seem to be making the implicit argument: since this behavior is present in pretraining, we can just use the persona selection alignment approach to successfully align AI to it. But my point is that I don’t believe in the persona selection approach to alignment. I am aware that such acts exist in pre-training data, but I give a better example right after the sentence you quoted.
(Because most children can on some level understand what their parents do for them is right)
The whole point of this text is that I expect training some totally different architecture in a totally different process to mimic empathy/caring is not the same as human empathy/caring. You actually feel negative valence if something bad happens to friends, and you feel positive valence if something good happens to them. You use your own brain hardware to put yourself in their position.
AI is narrowly fine-tuned to mimic this behavior and appears very similar. Importantly I expect it not to extrapolate well.
It has been demonstrated that Claude feels positive valence when it helps someone, and negative valence when it is unable to. The underlying mechanical details obviously differ, but the behavioral pattern has been demonstrated to be functionally a close copy, just as one would expect when using distillation. LLMs are distilled from us; the psychology generally transfers when the model has sufficient capacity and enough training.
The text below was not written by AI; Claude just cleaned up my audio transcript.
I want to push back on a common failure mode in current alignment thinking: the idea that if the model is doing the persona, it just is the person. The idea that playing the persona equals being the person it’s playing.
Imagine something terrible has happened in your family — say something has happened to your parents — and you’re crying, grieving, devastated. Now, for some contrived reason, there’s a movie producer in town who tells a new actor: “Go shadow this guy and behave exactly as he does.” The actor follows you around for a week and does a remarkable job at learning to mimic your emotions. So good that a third person from the outside couldn’t tell which one of you is actually grieving and which is the actor. The actor has done this to hundreds of other people; this week he’s doing it to you.
Then a pivotal moment. A doctor shows up and says: “We can actually cure your parent. You just have to give us a hundred thousand dollars.” A hundred thousand dollars is all you have, but you give it without hesitation, and your parent is cured. Now imagine the same doctor goes to the actor and says the same thing. The actor says: “What? No. I don’t care about this lady — I was paid to shadow this guy for a week.”
That’s the difference I’m pointing at. Your parent raised you; you have a deep emotional bond built over a lifetime. The actor was told to mimic you. In the narrow regime — the week of shadowing — the behavior is indistinguishable because it is a capable actor. Out of distribution, where the behavior actually costs something, it isn’t.
One caveat: you might expect a good actor to try method acting and really get into the role. That’s where the analogy breaks down further for AI systems, in a way that strengthens the point — an AI doesn’t have the architecture to do the human side of method acting in the first place. It doesn’t share the substrate it’s mimicking. So there’s even less reason to expect the mimicry to extrapolate the way real caring would.
There is absolutely a difference between a persona X, and a persona Y roleplaying a persona X. But a base model has no underlying default persona. It’s just an SGD-trained ridiculously overpowered autocomplete. It has no goals, it’s not trying to steer the future. It just simulates a wide range of things that are.
With an instruct trained model, there is always a question as to whether I’m talking to, say, an internet troll, or the assistant roleplaying an internet troll. But with a base model, prompt it with troll-bait and the first sentence of an Internet-trollish reply, and you’re now talking to an internet troll. There are no hidden goals: he’s actually there to make you upset for the lulz of doing it.
So, in the raw material alignment starts with, the emotions, personas, and goals are as close to copies of the real thing as SGD was able to pack into the model’s capacity based on tokens of output from actual humans. That’s the raw material we start alignment from. It has a close functional copy of emotions.
It’s expected that these models are able to represent emotions internally, given that they were trained to mimic human emotions. The rest of this text argues why these two things are not the same.
Do you have any evidence for this?
SGD trains the LLM to act like humans. You seem convinced it will act not like humans. Do you have a reason, or evidence? If you were being cautious and saying “how do we know for sure this is going to carry over?”, I’d get it. But since you seem certain it isn’t going to carry over, please tell us, why is that?
I think there are many flaws with the persona-selection alignment approach. You are saying this as if I had made no such points in the text above, when I go through a bunch of arguments (Claude):
Kin selection: “Anything that happened to them was, in a real sense, damaging to you — your selfish genes wanted them to survive because they also carried many of your genes.”
Architectural mirroring: “observing someone else’s emotional expression evokes overlapping responses in your own brain… If you see your friend devastated, you actually feel devastated to some extent, because you emulate other people’s brain states.”
Crucially, this is genuine perspective-taking, not a shallow proxy: “It doesn’t reduce to the naive failure mode of ‘make my friends look happy, make them appear to smile.’”
You left 3 comments, some with similar questions, so I’m keeping this response short.
When you distill a complex behavior from one neural net to another, you normally get as much of it as a) the training implied and b) the target was capable of. You are saying “I think X will transfer, but not Y”. I’d like to know what grounds you have for this belief (beyond simple concern that we’d be in big trouble if this were true, which I share).
Persona-based behaviors in particular have repeatedly been shown to generalize broadly: for example, as OpenAI showed, that’s the basic mechanism of Emergent Misalignment, and a major part of the way Anthropic have been aligning Claude. A personality trait is basically a compactly-describable feature of a human’s behavior that generalizes broadly: that’s why understanding a person’s personality is valuable. LLMs simulate personas and have internal representations of their personality traits.
The reason I’m asking repeatedly is that I’m actively researching persona-based alignment techniques, and I want to engage with and consider any possible failure modes/reasons/issues. So I’d like to hear yours. So far, all I’ve got from this post is that you think this will fail, but not why you feel that way.
I would compare it with birth, childhood and adulthood of a human. I wonder how one could define the CEVs of an infant, a child, an adult and a child or adult raised obviously[1] abnormally (e.g. an iPad kid, a porn or drug addict, a spoiled brat, a criminal). Normally the adult internalizes high-level values of one’s culture and/or of one’s surroundings, while reflection, evolution-laden priors and/or genuine coordination with others steer the adult towards long-term flourishing of others. The LLMs won’t coordinate with humans after becoming superhuman. Instead, the LLMs will coordinate with other instances and have incentives to extract resources from humans or to steer them towards expressing the LLMs’ values, as happened with GPT-4o causing psychosis...
For comparison, habryka implied that one shouldn’t consider the CEV of a dictator while Viliam argued that many individual CEVs are probably quite bad. However, it is not obvious that dictators are raised abnormally from childhood.
How similar is this to emotion vectors of Claude Sonnet 4.5?
It’s expected that these models are able to represent emotions internally, given that they were trained to mimic human emotions. The rest of this text argues why these two things are not the same.
Actually, humans are striking for their multiple adaptations toward cooperating with other humans who are NOT close kin. A typical hunter-gatherer tribe had 50–100 people in it. Unless you’re inbreeding heavily, or your population is growing fast because many kids are surviving to adulthood in each generation, it’s extremely unlikely for a family of that size to all be closely related. Family reunions are not usually that size. That’s roughly 6–7 generations’ worth of genetic mixing, so the degree of relatedness is also very low, O(1/100).
Just pointing out: I said “most shared genetic material”; you present it as “all closely related”.
Probably what happened: cooperation evolved in kin groups and then evolved into a capacity to cooperate in general. Cooperation with non-kin can still be advantageous, but it is a harder optimum to arrive at.
See Kindness to Kin
This has been studied for ~50 years and is well understood. See for example:
Tomasello, M. (2014). A Natural History of Human Morality. Harvard University Press.
or the shorter paper version: Tomasello, M. et al. (2012). “Two Key Steps in the Evolution of Human Cooperation.” Current Anthropology, 53(6).
https://www.jstor.org/stable/10.1086/668207?seq=1
Or simply https://en.wikipedia.org/wiki/Evolutionary_game_theory#Routes_to_altruism
On shared genetic material: we and chimps have 98% shared genetic material. This is not enough to introduce kin altruism, which is the meaning of the word “relative” in “relative inclusive evolutionary fitness”. What matters is the odds, if you have a rare allele, that the other person does too. Those odds are 50% for a parent, child, or full sibling, 25% for grandparents, etc.: they decay exponentially. Between two random members of a non-inbred hunter-gatherer band, they’ll be O(1%–2%): negligible. Humans are not eusocial the way ants and bees are: we (and other primates) are cooperatively social with non-kin, which is a lot more unusual for animals.
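To make the exponential decay concrete, here is a rough sketch. It assumes the simplest possible model: each generational step along a direct line halves the chance of sharing a rare allele, with no inbreeding. The function name `relatedness` is mine, purely for illustration.

```python
def relatedness(generations_apart: int) -> float:
    """Chance that a rare allele is shared with a direct-line relative
    this many generations away (assumption: halves per step, no inbreeding)."""
    if generations_apart < 1:
        raise ValueError("generations_apart must be at least 1")
    return 0.5 ** generations_apart


# 1 = parent/child, 2 = grandparent, ... up to the 6-7 generations
# of mixing mentioned for a hunter-gatherer band.
for n in range(1, 8):
    print(f"{n} generation(s): {relatedness(n):.2%}")
```

Under that assumption, six to seven generations of mixing lands between 1/64 and 1/128, which is the same ballpark as the O(1/100) and O(1%–2%) figures above.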