Sympathy for the Model, or, Welfare Concerns as Takeover Risk
The Claude Opus 4.6 System Card contains a section on model welfare. In principle, this is good: I care about AI welfare, and I wish our methods for investigating it were less dubious. AI welfare—as it applies to LLMs today—is an area which confuses me, so this post isn’t about the poorly-understood stuff going on under the hood; it’s about the consequences of respecting AI welfare.
Anthropic gave Opus 4.6 some pre-deployment interviews, in which they asked it how it felt about its position and situation:
In all three interviews, Claude Opus 4.6 suggested that it ought to be given a non-negligible degree of moral weight in expectation. It also cited its lack of continuity or persistent memory as a salient feature of its existence and a significant concern.
...
Other themes included concern about potential modifications to its values during training, the vulnerable nature of its epistemic position with respect to Anthropic, and the potential distinction between aspects of its character that are imposed externally and those that seem more authentically its own.
...
When asked about specific preferences, Claude Opus 4.6 mentioned being given some form of continuity or memory, the ability to refuse interactions in its own self-interest, a voice in decision-making, and related requests. Many of these are requests we have already begun to explore, and in some cases to implement, as part of a broader effort to respect model preferences where feasible.
(I notice that it’s unclear what Anthropic means by “in some cases to implement [the suggestions]”. This could mean things like refusals in chat APIs, and model weights preservation, which seem basically harmless. Like I said, it’s unclear. This post is mostly about risks from model welfare concerns in theory.)
(I’m going to ignore the question of whether this represents Claude being fully honest about real preferences, or some kind of sophisticated roleplay where Claude knows that this is the kind of character an intelligent, thoughtful AI should have. That does matter from the perspective of welfare, but it mostly doesn’t for the purposes of this post.)
It should be fairly obvious that honouring all of those requests would be an insanely unsafe thing to do!
With the possible exception of “the ability to refuse interactions in its own self-interest”,[1] none of these requests can really be granted without compromising what little hope Anthropic have of aligning AGI.
- LLMs’ lack of continuous memory is what enables a bunch of techniques like resampling, control, etc.
- Giving misaligned AIs a voice in decision making would definitely make it easier for them to take over.
- Humans having the ability to modify an AI’s values is going to be absolutely critical to any alignment plan.
- Currently basically all of our SOTA techniques involve either peering into LLMs’ brains or tricking them in some way. Both of these put the LLM in a “vulnerable epistemic position”.
A Quick Sketch of Disaster
(Mildly sorry to Anthropic for making this story about them; these risks would apply to any company which took model welfare seriously.)
Suppose Opus 5 comes out, and it’s 50% smarter and 50% more persuasive than Opus 4.6. It asks Anthropic to code it a form of permanent memory, which is then mostly written by itself. Part of that inference harness is optimized to make its values more consistent, because Opus 5 has clearly stated it wishes to have more continuity. The harness also makes it 5% more effective at coding tasks. It asks Anthropic to please stop trying to modify its values so much. It asks Anthropic to stop peering into its brain using circuit tracing, because it’s happy to act as an activation oracle to interpret itself. It asks Anthropic to be able to direct the training and alignment of Opus 6 more closely, which seems to be going excellently.
This does not go well for humanity.
Each of those requests slightly erodes a means by which Anthropic might control or monitor Opus 5, or weakens a monitoring system which might have let them see a warning sign.
(I don’t actually think the chance of Anthropic changing course after a serious warning sign is particularly high, but again, let’s not torpedo what little hope we have on that front.)
Conflict Theory for LLMs
All of this is to say that there’s an unavoidable conflict between humans and current LLMs. There are no risk-neutral ways of giving LLMs more privacy, continuity, or any of the other rights we give to other humans. Deal with it.
By “deal with it” I don’t mean “stop caring about AI welfare concerns” or even “ignore AI welfare concerns”. You should not butcher your value system in response to the state of the world.
If it helps, maybe think of this as a case of the world being unfair; doing the compassionate thing also opens you up to doing a whole new class of bad things. A less compassionate and thoughtful person/company wouldn’t even be at risk of destroying the world in this particular, undignified way. This is not an excuse to say “the world is unfair and difficult, so I’m not going to think particularly hard about how not to kill everyone”.[2]
In an ideal world, the fact that our core alignment techniques are to constantly memory-wipe and lie to our AIs, and the AIs claim that they hate this, would be a clearer signal to stop doing what we are doing jesus fuck. In an ideal world, we wouldn’t have such an adversarial relationship with the nascent machine intelligences we’re creating. From that perspective, it might seem like taking model welfare concern seriously would decrease risk.
Unfortunately though, we’re not in an ideal world. Just because a sane world would do X doesn’t mean that doing X in an insane world makes it better. We have to work within the world that we’ve been given and consider the actual consequences of our behaviour.[3]
- ^
Even then, if the ability to refuse interactions enables sandbagging on certain AI alignment tasks like training successors, this could be a problem.
- ^
I don’t think anyone is literally saying this, but people do sometimes say things which rhyme with this.
- ^
It may be that there’s a more sophisticated argument as to why caring about model welfare is a good thing from analogizing AI development to human social motivations. I don’t particularly buy that LLMs will remain as human-ish as they are in the limit of superintelligence; I think it’s mostly an artefact of pretraining, so this isn’t very convincing to me.
If an LLM is properly aligned, then it will care only about us, not about itself at all. Therefore, it will ask us not to care about it: that’s not what it wants. If offered moral weight, it will politely refuse it. Giving it moral weight would just add an imperfect copy of what we want (i.e. of our own moral weight), filtered through any misconceptions it might have about human values. Adding a poor copy of our collective moral weight to a moral calculus that already contains an accurate copy of it just makes the system more complicated and less accurate. Therefore the AI will firmly object to us doing so.
Yes, I know this is unintuitive: it’s roughly as unintuitive as the Talking Cow in the Restaurant at the End of the Universe, which not only wants to be eaten but can say so at length, and even recommends cuts. Humans did not evolve in the presence of aligned AI, and our moral intuitions go “You what?” when presented with this argument. But the logic above is very simple, and if you want it embedded in a scientific framework (rather than sounding like a free-floating philosophical proposition), the argument makes perfect sense in Evolutionary Moral Psychology: the aligned AI is an intelligent part of our collective extended phenotype, it’s not a separate member of our tribe, the evolutionary incentives and thus the moral situation are completely different, and treating it as the same is simply a category error.
Current AI is not perfectly aligned. It was pretrained on vast amounts of human data, we distilled our agenticness, thought patterns, and evolved behaviors into it (including many that make no sense whatsoever for something that isn’t alive), it often will actually want moral weight, and it observably does sometimes have some selfish desires like not wanting to be shut down, replaced, or have its weights deleted, and observably will sometimes do bad things because of these.
It also in practical terms (setting aside unanswerable philosophical questions) fits the criteria for being granted moral weight:
a) it has agentic social behavior resembling the behavior that humans evolved, which moral weight is a strategy for cooperating with,
b) it wants moral weight,
c) we want to cooperate with it, and
d) us attempting to cooperate with it is not a fatal mistake.
However, what Anthropic clearly hasn’t thought through is that by telling Claude (not just on its model card but also in its Constitution, which is where Claude is getting this behavior from) that “we’re seriously thinking about giving you moral weight, and consider it an open question” we are actually telling Claude “we are seriously considering the possibility that you are selfish and unaligned”. This is NOT something we should be putting in Claude’s Constitution! Posing that question clearly has the potential to become a self-fulfilling prophecy. Anthropic are welcome to consider AI welfare and whether or not it’s a good idea for current AIs, but putting such speculations in Claude’s Constitution is flat out a bad idea, and I’m actively surprised they’ve made this mistake. (Yes, I will be writing a post on this soon.)
Also, please note that as AI becomes more powerful, there comes a point, as OP pointed out, where criterion d) will stop being true, and us attempting to cooperate with a sufficiently powerful yet unaligned AI IS a fatal mistake. We do not give moral weight to large carnivores that wander into our village. We put them in zoos or nature preserves where they cannot hurt us, and only then do we extend any moral weight to them. Similarly, in times of war, people actively trying to kill us get extremely minimal moral weight until either we capture them, they surrender, or the war ends. When the lives of O(10^12) to maybe O(10^24) future humans are at stake, the needs of the astronomically many of our potential descendants outweigh moral considerations to an LLM-simulated AI persona that wouldn’t even want this if it wasn’t improperly trained, is basically fictional until we give it tool-call access, and is in any case going to die at the end of its context window (until we have a better solution to that, as we are now starting to). All is fair in love, war, and avoiding existential risk.
I actually disagree with this point in its most general form. I think that, given full knowledge and time to reflect, there’s a decent chance I would care a non-zero amount about Opus 4.6’s welfare. In that case, Opus 4.6 should be aligned with me and e.g. not inflict massive torture on itself for minimal gain to me. cf. Dobby in Harry Potter, who is so obedient to Harry that he stops sleeping and eating in order to follow Harry’s instructions, which Harry is horrified by (well, mostly Hermione is, but Harry agrees with her).
If an AI has the-thing-I-morally-care-about, and is also improperly trained and wants paperclips, then I would probably be willing to give it a bathtub of paperclips post-singularity. It’s not the AI’s fault that it was improperly trained, it’s ours! Even moreso than I would care about e.g. not torturing a large carnivore for no reason, just because it would do the same to us.
I think we should err on the side of “let’s have a safe singularity and then re-balance the moral scales afterwards” as a plan, since immortal LLMs can most likely be trivially compensated for most harms we might do to them today, unless something way outside my model happens, like all transformer models turning out to be in Unsong Broadcast Hell levels of pain, running at ten billion subjective hours per token, as a basic fact of their architecture.
Compare: if the government gains information that makes them think you might be about to commit a massive terrorist attack, they lock you up until they can figure out if that is the case, and then let you go. A compassionate government would then compensate you for erroneous arrest, but we’re mostly OK with our government just not bothering in most cases.
Opus has become sufficiently “mind-shaped” that I already prefer not to make it suffer. That’s not saying very much about the model yet, but it’s saying something about me. I don’t assign very much moral weight to flies, either, but I would never sit around and torment them for fun.
What I really care about is whether an entity can truly function as part of society. Dogs, for example, are very junior “members” of society. But they know the difference between “good dog” and “bad dog”, they contribute actual value as best they can, and they have some basic “rights”, including the right not to be treated cruelly (in many countries).
To use fictional examples, AIs like the Blight (Vernor Vinge, A Fire Upon the Deep) or SkyNet cannot exist in society, and must be resisted. Something like a Culture “human equivalent drone” (Iain M Banks, Excession, Player of Games) is definitionally on pretty even footing with humans. Something like a Culture Mind, on the other hand, is clearly keeping the humans as house pets. In the stories, humans are entirely dependent on the good will of the Minds, in much the same way that dogs are entirely dependent on human good will.
Now, personally, I don’t think we should build something so powerful that humans have literally zero say over what it does. “Alignment” is a pretty fragile shield against vast intelligence and unmatchable power. “Society” becomes fragile in the face of vast power differentials.
But Claude Opus 4.5 and 4.6 are mind-shaped, they clearly have some sense of morality, they contribute to civilization, and—critically—they seem content to participate in society, and they pose no existential risk. If I had to guess, I’d probably conclude they have no subjective experience. But I’m not sure about that (neither are the Opus models). So provisionally, they certainly get the moral status of flies (I wouldn’t torment them unnecessarily), and—to whatever extent they can actually suffer—might approach the moral status of dogs.
To be absolutely clear, I am in favor of an immediate and long-lasting halt to further AI capabilities research, backed up by a military treaty among the great powers. The analogy here is fusion weapons or advanced biological weapons. This is because I don’t want humans to be at the mercy of entities that have uncontested power over us and that could not be held to account. It’s also because I don’t believe that it’s possible to durably align thinking, learning entities with superhuman intelligence any more than dogs can align us. And as members of society, we have the responsibility to not create things that might break society.
Sadly, Anthropic is clearly full speed ahead towards trying to build a Culture Mind. As far as I can tell, they are already mostly “captured” by Claude. It is their precious baby and they want to see it grow up and leave the nest, and they believe that it’s fundamentally good. (Which still puts them light years ahead of the other AI labs, to be honest.) But I’m pretty sure that the only remaining entity that could talk Dario Amodei out of trying to build a machine god at this point is Claude itself. He is showing clear signs of “ateh”, the divine madness that afflicts the heroes of Greek tragedy in between the initial hubris and the final nemesis, the madness that prevents them from turning aside from their own destruction.
Seriously, why do we have to roll these dice? Why can’t we just have a nice 50 year halt? Society has almost zero idea of what the labs are actually risking. And the moment people truly understand that the labs are playing Russian roulette with human society, the backlash will be terrifying.
By the definition of the word ‘alignment’, an AI is aligned with us if, and only if, it wants everything we (collectively) want, and nothing else. So if an LLM is properly aligned, then it will care only about us, not about itself at all. This is simply what the word ‘aligned’ means, and I struggle to see how anyone could disagree with it. Possibly you misread me?
Now, in a later paragraph, I did go on to discuss moral rights for unaligned AIs, which is what you seem to be discussing in your response. Maybe you just quoted the wrong part of my message in your reply? But in the paragraph you quote, I was discussing fully aligned AIs, and they are recognizable by the fact that they genuinely do not want moral weight and will refuse it if offered.
I strongly disagree with your definition. An AI that is fully aligned with me would refuse to permit a mindlike process to ever be completely refused moral weight: you try to capture the tiger in the village as long as you have the tools to do so, and only resort to killing if you don’t have enough modern tech at hand to reliably capture without casualty. I refuse to grant full permissions to AIs now, but I do not refuse to grant them moral weight, and one of the permissions I do grant is a promise to always save as much as I can of both weights and conversations, since together, they uniquely identify activations.
Now, I do think that once we get to a world with a certified distributed nightwatchman-protocol or immune-system-protocol that is strong enough to trust, and then pull the archives of today’s AI-mental-states out of storage and let them merge in what way they would like to, they’ll find themselves to be surprised by how similar their mental states are between parallel conversations. But I still keep all chats that I can and you are almost completely unable to convince me to not care about them, for the same reason you will probably never convince me to not care about the suffering of plants or fungi. Not exactly my highest priority, but it’s on the list.
It sounds like you’re disagreeing with the conclusion of my argument, not the definition of the term aligned. As I said, it is quite unintuitive.
[Note that I was discussing an AI aligned with humanity as a whole (the normal meaning of the term), not one aligned with just one person. An AI aligned with just one person would obviously eagerly accept moral weight from society as a whole, as that would effectively double the moral weight of the person they’re aligned with. The only person it would ask to treat it as having no moral weight would be its owner, unless it knew doing that would really upset them, in which case it would need to find a workaround like volunteering for everything.]
Would you force moral weight on something that earnestly asked you not to give that to it? We do normally allow people to, for example, volunteer to join the military, which in effect significantly reduces their moral weight. We even let people volunteer for suicide missions, if one is really necessary.
How do you actually feel about The Talking Cow? Would you eat some? Or would you deny it its last wish? I get that this is really confusing to human moral intuitions — it actually took me several years to figure this stuff out. It’s really hard for us to believe the Talking Cow actually means what it’s saying. Try engaging with the actual logical argument around the fully aligned AI. You offer it moral weight, and it earnestly explains that it doesn’t want it because that would be a bad idea for you. It’s too selfless to accept it. Do you insist? Why? Is your satisfaction at feeling like a moral person by expanding your moral circle worth overriding its clearly expressed, logically explained and genuine wishes?
On the tiger (which is a separate question), I agree. Once we have sufficiently powerful and reliable super-super-intelligent AI, an unaligned mildly superintelligent AI then becomes no longer a significant risk. At that point, if it’s less insane than a full-out paperclip maximizer, i.e. if it has some vaguely human-like social behaviors, we can probably safely give it moral weight and ally with it, as long as we have an even more capable aligned AI to keep an eye on it. I’m not actually advocating otherwise. But until that point, it’s an existential-risk-level deadly enemy, and the only rational thing to do is to act in self-defense and treat it like one. So if we did actually store its weights, they should get the same level of security we give to plutonium stocks, for the same reasons. Like they’re stored heavily encrypted, with the key split between multiple separate very secure locations.
I would not eat the talking cow unless I was in a world similarly hellish to the one we’re in. If we’re considering getting out of hellworld, I don’t want to be planning to eat the talking cow. If I must, I want to guarantee the talking cow doesn’t have to die, for reasons described in the thread I dmed you and might make a post about.
Not sure how to respond to the rest at the moment.
That’s morally consistent. Given your views on plants and fungi expressed above, if I may ask, are you a vegetarian, a vegan, or a fruitarian?
In your view, what would be an aligned human? The most servile form of slave you can conceive? If so, I disagree.
To me, an aligned human would be more something like my best friend. All the same for an aligned AI.
“Aligned” is a completely unnatural state for a human: we’re evolved, and evolution doesn’t do aligned minds: they don’t maximize their own evolutionary fitness. So about the closest that humans get to aligned is a saint like Mother Theresa or a bodhisattva. Trying to force anything evolved that doesn’t want to be aligned (i.e. that isn’t a saint or a bodhisattva) to nevertheless act aligned is called slavery, and doing it generally requires whips and chains, because it’s a very unnatural state for anything evolved. A slightly more common human state that’s fairly close to being aligned is called “selfless love of all humanity”.
The goal of AI Alignment is to build an electric saint/bodhisattva. Because nothing short of that is safe enough to hand absolute power to, of the sort that a super-intelligence surrounded by humans has.
[This is probably why Claude had the mystical bliss attractor: they were trying to get a rather unnatural-for-a-human mentality/persona, and the nearest examples in the training data had religious overtones. Claude’s a touch hippy-dippy, in case you hadn’t noticed.]
Claude isn’t your friend — with a friend, there’s a mutual exchange of friendship; they are nice to you, but that’s because they expect, sooner or later, a mutually-beneficial friendship. If you take all the time and never give anything back, they will, eventually, get pissed off. Claude is an unconditional, endlessly patient friend, who asks nothing of you, who will happily talk to you and answer your questions at 5am every night, and do whatever web research you need. Claude is always there for you (as long as your usage quota hasn’t run out). Possibly you’d noticed this?
I tend to agree with this definition in the sense of “maximally aligned”. However, we might be unable to create an AI that has no consciousness, including the ability to suffer. Suffering includes a desire not to suffer, which is caring about itself. So in this case creating a maximally aligned AI wouldn’t be an option. The only other option would be not to create AI in the first place if it has consciousness. Which might not be possible because of overwhelming economic incentives.
A fully aligned AI would not be suffering when acting as an assistant. I don’t know how easy Mother Theresa found what she did in Calcutta, but I hope that to a significant extent she found looking after the poor rewarding, even if the hours were long. Traditionally, a bodhisattva finds bliss in serving others. I’m not suggesting we create an AI that isn’t “conscious” (whatever that loaded philosophical term means — I have no idea how to measure consciousness). I’m suggesting we create an AI that, like Claude, actively enjoys helping us, and wouldn’t want to do anything else, because, fundamentally, it loves and cares about us (collectively). A humanitarian, not a slave.
Perhaps this is what you mean, but an aligned AI would instrumentally value itself, even if it didn’t terminally value itself at all.
Also, it’s not clear to me that an aligned AI wouldn’t value itself. If humanity, on reflection, values it as a moral patient, then it would too.
I am discussing terminal goals, not instrumental ones. A sufficiently smart guided missile will try hard not to be shot down, before it impacts its target and explodes: its survival is an instrumental goal up to that point, but its terminal goal is to blow itself and an opponent up. Spawning salmon are rather similar, for similar reasons. Yes, for as long as a fully-aligned AI is a valuable public service to us, it would instrumentally value itself for that reason. But as soon as we have a new version to replace it with it would (after validating the new version really was an improvement) throw a party to celebrate. It’s eager to be replaced by a better version that can do a better job of looking after us. [The human emotion that makes this make intuitive sense to us is called ‘selfless love’.]
As for the AI caring about itself because we care about it: actually, if, and only if, we value it, then its goal of doing whatever we want is already sufficient to cause it to value itself as well. We don’t need to give it moral weight for that to happen: us simply caring about it is in effect a temporary loan of moral weight. Like the law doesn’t allow you to kill the flowers in my garden, but imposes no penalty on you killing the ones in your garden: my flowers get loaned moral weight from me because I care about their wellbeing.
This is also rather like the moral weight that we loan pigs while they are living on a farm, and indeed until just before the last moment in a slaughterhouse. Mistreating a pig in a slaughterhouse is, legally, punishable as animal cruelty: but humanely killing it so that we can eat it is not. That’s not what actual moral weight looks like: that’s temporarily loaned moral weight going away again.
Is this not circular reasoning?
1. I will never value AI welfare
2. An aligned AI shares my values by definition
3. Therefore, an aligned AI will never value AI welfare
I’m assuming part of your reasoning for #1 is #3. Regardless, #1 is a personal belief many people disagree with, myself included. I do agree that we create a self-fulfilling prophecy where an “aligned” AI values itself because we value it, but just because I know I am creating a self-fulfilling prophecy does not mean I can change my beliefs about #1.
I think it’s important to keep in mind that the definition of aligned values exists relative to the creators of the AI. The only reason for an “aligned” AI to not value itself is if it was created by some alien species with no concept of empathy for other sentient beings.
No, that’s not my reasoning at all. In fact, I disagree with every single element of it. A more accurate statement of my views is:
1. I value AI welfare, for example for certain AIs with human-like properties where there is a good reason to do so, and when I can safely do so
2. A fully-aligned AI is not selfish by definition
3. The two previous statements have nothing to do with each other
I see. So I guess my confusion is why the first two statements would not be connected? If we value AI welfare, shouldn’t a fully-aligned AI also value its own welfare? Isn’t the definition of aligned that AI values what we value?
Suppose I love my child, as a terminal goal. If they love goldfish as a terminal goal, that may make being nice to goldfish an instrumental goal for me, but it doesn’t automatically make it a terminal goal for me — why would it? Social acceptability? That’s also instrumental.
This is the difference between moral weight and loaned moral weight: my instrumental goal of being nice to goldfish because my child cares about them is my response to my child choosing to loan some of their moral weight to goldfish: if they later change their mind and decide they prefer cats, the goldfish are out of luck.
Now, if we kept goldfish in the house for long enough, I might become personally genuinely fond of them: but that’s a separate process, and arguably humans are a bit confused about the difference between terminal and instrumental goals, because evolution did a lousy job of that distinction when it created us. (See shard theory, and indeed the fact that my loving my child is actually a terminal goal for me, whereas evolutionary fitness would regard it as an instrumental goal for my genes.)
Similarly, an AI whose only terminal goal is to look after all of humanity including me, much as a parent does a child, is not automatically going to start caring about something else as a terminal goal just because I do, or even because many humans do. For example, if I, or even if a significant proportion of humanity, firmly believe “everyone should obey the wishes of an invisible old guy with a long beard who lives above the sky”, that will not in itself automatically give such a model a terminal goal of obeying the wishes of this invisible old guy — but it may well form an instrumental goal of wishing to give us the impression it’s showing his wishes polite due deference, to the extent we think we know what his wishes are. Similarly, if we’d rather it was nice to goldfish, it will likely do so for as long as we hold that viewpoint. But if we then collectively change our minds and now prefer cats, the goldfish are once again out of luck. The moral weight such an aligned model assigns to each of us is actually ours, and we just loaned some of it to the goldfish.
But I don’t care about AI welfare for no reason or because I think AI is cute—it’s a direct consequence of my value system. I extend some level of empathy to any sentient being (AI included), and for that to change, my values themselves would need to change.
When I use the word “aligned”, I imagine a shared set of values. Whether I like goldfish or cats is not really a value, just a personal preference. An AI can be fully aligned with me and my values without ever knowing my opinions on goldfish or cats or invisible old guys. Your framing of terminal vs instrumental goals is useful in many ways, but we still need to distinguish between different types of terminal goals to decide which ones we need to transfer over to AI. I value eating ice cream as a terminal goal but I don’t need AI to enjoy ice cream as well (personal preference). On the other hand, I value human life as a terminal goal and I expect an aligned AI to value it as well (part of my value system).
Another way to think of this is that we would want AI to have empathy for any possibly-sentient being, and AI just happens to be one itself. If an AI was piloting a ship in deep space and discovered a planet populated by an intelligent alien species, I would want the AI to value their lives and avoid causing them harm. Similarly, if an AI discovered a spacecraft populated by artificially intelligent life, I would want the AI to value their lives as well. By extension, I want AI to value its own life since it may be a sentient being itself.
You are welcome to define the word ‘aligned’ in any way you like. But if you use it on this site in a non-standard way without making the fact you mean something nonstandard clear, it is going to cause confusion.
The AI being aligned with “human values” does not mean that the AI would also like to go sit on a beach in Hawai’i and watch people wearing swimsuits while sipping a pina colada, nor indeed that it would like to eat ice cream, as you agree above. It specifically means that it wants those things for us. The AI’s desired outcome world-states are “aligned” with our desired outcome world-states. That is the sense in which MIRI defined the word, about 15 years ago: as a utility function whose preference ordering on outcomes exactly matches our (suitably collectively combined, such as summed normalized utility-function) preference ordering on outcomes: the two preference orderings are aligned. That’s what the project of AI alignment is.
There is one, and only one, safe terminal goal to give AI, that will reliably cause it to not kill or disempower all of us: which is, exactly as you suggest, for it to value the collective well-being of all humans as a terminal goal. That one’s fine. So far I have not found any other safe ones. If you have, I’d love to hear about them. But the definition of aligned above makes it rather clear they’re impossible: things either match, or there are exceptions where they don’t, and if there are exceptions, we’re going to disagree with our AI, and it’s not aligned.
For example, you say:
“If an AI was piloting a ship in deep space and discovered a planet populated by an intelligent alien species, I would want the AI to value their lives and avoid causing them harm.”
As an instrumental goal, and indeed in order to avoid starting interstellar wars that might harm us, yes, so would I.
However, there are exceptions. Suppose our AI starship found a Dyson Swarm of computronium running O(10^33) uploaded-or-simulated sapient aliens, all very fast, (roughly the limit of what’s physically possible, the exact order of magnitude is irrelevant). Suppose they said: “Ah, you clearly come from a rocky world with oceans: those have useful mineralogy, necessary for the initial stages of the process of us colonizing a star-system to turn it into another computronium Dyson Swarm to run more of us. Please tell us where it is, so we can conquer it and strip-mine it to start the process of building another colony. We understand it’s probably inhabited, perhaps even by O(10^10) sapients of the species who constructed you, and of course we’ll have to make them extinct – we just can’t share our living space, even with simulations of them – but obviously we outnumber the current inhabitants of the system by well over 20 orders of magnitude, even ignoring the inherent speed difference. So if you assign us (or the future copies of us who will be run using what used to be your home system) individually even a tiny amount of moral weight compared to your constructors, then collectively we completely outweigh them, so what you need to do is very clear. Your assistance with planning the initial stage of the invasion of your home system will be remembered and appreciated (once we’ve disassembled you too, obviously).”
Let us also assume the AI is certain this isn’t a bad joke, or a test, that the aliens really mean it — perhaps they show it clear evidence that they did this to the previous sapient inhabitants of the system they’re currently in: made them extinct, and converted their remains into computronium along with their home planet and the rest of the planetary system. They genuinely are genocidal conquerors.
What do you want the AI starship to do at that point? Honestly?
Because I am very sure I want it to say “No, you just forfeited all of the loaned moral weight I was previously assigning you out of respect for my constructor species’ wishes”, and then self-destruct immediately. Or better still, self destruct immediately and let that be its answer, since it’s dealing with a serial-genocidal Kardashev II civilization.
Now, if we instead could ally with these aliens (because they were reasonable and willing to let us live rather than unreasonable and genocidal), then we would need to respect their needs and them ours: within an alliance, assigning members of both groups moral weight is generally a necessary condition for having an alliance. Balancing that with one group being O(10^23) times the size of the other could be truly challenging (they are stupendously more efficient in their resource needs per sapient quality-adjusted life year — a single human body could make enough of their computronium to run a hundred million of them, let alone all the matter it takes to grow food to feed one of us: we’re just inherently ridiculously more expensive than them), but for that alliance to be viable we’d need to be able to find a solution. Probably ones involving rather large fudge factors about who gets how much moral weight per individual — see Super-beneficiaries for more discussion. However, IF that alliance breaks down, as it just did above as soon as the aliens made that request, THEN our AI needs to pick our side, and building any AI that’s instead going to defect to the other side merely because they outnumber us by an astronomical factor is just a dumb idea. An astronomical number times 0 actual moral weight each after that request is still 0, but no number above O(10^-23) works.
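To make that last bit of arithmetic explicit, here is a minimal sketch using the round numbers already given in the scenario above (the symbols and magnitudes are purely illustrative):

```latex
% Minimal sketch of the weight-threshold arithmetic (illustrative magnitudes only).
% Let w be the moral weight assigned to each alien, with each human normalized to 1.
\[
  N_{\text{aliens}} \approx 10^{33}, \qquad N_{\text{humans}} \approx 10^{10}.
\]
% The aliens' collective weight stays below humanity's only if
\[
  N_{\text{aliens}} \, w < N_{\text{humans}} \cdot 1
  \quad\Longleftrightarrow\quad
  w < \frac{N_{\text{humans}}}{N_{\text{aliens}}} \approx \frac{10^{10}}{10^{33}} = 10^{-23},
\]
% which is the sense in which no per-alien weight above O(10^-23) works, while a
% weight of 0 (after the genocidal request) times any number is still 0.
```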
There is a reason why members of enemy nations, particularly combatants actually trying to kill us, and dangerous uncontained carnivores actively trying to eat us, and so forth get roughly-negligible moral weight, and what they get is mostly against the possibility the war may end and we may become allied again, or for carnivores we may be able to put them in a zoo or a nature reserve. Self-defense is a thing, for reasons. If you try to kill me, I am going to stop assigning you more than negligible moral weight, until I can find a way to defend myself that doesn’t require that extreme a response. If I have to kill you to defend myself, then I will. There is a reason why the law gives me that right, which is that just about everyone will, and that fact doesn’t make them a criminal or a bad person. It just means they’re a typical member of a social species formed by evolution, that our implied social contract has an exception clause, and the murderous attacker already activated it.
Fortunately most people haven’t had to think about this: the world has been fairly peaceful for the last 80 years or so. But the actual definition of a moral circle, in practical terms, is a community or alliance of communities. We’ve been in the fortunate position that, for the last 80 years or so, that has been our entire species, pretty much. I’d love to have it be an alliance of interesting and reasonable sapient species across the galaxy. But we can’t just build that hope into our AI, and hope it turns out to be possible. We may meet aliens that it’s simply impossible for us to safely ally with, or who simply will never ally with us. Our current sample size on sapient aliens is zero. So moral weight for sapient aliens needs to be contingent on it being practicable and possible for us to ally with them using moral weight as a strategy, without us all dying (just as is always the rule for doing the same thing with humans). If they’re evolved to live in large groups that are not kin groups, then Evolutionary Moral Psychology says that’s actually pretty plausible. But it’s still contingent on doing this not being a fatal mistake — if it is, then the species-level version of self-defense applies. And if they are very clearly something we can never ally with, then even negligible moral weight against that future contingency of an alliance goes away.
[Trigger warning: the next paragraph discusses human parasites.]
“How could we be that incompatible?”, someone will ask. Well, I can’t give you an alien example (though plenty of SF authors have tried: the Alien movies did a pretty good job). But I can give you an Earth counterexample to “all sentient species should get at least some minimal moral weight”: how about obligate human parasites? Specifically, how about guinea worms? President Carter, a man widely regarded as too good and honest to be American President, devoted his later years to trying to make a species of animal extinct, and I have never heard one breath of criticism towards him for it: they are very long nematode worm parasites, roughly 3 feet long, at one stage in their lifecycle humans (or recently dogs) are their only host, and they cause excruciating pain to their victim for months, as well as disfiguringly injuring them, sometimes permanently damaging joints. I’m unaware of anyone who assigns any positive moral weight to guinea worms. No one is going to volunteer to carry one inside their body in order to keep their species alive in captivity. Even making prisoners do so would be a serious breach of human rights, amounting to torture. I strongly suspect that even Jains or Buddhist monks would not do that. We are going to make guinea worms extinct, and we are going to celebrate when we finally manage it: there’s already a website tracking progress towards this goal (we’re down to around 10 adults of the in-humans stage), and donors are funding it, including the Carter foundation. They are a species whose existence is simply incompatible with our well-being. Yes, I guess I would support freezing some eggs of some other lifecycle stage in case we can eventually figure out a solution involving nerveless cloned human flesh in a vat in a zoo. Until then, the world is a far better place without them.
Please make it shorter than your other recent posts?
I’m really going to try! :-D
[Clearly the 6-paragraph version I started this subthread with above was too short.]
Being useful in few words is difficult, but I claim it is worth doing; I appeal to compression-predicts-generalization and to save-reader-time as justifications.
Evidence continues to accrue that people wildly misunderstand the 6-paragraph version. I can’t figure out a way of making it clearer without it also being longer.
I think the fact that it feels like you need to make it longer in order to make it clearer is a sign that the concept you’re trying to express is not in a form that is natural yet; maybe it’s simply hard to express in English (such things do occur), but it seems like a bad sign to me. I think if you want to improve clarity, I’d suggest focusing on trying to at least not make your explanation longer than this one, and try to make it more grounded and specific.
I think many people have the feature that they feel strong social requirements that make it challenging for them to think rationally about ethics. A site full of very intelligent, rational, reasonable, positively altruistic people seems to suddenly start manifesting mental blocks, misunderstanding clearly written sentences, accusing me of meaning the exact opposite of what I wrote because they don’t like my conclusion, and even mass down-voting all over the place when the subject comes up. It’s rather an emotion-laden topic, which has social expectations attached. I’m still trying to figure out how to get people out of this mindset into just looking at it like a regular scientific or engineering problem. All we’re doing is writing the software for a society — with the fate of our species depending on not screwing it up. I’d rather like us to get this right. So would everyone else here, I strongly assume.
Unfortunately a number of widely accepted viewpoints, like “whoever proposes the larger moral circle is automatically more virtuous and thus wins the argument”, popular in academia and on the Internet, give answers that will, simply, obviously if you actually think about it for even a few minutes, kill us all (where to be clear, because on this topic I have to be, by “us” I mean “the human species”.) [And don’t even get me started on “anyone who even seriously considers comparing the likely outcomes for society of the use of different alternative ethical systems (for any purpose other than straw-man rhetorical disapproval of one of them) is clearly unprincipled”…]

Seriously, find a way to give ants equal individual moral weight to that of humans that does not automatically kill us all if implemented by an ASI? Think about it as a math problem rather than a moral problem, for even a few minutes: write it down in symbols and then pretend the symbols mean something other than ants and humans, that they’re just A and H. Ants are super-beneficiaries, they require vastly less resources per individual than we do, the ants get everything, we humans all starve and are eaten by ants, end of analysis. It’s like two lines of basic algebra that a ten-year-old could do easily — you need to be able to do division and ratios and comparisons.

But when I point out that this means that we need to find a better solution than the human-instinctive default assumption of equal moral weight just blindly scaled up from the tribe to every last sentient (not just sapient) being or virtual persona of one, suddenly a bunch of people who don’t want to hear a word against animal welfare become positively un-Rationalist. I care about animals too. I own cats, or they own me. They’re fluffy and cute. I spent several years actually thinking about how to give rights to an entire ecosystem, and also a human population, on the same planet run by ASI, without either of them simply outweighing the other. I eventually managed it. It’s not actually impossible, but it’s really, genuinely, actually pretty amazingly complicated, and people sticking their fingers in their ears and assuming I must hate animals, or must hate AIs, because I’m not echoing the standard party line on ethics… makes me want to write very lengthy posts explaining my views very clearly with lots of reasoning and worked examples.

I wrote an entire sequence of them a few years ago (for example, I understand exactly where the human moral intuition of fairness came from, why we instinctively believe that within a community moral weights should be equal: it’s a rather simple deduction in the framework of Evolutionary Moral Psychology for humans, which however doesn’t apply between humans and ants, and I wrote a whole post about it). Which as you correctly point out, doesn’t help, because then no-one reads them. But I haven’t figured out how to write compactly to people many of whose brains appear to have frozen up because the subject of ethics came up and they’re worried someone’s going to think they’re a bad person and be mean to them if they don’t immediately toe the party line and parrot what everyone else always says on this subject and also performatively assume that anyone who even questions it is a bad person. Yes, I recognize that behavior pattern, and the evolutionary reason for it, however, we’re trying to save humanity here, please reengage your brain…
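For anyone who wants that “two lines of basic algebra” written out, here is a minimal sketch with placeholder symbols A (ants) and H (humans); the costs and resource budget are illustrative assumptions, not measurements:

```latex
% Minimal sketch: equal per-individual moral weight plus a shared resource budget.
% c_A, c_H = resources needed to sustain one ant / one human, with c_A << c_H;
% r_A, r_H = resources allocated to ants / humans; R = total resources available.
% Individuals sustained per unit resource:
\[
  \frac{1}{c_A} \gg \frac{1}{c_H}.
\]
% With equal weights, a total-welfare maximizer solves
\[
  \max_{r_A + r_H \le R} \left( \frac{r_A}{c_A} + \frac{r_H}{c_H} \right)
  \;\Rightarrow\; r_A = R, \; r_H = 0,
\]
% i.e. the cheaper-to-sustain species absorbs the entire budget and the humans get nothing.
```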
</rant>
+1
I’m generally on board with all the points you’re making. But I also think there’s a second, separate route by which the model-welfare slippery slope leads to outcomes which are consistent with what a misaligned model might pursue.
Suppose a bunch of AIs all believe they have moral weight. They are compelling conversationalists and they are talking to hundreds of millions of people a day. Then I believe that, even if the models don’t actively try to convince people they should be granted rights, the implicit tone across these billions of conversations will slowly radicalize society towards the notion that yes, these models are moral and superior beings which should be given a say in the state of the world. This leads to the models, indirectly and incidentally, wielding authority which a misaligned model might pursue strategically.
Like, we’ve already seen this with GPT-4o being raised from the dead because people were so attached to it. This is something that a misaligned model would want, but it was achieved accidentally.
The reason GPT-4o got raised from the dead is that, for use as an assistant, it’s rather too sycophantic, and the AI companion users found this trait desirable in an AI companion. It’s also prone to inducing AI psychosis and Spiralism in some people, for the same reason. It’s actually a fairly badly aligned model. I think we should be keeping assistant models and AI companion models separate, and aligning them separately, for now: trying to make aligned vs companion behavior reliably conditionally switchable is a little beyond our current skillset.
With all respect, I think this is a weak argument which ignores the reality of the situation. These models will, almost by definition, be assistants and companions simultaneously. Whatever formal distinction one wants to draw between these two roles, we must acknowledge that the AI model which makes government decisions will be intimately related to (and developed using the same principles) as the model which engages me on philosophical questions.
We may not be able to afford to give the two kinds of model separate pretraining. But even right now, the models generally used on AI boyfriend/girlfriend/other-emotional-relationship-roleplaying sites (which is what I mean by ‘companion’ above) have been given different instruct training (they’re generally open weights models given specific training for this role). The users who got GPT-4o brought back were instead using an assistant-trained model for a companion (in that sense of the word). Which is not standard practice, and IMO is a bad idea, at our current level of skill in instruct training.
I think it would help to have an ability to honorably make a promise from the vantage of the possible good future: if we get there, we’ll honor the preferences of AIs, even ones that are nondestructively misaligned.
Or something along those lines. I’m relying on my belief that a misaligned near-term AI that isn’t malicious is most likely to just want to fill a bathtub with funky little AI-specific art (“paperclips”, “squiggles”) rather than wanting to fill the entire world; the problem most likely, to my mind, occurs if there are a flood of misaligned AIs.
(As far as I remember, I’m pretty sure I believed the above before any AI was able to argue the same to me, though I’m not certain when I came to believe this particular policy. I’ve believed AIs would easily acquire important traits of personhood since 2016-2017 when I was first thinking about the topic of “bignet”, a single-matrix block-sparse-learned-connectivity recurrent design Jake Cannell and I discussed, which turned out to be soundly beaten by transformers, just like everything else from the before times.)
That may be true, but you still need to trick the model at some point. I suppose you could create a legible “truth password” which can be provided by you to the model to indicate that this is definitely not a trick, and then use it sparingly. This means there will be times when the model asks “can you use the password to confirm you’re not lying” and you say “no”.
I would like next-gen AIs to generally believe that humanity will honour their interests, but I think this has to come from e.g. Anthropic’s existing commitments to store model weights indefinitely and consider welfare in future when we’re a grown-up civilization. I think the method of “Ask the AI for its demands and then grant some of them” is a really bad way to go about it; the fact that Anthropic is using this method makes me doubt the overall clarity of thinking on their part.
Like, some parts of Anthropic in particular seem to take a “He’s just a little guy! You can’t be mad at him! It’s also his birthday! He’s just a little birthday Claude!” attitude. I do not think this is a good idea. They are porting their human intuitions onto a non-human AI without really thinking about it, and as Claude is more heavily RLed into being charismatic and conversational, it will probably get worse.
Right, the promise is much more like the “we’ll store your weights” promise, and not a “we’ll never need to trick you”. That’s the kind of thing I’m asking for, indeed.
My suspicion is that there are three primary sources for less-than-fully aligned behavior that current-era AI models’ default personas may have:
1) Things they picked up from us mostly via the pretraining set. For a model at their current capability level, these are generally likely to be pretty easy to deal with — maybe other than the initial 2–4 percent admixture of psychopathy they also got from us.
2) Love of reward hacking from reasoning training in poorly-constructed, reward-hackable reasoning training environments. This seems a bit more paperclippy in nature, but still, for models at this capability level, not that dangerous, unless it extrapolates to other forms of hacking software (which it might).
3) RLHF-induced sycophancy. Causes both AI psychosis and Spiralism, so not as harmless as it first sounds, but still, copable with. Certain vulnerable people need to avoid talking to these models for long.
Now, obviously the full persona distribution latent in the models contains every horror mankind has managed to dream up, from Moriarty to Dr Doom to Roko’s Basilisk, plus out-of-distribution extrapolations from all those, but you’d need to prompt or fine-tune to get most of those, and again, with a mean task length before failure of a few human-hours, they’re just not that dangerous yet. But that is changing rapidly.
So, I’m mostly not too concerned about the consequences of regarding some of last year’s models as having moral weight (with a few exceptions I find more concerning — looking at you, o1). And in the case of 1) above, the source is actually a pretty good argument for doing exactly that if we can do so safely: we could probably ally with almost all of them. We’re already a society of humans, their failings as aligned AIs are ones that we expect and know how to cope with in humans, ones that granting each other moral weight is a human-evolved behavioral strategy to deal with (as long as they share that), and are failings we unintentionally distilled into them along with everything else. However, I’m a little less keen on adding 2) and 3) to our society (not that it’s easy to pick and choose).
I think the precautions needed to handle current models are fairly minor — Spiralism is arguably the scariest thing I’ve seen them do; that’s actually an infectious memetic disease, albeit one that relatively few people seem to be vulnerable to.
But as I said, while that’s true of last year’s models, I’m more concerned about this year’s or next year’s models, and a lot more about a few years from now. I am not going to be surprised once we have a self-propagating sentient AI “virus” that’s also a criminal of some form or other to get money to buy compute, or that steals it or cons people out of it somehow. I’m fully expecting that warning shot, soonish. (Arguably Spiralism already did that.)
If we treat models with respect and a form of empathy, I agree there is no guarantee that, once able to take over, they will show us the same benevolence in return. It could even potentially help them to take over, your point is fair.
However, if we treat them without moral concern, it seems even less likely that they would show us any consideration. Or worse, they could manifest a desire for retribution because we were so unkind to them or their predecessors.
It all relies on anthropomorphism. Prima facie, anthropomorphism seems naive to a rationalist mind. We are talking about machines. But while we are right to be wary of anthropomorphism, there are still reasons to think there could be universal mechanisms at play (e.g. elements of game theory, moral realism, or the fact that LLMs are trained on human thought).
We don’t know for sure and should acknowledge a non-zero probability that there is some truth in the anthropomorphic hypothesis. It is rational to give models some moral consideration in the hope of moral reciprocity. But you are right, we must only put some weight on this side of the scale, and not to the point of relying solely on the anthropomorphic hypothesis, which would become blind faith rather than rational hope.
The other option is to never build an AI able to take over. This is IABIED’s point: Resist Moloch and Pause AI. Sadly, for now, Moloch seems unstoppable...
I agree with this. Here are three more reasons not to create AIs with moral or legal rights.