Suppose I love my child, as a terminal goal. If they love goldfish as a terminal goal, that may make being nice to goldfish an instrumental goal for me, but it doesn’t automatically make it a terminal goal for me — why would it? Social acceptability? That’s also instrumental.
This is the difference between moral weight and loaned moral weight: my instrumental goal of being nice to goldfish because my child cares about them is my response to my child choosing to loan some of their moral weight to the goldfish. If they later change their mind and decide they prefer cats, the goldfish are out of luck.
Now, if we kept goldfish in the house for long enough, I might become personally, genuinely fond of them; but that’s a separate process, and arguably humans are a bit confused about the difference between terminal and instrumental goals, because evolution did a lousy job of that distinction when it created us. (See shard theory, and indeed the fact that my loving my child is actually a terminal goal for me, whereas evolutionary fitness would regard it as an instrumental goal for my genes.)
Similarly, an AI whose only terminal goal is to look after all of humanity including me, much as a parent does a child, is not automatically going to start caring about something else as a terminal goal just because I do, or even because many humans do. For example, if I, or even a significant proportion of humanity, firmly believe “everyone should obey the wishes of an invisible old guy with a long beard who lives above the sky”, that will not in itself automatically give such a model a terminal goal of obeying the wishes of this invisible old guy — but it may well form an instrumental goal of giving us the impression it’s showing his wishes due polite deference, to the extent we think we know what his wishes are. Similarly, if we’d rather it was nice to goldfish, it will likely be nice to goldfish for as long as we hold that viewpoint. But if we then collectively change our minds and prefer cats, the goldfish are once again out of luck. The moral weight such an aligned model assigns to each of us is actually ours, and we just loaned some of it to the goldfish.
But I don’t care about AI welfare for no reason or because I think AI is cute—it’s a direct consequence of my value system. I extend some level of empathy to any sentient being (AI included), and for that to change, my values themselves would need to change.
When I use the word “aligned”, I imagine a shared set of values. Whether I like goldfish or cats is not really a matter of values; those are just personal preferences. An AI can be fully aligned with me and my values without ever knowing my opinions on goldfish or cats or invisible old guys. Your framing of terminal vs instrumental goals is useful in many ways, but we still need to distinguish between different types of terminal goals to decide which ones we need to transfer over to AI. I value eating ice cream as a terminal goal, but I don’t need AI to enjoy ice cream as well (personal preference). On the other hand, I value human life as a terminal goal, and I expect an aligned AI to value it as well (part of my value system).
Another way to think of this is that we would want AI to have empathy for any possibly-sentient being, and AI just happens to be one itself. If an AI was piloting a ship in deep space and discovered a planet populated by an intelligent alien species, I would want the AI to value their lives and avoid causing them harm. Similarly, if an AI discovered a spacecraft populated by artificially intelligent life, I would want the AI to value their lives as well. By extension, I want AI to value its own life since it may be a sentient being itself.
You are welcome to define the word ‘aligned’ in any way you like. But if you use it on this site in a nonstandard way without making it clear that you mean something nonstandard, it is going to cause confusion.
The AI being aligned with “human values” does not mean that the AI would also like to go sit on a beach in Hawai’i and watch people wearing swimsuits while sipping a piña colada, nor indeed that it would like to eat ice cream, as you agree above. It specifically means that it wants those things for us. The AI’s desired outcome world-states are “aligned” with our desired outcome world-states. That is the sense in which MIRI defined the word, about 15 years ago: as a utility function whose preference ordering on outcomes exactly matches our (suitably collectively combined, e.g. summed normalized utility functions) preference ordering on outcomes: the two preference orderings are aligned. That’s what the project of AI alignment is.
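To make that definition concrete, here is a minimal sketch of one way to write it down (the normalization scheme is an illustrative choice on my part, not MIRI’s canonical formulation). Given human utility functions $u_1, \dots, u_n$ over outcome world-states, form a combined utility by summing after rescaling each $u_i$ to $[0,1]$:

$$U_H(o) \;=\; \sum_{i=1}^{n} \frac{u_i(o) - \min_{o'} u_i(o')}{\max_{o'} u_i(o') - \min_{o'} u_i(o')}$$

Then the AI, with utility function $U_{AI}$, is aligned exactly when the two preference orderings agree on every pair of outcomes:

$$\forall\, o_1, o_2: \quad U_{AI}(o_1) \ge U_{AI}(o_2) \;\Longleftrightarrow\; U_H(o_1) \ge U_H(o_2)$$

Nothing in this requires the AI to share our object-level tastes in ice cream or beaches; it only requires that its ranking of world-states track ours.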
There is one, and only one, safe terminal goal to give an AI that will reliably cause it not to kill or disempower all of us, which is, exactly as you suggest, for it to value the collective well-being of all humans as a terminal goal. That one’s fine. So far I have not found any other safe ones. If you have, I’d love to hear about them. But the definition of aligned above makes it rather clear they’re impossible: either the preference orderings match, or there are exceptions where they don’t, and if there are exceptions, we’re going to disagree with our AI, and it’s not aligned.
For example, you say:
“If an AI was piloting a ship in deep space and discovered a planet populated by an intelligent alien species, I would want the AI to value their lives and avoid causing them harm.”
As an instrumental goal, and indeed in order to avoid starting interstellar wars that might harm us, yes, so would I.
However, there are exceptions. Suppose our AI starship found a Dyson Swarm of computronium running $O(10^{33})$ uploaded-or-simulated sapient aliens, all running very fast (roughly the limit of what’s physically possible; the exact order of magnitude is irrelevant). Suppose they said: “Ah, you clearly come from a rocky world with oceans: those have useful mineralogy, necessary for the initial stages of the process of us colonizing a star-system to turn it into another computronium Dyson Swarm to run more of us. Please tell us where it is, so we can conquer it and strip-mine it to start the process of building another colony. We understand it’s probably inhabited, perhaps even by $O(10^{10})$ sapients of the species who constructed you, and of course we’ll have to make them extinct – we just can’t share our living space, even with simulations of them – but obviously we outnumber the current inhabitants of the system by well over 20 orders of magnitude, even ignoring the inherent speed difference. So if you assign us (or the future copies of us who will be run using what used to be your home system) individually even a tiny amount of moral weight compared to your constructors, then collectively we completely outweigh them, so what you need to do is very clear. Your assistance with planning the initial stage of the invasion of your home system will be remembered and appreciated (once we’ve disassembled you too, obviously).”
Let us also assume the AI is certain this isn’t a bad joke or a test, and that the aliens really mean it — perhaps they show it clear evidence that they did this to the previous sapient inhabitants of the system they’re currently in: made them extinct, and converted their remains into computronium along with their home planet and the rest of the planetary system. They genuinely are genocidal conquerors.
What do you want the AI starship to do at that point? Honestly?
Because I am very sure I want it to say “No, you just forfeited all of the loaned moral weight I was previously assigning you out of respect for my constructor species’ wishes”, and then self-destruct immediately. Or better still, self-destruct immediately and let that be its answer, since it’s dealing with a serial-genocidal Kardashev II civilization.
Now, if we instead could ally with these aliens (because they were reasonable and willing to let us live rather than unreasonable and genocidal), then we would need to respect their needs and they ours: within an alliance, assigning members of both groups moral weight is generally a necessary condition for having an alliance. Balancing that with one group being $O(10^{23})$ times the size of the other could be truly challenging (they are stupendously more efficient in their resource needs per sapient quality-adjusted life year — a single human body could make enough of their computronium to run a hundred million of them, let alone all the matter it takes to grow food to feed one of us: we’re just inherently, ridiculously more expensive than they are), but for that alliance to be viable we’d need to be able to find a solution. Probably one involving rather large fudge factors about who gets how much moral weight per individual — see Super-beneficiaries for more discussion. However, IF that alliance breaks down, as it just did above as soon as the aliens made that request, THEN our AI needs to pick our side, and building any AI that’s instead going to defect to the other side merely because they outnumber us by an astronomical factor is just a dumb idea. An astronomical number times 0 actual moral weight each after that request is still 0, but no fixed number above $O(10^{-23})$ works.
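To spell out the arithmetic behind that last sentence (a rough sketch using the illustrative orders of magnitude above): if each of the $\sim 10^{33}$ aliens were assigned some fixed per-individual moral weight $w$, with each of the $\sim 10^{10}$ humans at weight $1$, then the aliens’ aggregate claim dominates ours whenever

$$10^{33}\, w \;>\; 10^{10} \quad\Longleftrightarrow\quad w \;>\; 10^{-23}$$

So any fixed positive weight meaningfully above $10^{-23}$ concedes the genocidal aggregate argument, while weights at or below that scale are effectively zero anyway. That is why the weight has to be loaned and contingent, dropping to exactly zero the moment an alliance becomes impossible, rather than being any small positive constant.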
There is a reason why members of enemy nations, particularly combatants actually trying to kill us, and dangerous uncontained carnivores actively trying to eat us, and so forth get roughly negligible moral weight, and what little they get is mostly held against the possibility that the war may end and we may become allied again, or, for carnivores, that we may be able to put them in a zoo or a nature reserve. Self-defense is a thing, for reasons. If you try to kill me, I am going to stop assigning you more than negligible moral weight, until I can find a way to defend myself that doesn’t require that extreme a response. If I have to kill you to defend myself, then I will. There is a reason why the law gives me that right, which is that just about everyone would do the same, and that fact doesn’t make them a criminal or a bad person. It just means they’re a typical member of a social species formed by evolution, that our implied social contract has an exception clause, and that the murderous attacker already activated it.
Fortunately most people haven’t had to think about this: the world has been fairly peaceful for the last 80 years or so. But the actual definition of a moral circle, in practical terms, is a community or alliance of communities. We’ve been in the fortunate position that, for the last 80 years or so, that has been pretty much our entire species. I’d love to have it be an alliance of interesting and reasonable sapient species across the galaxy. But we can’t just build that hope into our AI, and hope it turns out to be possible. We may meet aliens that it’s simply impossible for us to safely ally with, or who simply will never ally with us. Our current sample size on sapient aliens is zero. So moral weight for sapient aliens needs to be contingent on it being practicable and possible for us to ally with them using moral weight as a strategy, without us all dying (just as is always the rule for doing the same thing with humans). If they evolved to live in large groups that are not kin groups, then Evolutionary Moral Psychology says that’s actually pretty plausible. But it’s still contingent on doing this not being a fatal mistake — if it is, then the species-level version of self-defense applies. And if they are very clearly something we can never ally with, then even the negligible moral weight held against that future contingency of an alliance goes away.
[Trigger warning: the next paragraph discusses human parasites.]
“How could we be that incompatible?”, someone will ask. Well, I can’t give you an alien example (though plenty of SF authors have tried: the Alien movies did a pretty good job). But I can give you an Earth counterexample to “all sentient species should get at least some minimal moral weight”: how about obligate human parasites? Specifically, how about guinea worms? President Carter, a man widely regarded as too good and honest to be American President, devoted his later years to trying to make a species of animal extinct, and I have never heard one breath of criticism towards him for it: they are nematode worm parasites, roughly 3 feet long, for whom at one stage in their lifecycle humans (or, recently, dogs) are the only host, and they cause excruciating pain to their victim for months, as well as disfiguringly injuring them, sometimes permanently damaging joints. I’m unaware of anyone who assigns any positive moral weight to guinea worms. No one is going to volunteer to carry one inside their body in order to keep the species alive in captivity. Even making prisoners do so would be a serious breach of the human-rights prohibition on torture. I strongly suspect that even Jains or Buddhist monks would not do that. We are going to make guinea worms extinct, and we are going to celebrate when we finally manage it: there’s already a website tracking progress towards this goal (we’re down to around 10 adults of the in-humans stage), and donors, including the Carter Center, are funding it. They are a species whose existence is simply incompatible with our well-being. Yes, I guess I would support freezing some eggs of some other lifecycle stage in case we can eventually figure out a solution involving nerveless cloned human flesh in a vat in a zoo. Until then, the world is a far better place without them.