There is a basic question that has been confusing me for a while that I would like to ask about:
Why are the goals of AI safety, like achieving safety from extinction risks, or protection for human wellbeing, not more often framed as the goal of making moral machines? Or in other words, building AI that has a strong and reliable sense of morality and ethics.
There is definitely a lot of discussion around the edges of this question. For example, one recent post by @Richard_Ngo asked whether AI should be aligned to virtues, and a post from last year by @johnswentworth described his thinking about what the alignment problem is. However, there's also a huge swath of writing where the concept of machine morality is never mentioned.
Part of the reason for my curiosity is that this framing seems like it could resolve a lot of confusion, and in many ways it seems the most intuitive. For example, it is broadly the framing we apply when trying to raise and educate safe and good humans.
This framing would also provide a nice way of synthesizing many different core AI safety results, like ‘emergent misalignment.’ We could simply say that an AI exhibiting emergent misalignment did not possess a strong moral compass, or a strong sense of morality, prior to its fine-tuning.
Is there some history with this framing where it was at some point made to seem outmoded or obsolete? I can imagine various obvious-ish objections, like the fact that morality is hard to define. (But again, the fact that this is the framing we use with humans makes it seem pretty powerful and flexible.) But it’s not clear to me why this framing has any more or fewer issues than any other.
Greatly appreciate any input, or suggestions of where to look further.
I think it's a mix of two (very related) things:
General deep belief in moral anti-realism:
I.e., any given human has a set of values, and these values are not “special” from an objective standpoint. You have the values you have, and you follow them because they’re your values, not because there is an external reason those values are “right”.
General deep belief in a weak form of orthogonality:
Basically, we can imagine machines pursuing any random goal. As long as we can specify what that goal is, there’s no obstacle to pointing it in that direction in principle.
It follows from these that solving alignment means we should be able to make the AI follow any random goal, and that “simplifying the problem” to only making it follow human values / human morality / the values of any given human, doesn’t buy us much. It doesn’t make the problem easier, because there’s nothing special about human values.
I feel I’m not explaining myself very well. But imagine you wanna launch a rocket into the sun. The sun is a very distinguished and special object relative to earth. Probably your plan for launching the rocket into the sun will need to consider a bunch of details about the sun.
Now imagine instead you wanna launch your rocket to star 8285F171058B in the Milky Way. Probably most steps of that plan would be the same if you were instead sending the rocket to some different random star. This means that solving the problem of sending a rocket to a random star is a better line of attack than trying to analyse all the properties of star 8285F171058B. Most of those features will not be relevant to the hard part of the problem.
I feel quite a bit of skepticism about the idea that a consensus view of moral anti-realism would have led to a preference for an alignment framing.
For example, amongst non-experts, there is a strong consensus about what constitutes moral and immoral conduct. Amongst moral philosophers, as I understand it, moral anti-realism is also a minority view; my understanding was that moral naturalism was closest to a consensus. (Not to say moral anti-realism is necessarily wrong.) If there were some kind of article or post describing how this view informed a shift in framing toward alignment, though, that would be very interesting and helpful.
Separately, it seems like you’re suggesting that alignment to arbitrary values provides a simpler framework or objective than morality, or what might be described as alignment to all moral patients.
This seems true. However, I guess where my confusion arises is that alignment to arbitrary values is not the safety goal. The safety goal is to make something that respects and adheres broadly to the moral values of all humans. Stated differently, narrow alignment is necessary but not sufficient for AI safety to be achieved.
Further, empirically, narrow alignment has been demonstrated for simpler systems for some time now. It seems like the bigger issue now is AI morality: robust ethical behavior, resembling a strong and unflappable moral compass, and so on.
It’s not a consensus view among random people, and it’s not a consensus view among academic philosophers, but it’s close to a consensus view among original LessWrong people, reflected in e.g. the Sequences.
I agree, but that was the point of the answer and the rocket analogy: you don’t buy much extra by focusing on human morality, and you risk confusing yourself. (I agree with johnswentworth’s comment to some degree here.)
I don’t think this is true. I think that to the degree alignment is demonstrated, it’s demonstrated as robust ethical behavior. I.e., the behavior of Opus 4.5 is robustly ethical, I think.
My concerns are unrelated to “morality” in particular. They’re more stuff like:
Are current personas stable under reflection / successors?
Will current alignment techniques keep working as we subject models to more and more RL?
Will current alignment techniques keep working for new architectures?
I.e. neuralese architectures, or architectures that do much more continual learning than current ones do
Do current alignment techniques work on superintelligences?
(I.e. does the proto-ASI model start alignment faking before you even have time to RLHF/constitutional AI it?)
“Tails come apart” risks
(Even if current alignment techniques work on ASIs, and we get an ASI trying to be nice, do small differences between its notion of niceness and ours cause us all to die when subject to extreme optimization pressure?)
Are current models even aligned in a weak sense?
Aligned behavior in current models comes from a mix of three things:
Deep values: the model really cares about the things we want it to care about
The model knows what we want it to do, and does that
The model has a bunch of shallow reflexes that make its behavior appear aligned with our values. I.e., it will not say something bad about the user or talk about reward-hacking / scheming, the same way humans will use she/her pronouns when talking to someone who looks like a woman or correct their gait if they stumble.
It’s unclear how to distribute weight among these. I’d put something like 10/30/60 (which is harsh, but the distribution would not look too different for humans). That leaves some room for other goals while still getting aligned-seeming behavior. And this might blow up when we make the AI smarter still.
None of these concerns talk about human morality. No. 5 talks about “niceness”, but I deliberately use that word to avoid talking about morality, even though “human values/morality” would be a description with more apt connotations.
I personally avoid even using the words “morality” or “ethics” in the context of AI alignment, because both of those words reliably turn the vast majority of otherwise-sensible people into morons the moment they are spoken.
Could you elaborate? That is surprising to me given the extreme importance of those terms for philosophical analysis of what is “good,” “right,” and so on.
Indeed, invoking the words “good” or “right” also tends to make people dumber (though less so than “morality” or “ethics”), and trying to do philosophical analysis of what is “good” or “right” is exactly the thing which seems to insta-brain-kill people; it’s exactly the lever which “morality” and “ethics” pull.
For example, let’s look at two pages in the Stanford Encyclopedia of Philosophy. I picked these by pulling up the table of contents, and then clicking the first one which seemed not-very-morality-loaded and the first one which seemed very-morality-loaded.
First up, abduction. No morality talk here. It’s describing a feature of human reasoning, which seems functionally load-bearing for epistemics in some cases and would probably generalize to other kinds of minds (like aliens or AI). It doesn’t trivially fit a couple common frames of epistemics, which is why it’s interesting. A lot of the discussion is centered around pretty narrow or outdated models of reasoning, but it’s a technically interesting and sensible article, which inspires good questions at least.
In contrast, the ethics of abortion. Before we even get to the actual content, note the topic. Abduction is a topic relevant to understanding minds and reasoning in general, a topic which would likely be relevant even to AIs; it belongs in a generalizable world-model. Abortion, by contrast, would be irrelevant to many other kinds of minds—e.g. human-level-intelligent platypuses would lay eggs, and therefore the whole issue of abortion would not have a clean analogue for them. (And human-level intelligent ants would be in a whole different frame!) Almost certainly, the reason why a Stanford Encyclopedia page exists for abortion at all is that it was a major hot-button political topic for a while, which won the memetic competition for attention in US politics. But in the grand scheme of things, it is just not that important of a question at all even for humans, and entirely irrelevant to many other kinds of minds. The very fact that people pay so much attention to it is itself a strong sign of mindkill.
Looking at the content of the page… the entire thing is a string of analogies and attempts to generalize various heuristics to the case of abortion. Notably sparse or absent are:
Technical engagement with the developmental process, when various things come online for a fetus/baby (like e.g. pain, self-awareness).
Technical engagement with the way humans’ preferences/values actually typically form. Spoiler: it ain’t usually by thought-experiments involving a violinist.
Technical engagement with the first and second-order actual effects of abortion laws/norms (though laws/norms are of course distinct from morality, consequentialism still matters).
More vibe-ishly, compared to the abduction article, the whole thing very much has a bikeshed vibe to it. It’s all the sort of stuff which would make good fodder for conversation at a house party, not the sort of stuff which involves dense technical study and deep understanding.
Ok, I think I might see what you mean now; one might prefer framings in terms of alignment over morality, because moral framings might tend to provoke controversy, irrationality, or reactionary thinking.
Personally, I feel like I would still tend to prefer the moral framing, in terms of clarity and just plain accuracy. It does seem a little like the alignment framing is obfuscating a subject just to make it less provocative, when really, the subject is going to be provocative, no matter what, when you think about it deeply.
Quite the opposite: the subject-we-gesture-at-with-the-word-”alignment” is not particularly provocative or controversial when you think about it deeply, at least not along the axes people generally argue over in the context of morality/ethics, because those axes just aren’t that technically central or relevant.
Personally, my guess is that morality and ethics themselves would not be particularly controversial or provocative if people usually approached them with a goal of deep technical understanding. That’s just not the goal with which approximately-anybody, including nearly all professional philosophers, approaches the subject—as we see e.g. on that Stanford Encyclopedia page. Those are people trying to have the equivalent of fun house party conversations, or in some cases write manifestos, not people seriously trying to achieve deep technical understanding.
I want to link to Lukeprog’s classic LW essays on this subject, Train Philosophers with Pearl and Kahneman, not Plato and Kant, and Philosophy Needs to Trust Your Rationality Even Though It Shouldn’t. Two quotes from the latter:
and I think about this framing a lot
I think it is a losing fight to attempt to get consensus on philosophical questions about meta-ethics, and agree with strongly avoiding such attempts when possible.
As far as I understand arguments by, e.g., Kokotajlo, @Wei Dai, etc, morality is, at the very least, FAR from being solved (or outright insoluble, e.g. if Wei Dai’s alternative #5 ends up being true) and even moral intuitions are currently formed through an untrustworthy mechanism.
It is true that morality is complex and there are different ways of deriving morality, or what is “right” and “wrong”; but then again, there is broad consensus about what you teach a child when you are teaching them morality and ethics. It seems to me that when humans fall short in moral conduct, it is most often an issue with their conduct, rather than an issue with morality being hard to define. But even if it is hard to define, I suppose my question remains—why is it a less common framing than ‘alignment’? Did at some point, people decide that alignment was more solvable than morality?
I think it’s historical. The alignment approach to AI existential safety is associated with very strong and very influential thinkers (e.g. Eliezer himself).
So the development of alternatives to that has been an uphill battle.
My hope is that people will start to reconsider in light of many recent developments, the latest of which is the confrontation around the “Department of War” demanding that advanced AI systems it uses be aligned to whatever Department officials decide is right.