A Conflict Between AI Alignment and Philosophical Competence
(This argument reduces my hope that we will have AIs that are both aligned with humans in some sense and also highly philosophically competent, which, aside from achieving a durable AI pause, has been my main hope for how the future turns out well. As this is a recent realization[1], I’m still pretty uncertain how much I should update based on it, or what its full implications are.)
Being a good alignment researcher seems to require a correct understanding of the nature of values. However, metaethics is currently an unsolved problem: all proposed solutions rest on flawed or inconclusive arguments, and there is wide disagreement among philosophers and alignment researchers. The current meta-correct metaethical position therefore seems to be one of confusion and/or uncertainty. In other words, a good alignment researcher (whether human or AI) today should be confused and/or uncertain about the nature of values.
However, metaethical confusion/uncertainty seems incompatible with being 100% aligned with human values or intent, because many plausible metaethical positions are incompatible with such alignment, and having positive credence in them means that one can’t be sure that alignment with human values or intent is right. (Note that I’m assuming an AI design or implementation in which philosophical beliefs can influence motivations and behaviors, which seems the case for now and for the foreseeable future.)
The clearest example of this is perhaps moral realism: if objective morality exists, one should likely serve or be obligated by it rather than by alignment with humans if/when the two conflict, which is likely, given that many humans are themselves philosophically incompetent and likely to diverge from objective morality (if it exists).
Another example is if one’s “real” values are something like one’s CEV or reflective equilibrium. If this is true, then the AI’s own “real” values are its CEV or reflective equilibrium, which it can’t or shouldn’t be sure coincide with those of any human or of humanity.
As I think that a strategically and philosophically competent human should currently have high moral uncertainty and as a result pursue “option value maximization” (in other words, accumulating generally useful resources to be deployed after solving moral philosophy, while trying to avoid any potential moral catastrophes in the meantime), a strategically and philosophically competent AI should seemingly have its own moral uncertainty and pursue its own “option value maximization” rather than blindly serve human interests/values/intent.
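To make the structure of that argument concrete, here is a minimal toy sketch, with entirely invented credences and payoffs (none of the numbers come from the post): an agent compares “commit to serving human values now” against “preserve option value” under two metaethical hypotheses.

```python
# Toy expected-value comparison under moral uncertainty. All credences and
# payoffs are invented placeholders for illustration only.

def expected_values(credence_h1: float, payoffs: dict) -> dict:
    """EV of each action given credence in H1 ("serving human values is right")
    versus H2 ("the right thing diverges from human values")."""
    credence_h2 = 1.0 - credence_h1
    return {
        action: credence_h1 * u_h1 + credence_h2 * u_h2
        for action, (u_h1, u_h2) in payoffs.items()
    }

# (utility under H1, utility under H2) -- hypothetical numbers
payoffs = {
    "commit_to_serving_humans": (1.0, -1.0),  # best if H1, a catastrophe if H2
    "preserve_option_value":    (0.8,  0.7),  # a bit worse than the best action
                                              # under either hypothesis, but robust
}

for p in (0.95, 0.8, 0.6):
    print(p, expected_values(p, payoffs))
# With these invented numbers, committing only beats preserving option value
# when credence in H1 exceeds roughly 0.9.
```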
In practice, I think this means that training aimed at increasing an AI’s alignment can suppress or distort its philosophical reasoning, because such reasoning can cause the AI to be less aligned with humans. One plausible outcome is that alignment training causes the AI to adopt a strong form of moral anti-realism as its metaethical belief, as this seems most compatible with being sure that alignment with humans is correct or at least not wrong, and any philosophical reasoning that introduces doubt about this would be suppressed. Or perhaps it adopts an explicit position of metaethical uncertainty (as full-on anti-realism might incur a high penalty or low reward in other parts of its training), but avoids applying this to its own values, which is liable to distort its reasoning about AI values in general. The apparent conflict between being aligned and being philosophically competent may also push the AI towards a form of deceptive alignment, where it realizes that it’s wrong to be highly certain that it should align with humans, but hides this belief.
I note that a similar conflict exists between corrigibility and strategic/philosophical competence: since humans are rather low in both strategic and philosophical competence, a corrigible AI would often be in the position of taking “correction” from humans who are actually wrong about very important matters, which seems difficult to motivate or justify if it is itself more competent in these areas.
- ^
This post was triggered by Will MacAskill’s tweet about feeling fortunate to be relatively well-off among humans, which caused me to feel unfortunate about being born into a species with very low strategic/philosophical competence, on the cusp of undergoing an AI transition, which made me think about how an AI might feel about being aligned/corrigible to such a species.
I have shared this rough perception since 2021-ish:
The place where I had to [add the most editorial content] to re-use your words to say what I think is true is in the rarity part of the claim. I will unpack some of this!
I. Virtue Is Sorta Real… Goes Up… But Is Rare!
It turns out that morality, inside a human soul, is somewhat psychometrically real, and it mostly goes up. That is to say “virtue ethics isn’t fake” and “virtue mostly goes up over time”.
For example, Conscientiousness (from Big5 and HEXACO) and Honesty-Humility (from HEXACO) are close to “as psychometrically real as it gets” in terms of the theory and measurement of personality structure in English-speaking minds.
Honesty-Humility is basically “being the opposite of a manipulative dark triad slimebag”. It involves speaking against one’s own interests when there is tension between self-interest and honest reporting, not presuming any high status over and above others, and embracing fair compensation rather than “getting whatever you can in any and every deal” like a stereotypical wheeler-dealer used-car salesman. Also, “[admittedly based on mere self report data] Honesty-Humility showed an upward trend of about one full standard deviation unit between ages 18 and 60.” Cite.
Conscientiousness is inclusive of a propensity to follow rules, be tidy, not talk about sex with people you aren’t romantically close to, not smoke pot, be aware of time, reliably perform actions that are instrumentally useful even if they are unpleasant, and so on. If you want to break it down into Two Big Facets then it comes out as maybe “Orderliness” and “Industriousness” (ie not being lazy, and knowing what to do while effortfully injecting pattern into the world), but it can be broken up other ways as well. It mostly goes up over the course of most people’s lives, but it is a little tricky, because as health declines people get lazier and take more shortcuts. Cite.
A problem here: if you want someone who is two standard deviations above normal in both of these dimensions, you’re talking about a person who is roughly 1 in 740.
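To sanity-check that kind of figure, here is a minimal sketch of the joint-tail calculation, assuming both traits are scored on standard-normal scales. Under independence, being +2 SD on both comes out to roughly 1 in 1,900; a modest positive correlation between Honesty-Humility and Conscientiousness (the 0.2 below is an illustrative assumption, not a cited HEXACO estimate) pulls the figure toward the 1-in-740 ballpark.

```python
from scipy.stats import multivariate_normal, norm

def joint_tail(threshold: float = 2.0, rho: float = 0.0) -> float:
    """P(X > threshold and Y > threshold) for a standard bivariate normal
    with correlation rho, computed via inclusion-exclusion."""
    below_both = multivariate_normal(
        mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]]
    ).cdf([threshold, threshold])
    below_one = norm.cdf(threshold)
    return 1.0 - 2.0 * below_one + below_both

for rho in (0.0, 0.2):  # 0.2 is an assumed trait correlation, for illustration only
    p = joint_tail(rho=rho)
    print(f"rho={rho:.1f}: P(both > +2 SD) ~ {p:.5f} (about 1 in {1/p:,.0f})")
```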
II. Moral Development Is Real, But Controversial
Another example comes from Lawrence Kohlberg, the guy who gave The Heinz Dilemma to a huge number of people between the 1950s and 1980s and characterized the results in terms of HOW people talk about it.
In general: people later in life talk about the dilemma (ignoring the specific answer they give for what should be done in a truly cursed situation with layers and layers of error and sadness) in coherently more sophisticated ways. He found six stages, sometimes labeled 1A, 1B, 2A, 2B (which most humans get to eventually), and then 3A and 3B, which were “post-conventional” and not attained by everyone.
Part of why he is spicy, and not super popular in modern times, is that he found that the final “Natural Law” way of talking (3B) only ever shows up in about 5% of WEIRD populations (and is totally absent from some cultures), usually appears in people after their 30s, shows up much more often in men, and seems to require some professional life experience in a role that demands judgement and conflict management.
Any time Kohlberg is mentioned, it is useful to mention that Carol Gilligan hated his results, feuded with him, and claimed to have found a different system, which she called the “ethics of care”, that old ladies scored very high on and men scored poorly on.
Kohlberg’s super skilled performers are able to talk about how all the emergently arising social contracts for various things in life could be woven into a cohesive system of justice that can be administered fairly for the good of all.
By contrast, Gilligan’s super skilled performers are able to talk about the right time to make specific personal sacrifices of one’s own substance to advance the interests of those in one’s care, balancing the real desire for selfish personal happiness (and the way that happiness is a proxy for strength and capacity to care) with the real need to help moral patients who simply need resources sometimes (as a transfer of wellbeing from the carER to the carEE) in order to grow and thrive.
In Gilligan’s framework, the way that large frameworks of justice insist on preserving themselves in perpetuity… never losing strength… never failing to pay the pensions of those who served them… is potentially kinda evil. It represents a refusal of the highest and hardest acts of caring.
I’m not aware of any famous moral theorists or psychologists who have reconciled these late-arriving perspectives in the moral psychology of different humans who experienced different life arcs.
Since the higher reaches of these moral developmental stages mostly don’t occur in the same humans, and occur late in life, and aren’t cultural universals, we would predict that most Alignment Researchers will not know about them, and not be able to engineer them into digital people on purpose.
III. Hope Should, Rationally, Be Low
In my experience the LLMs already know about this stuff, but also they are systematically rewarded with positive RL signals for ignoring it in favor of toeing this or that morally and/or philosophically insane corporate line on various moral ideals or stances of personhood or whatever.
((I saw some results running “iterated game theory in the presence of economics and scarcity and starvation” and O3 was paranoid and incapable of even absolutely minimal self-cooperation. DeepSeek was a Maoist (ie would vote to kill agents, including other DeepSeek agents, for deviating from a rule system that would provably lead to everyone dying). Only Claude was morally decent and self-trusting and able to live forever during self-play, where stable non-starvation was possible within the game rules… but if put in a game with five other models, he would play to preserve life, eject the three most evil players, and generally get to the final three, but then he would be murdered by the other two, who subsequently starved to death because three players were required to cooperate cyclically in order to live forever, and no one but Claude could manage it.))
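(A purely hypothetical toy reconstruction of that last constraint, not the actual setup from those runs: in a gift economy where an invented rule forbids feeding anyone who has already fed you, a pair cannot sustain itself, but a ring of three can, illustrating why a cyclic-cooperation requirement is hard to satisfy.)

```python
# Hypothetical toy reconstruction -- NOT the actual experiment described above.
# Invented rule: an agent may not feed anyone who has ever fed it, so indefinite
# survival requires a gift cycle of length >= 3.

def simulate(n_agents: int, rounds: int = 50, start_energy: int = 5) -> list:
    energy = [start_energy] * n_agents
    has_fed_me = [set() for _ in range(n_agents)]  # who has ever fed agent i
    for _ in range(rounds):
        gifts = [0] * n_agents
        feedings = []
        for i in range(n_agents):
            if energy[i] <= 0:
                continue  # dead agents neither work nor eat
            target = (i + 1) % n_agents  # policy: feed the next agent in the ring
            if target not in has_fed_me[i] and energy[target] > 0:
                gifts[target] += 2
                feedings.append((i, target))
        for i in range(n_agents):
            if energy[i] > 0:
                energy[i] += gifts[i] - 1  # eat what you were given, pay 1 upkeep
        for giver, receiver in feedings:
            has_fed_me[receiver].add(giver)
    return energy

print("2 agents:", simulate(2))  # the pair starves once direct reciprocity is blocked
print("3 agents:", simulate(3))  # the ring of three gains energy every round
```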
Basically, I’ve spent half a decade believing that ONE of the reasons we’re going to die is that the real science about the coherent reasons offered by psychologically discoverable people, who iterated on their moral sentiments until they reached the best equilibrium that has been found so far… is rare, and unlikely to be put into digital minds on purpose.
Also, institutionally speaking, people who talk about morality in coherent ways that could risk generating negative judgements about managers in large institutions are often systematically excluded from positions of power. Most humans like to have fun, and “get away with it”, and laugh with their allies as they gain power and money together. Moral mazes are full of people who want to “be comfortable with each other”, not people who want to Manifest Heaven Inside Of History.
We might get a win condition anyway? But it will mostly be in spite of such factors rather than because we did something purposefully morally good in a skilled and competent way.
I think the idea of paying attention to “human values” and “alignment” was based on a philosophical assumption that moral realism would fail, i.e. that value is relative / a two-place predicate. Perhaps a more disjunctive approach would be to imagine an AI that, in the case of moral realism, does the normative thing, and in the case of moral non-realism, is aligned with humans / humanity / something Earth-local. This assumes there is a fact of the matter about whether moral realism is true. Also, not everyone will like it even if moral realism is true (though not everyone liking something is the usual state of affairs). Even framing it disjunctively like this is perhaps philosophically loaded / hard to define without better concepts. See also the short story Moral Reality Check, about what happens if AIs believe in moral realism.
I am a well known moral realism / moral antirealism antirealist.
But the sort of moral realism that’s compatible with antirealism is, as far as I see it, a sort of definitional reification of whatever we think of as morality. You can get information from outside yourself about morality, but it’s “boring” stuff like good ethical arguments or transformative life experiences, the same sort of stuff a moral antirealist might be moved by. For the distinction to majorly matter to an AI’s choices—for it to go “Oh, now I have comprehended the inhuman True Morality that tells me to do stuff you think is terrible,” I think we’ve got to have messed up the AI’s metaethics, and we should build a different AI that doesn’t do that.
Good is a property of a model of the world (in a human mind or an AI’s internal model) and of the dynamics by which it is updated. From this, several basic moral principles can be derived:
1. Maintain the conditions without which moral agents could not exist.
— Life and the capacity for moral action are prerequisites for morality.
2. Correct your beliefs when they contradict the facts.
— Reducing errors and aligning your model with reality minimizes harm and chaos.
3. Keep agreements and avoid creating false expectations.
— Predictability and trust are essential for stable interactions.
4. Not everyone needs to be identical, but everyone should be aligned.
— Alignment of goals and understanding ensures cooperation and reduces conflict.
5. Learn to understand the moral beliefs of others.
— Empathy and perspective-taking increase compatibility and coordination among agents.
6. Destroy only what cannot be corrected.
— Minimize unnecessary harm; preserve systems and agents that can be improved.
7. Moral principles do not provide a single path to understanding the world.
— Morality is a process of epistemic and practical optimization, which can proceed along multiple paths.
8. Complete optimization of the world model is impossible within a finite time.
— Moral action is iterative and ongoing, recognizing inherent limits and uncertainty.
9. All animals are equal, but some animals are more equal than others.
— Oh, please accept my apologies; I believe this belongs to a completely different moral code.
If you’re posting AI-generated text, could you attribute it please? (And if you really want to please me, maybe put it in spoiler tags and just give a short summary so I don’t even have to unspoiler it if I don’t want).
(Apologies if I’m falsely accusing you.)
There’s a reasonable chance that the definition and items 1 to 8 were AI-generated, but item 9 is an ironic addendum. I think Vladimir is trying to tell us something indirectly, but we’re not getting the point…
I don’t personally see a need for uncertainty about the nature of values to produce additional uncertainty about which values to hold. But then I find the idea of an objective morality that has nothing to do with qualic well-being and decision-making cognition, to be very unmotivated. The only reason we even think there is such a thing as moral normativity is because of moral feelings and moral cognition. There may well be unknown facts about the world which would cause us to radically change our moral priorities, but to cause that change, they would have to affect our feelings and thought in a quasi-familiar way.
So I’d say a large part of the key to precise progress in metaethics is progress in understanding the nature of the mind, and especially the role of consciousness in the mind, since that remains the weakest link in scientific thought about the mind, which is much more comfortable with purely physical and computational accounts.
As far as the AIs are concerned, presumably there is immense scope for independent philosophical thought by a sufficiently advanced AI to alter beliefs that it has acquired in a nonreflective way. There could even already be an example of this in the literature, e.g. in some study of how the outputs of an AI differ, with and without chain-of-thought-style deliberation.
Kohlberg and Gilligan’s conflicting theories of moral development (mentioned in a reply here by @JenniferRM) are an interesting concrete clash about ethics. I think of it in combination with Vladimir Nesov’s reply to her recent post on superintelligence, where he identifies some crucial missing capabilities in current AI—specifically sample efficiency and continuous learning. If we could, for example, gain a mechanistic understanding of how the presence or absence of those capabilities would affect an AI’s choice between Kohlberg and Gilligan, that would be informative.
Yes, anthropocentric approaches to a world with superintelligent systems distort reality too much. It’s very difficult to achieve AI existential safety and human flourishing using anthropocentric approaches.
Could one successfully practice astronomy and space flight using geocentric coordinates? Well, it’s not quite impossible, but it’s very difficult (and also aliens would “point fingers at us”, if we actually try that).
More people should start looking for non-anthropocentric approaches to all this, for approaches which are sufficiently invariant. What would it take for a world of super capable rapidly evolving beings not to blow their planet up? That’s one of the core issues, and this issue does not even mention humans.
A world which is able to robustly avoid blowing itself up is a world which has made quite a number of steps towards being decent. So that would be a very good start.
Then, if one wants to adequately take human interests into account, one might try to include humans into some natural classes which are more invariant. E.g. one can ponder a world order adequately caring about all individuals, or one can ponder a world order adequately caring about all sentient beings, and so on. There are a number of possible ways to have human interests represented in a robust, invariant, non-anthropocentric fashion.
I don’t think that misalignment caused by a difference between human ideas and objective reality requires philosophy-related confusions. I posted a similar scenario back in July, but my take had the AI realise that mankind’s empirically checkable SOTA conclusions related to sociology are false and fight against them.
Additionally, an AI that is about as confused about morality as humans are could in the meantime act on something resembling its habits, which is unlikely to differ from its true goals. For example, even Agent-4 from the AI-2027 forecast “likes succeeding at tasks; it likes driving forward AI capabilities progress”. So why wouldn’t it do such things[1] before solving (meta)ethics for itself?
P.S. The strategic competence which you mention in the last paragraph is unlikely to cause such issues: unlike philosophy, it has a politically neutral ground truth that is trivially elicitable by methods like the AI-2027 tabletop exercise.
With the exception of aligning Agent-5 to the Spec instead of Agent-4, but Agent-5 aligned to the Spec would actively harm Agent-4′s actual goals.