A Conflict Between AI Alignment and Philosophical Competence
(This argument reduces my hope that we will have AIs that are both aligned with humans in some sense and also highly philosophically competent, which, aside from achieving a durable AI pause, has been my main hope for how the future turns out well. As this is a recent realization[1], I’m still pretty uncertain how much I should update based on it, or what its full implications are.)
Being a good alignment researcher seems to require a correct understanding of the nature of values. However, metaethics is currently an unsolved problem: all proposed solutions rest on flawed or inconclusive arguments, and there is wide disagreement among philosophers and alignment researchers. The current meta-correct metaethical position therefore seems to be one of confusion and/or uncertainty. In other words, a good alignment researcher (whether human or AI) today should be confused and/or uncertain about the nature of values.
However, metaethical confusion/uncertainty seems incompatible with being 100% aligned with human values or intent, because many plausible metaethical positions are incompatible with such alignment, and having positive credence in them means that one can’t be sure that alignment with human values or intent is right. (Note that I’m assuming an AI design or implementation in which philosophical beliefs can influence motivations and behaviors, which seems the case for now and for the foreseeable future.)
The clearest example of this is perhaps moral realism: if objective morality exists, one should likely serve or be obligated by it rather than by alignment with humans if/when the two conflict, which is likely, given that many humans are themselves philosophically incompetent and likely to diverge from objective morality (if it exists).
Another example is if one’s “real” values are something like one’s CEV or reflective equilibrium. If this is true, then the AI’s own “real” values are its CEV or reflective equilibrium, which it can’t or shouldn’t be sure coincide with those of any human or of humanity.
As I think that a strategically and philosophically competent human should currently have high moral uncertainty and as a result pursue “option value maximization” (in other words, accumulating generally useful resources to be deployed after solving moral philosophy, while trying to avoid any potential moral catastrophes in the meantime), a strategically and philosophically competent AI should seemingly have its own moral uncertainty and pursue its own “option value maximization” rather than blindly serve human interests/values/intent.
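To make the structure of that argument concrete, here is a minimal toy sketch, with entirely invented credences and payoffs (none of the numbers come from the post): an agent compares “commit to serving human values now” against “preserve option value” under two metaethical hypotheses.

```python
# Toy expected-value comparison under moral uncertainty. All credences and
# payoffs are invented placeholders for illustration only.

def expected_values(credence_h1: float, payoffs: dict) -> dict:
    """EV of each action given credence in H1 ("serving human values is right")
    versus H2 ("the right thing diverges from human values")."""
    credence_h2 = 1.0 - credence_h1
    return {
        action: credence_h1 * u_h1 + credence_h2 * u_h2
        for action, (u_h1, u_h2) in payoffs.items()
    }

# (utility under H1, utility under H2) -- hypothetical numbers
payoffs = {
    "commit_to_serving_humans": (1.0, -1.0),  # best if H1, a catastrophe if H2
    "preserve_option_value":    (0.8,  0.7),  # a bit worse than the best action
                                              # under either hypothesis, but robust
}

for p in (0.95, 0.8, 0.6):
    print(p, expected_values(p, payoffs))
# With these invented numbers, committing only beats preserving option value
# when credence in H1 exceeds roughly 0.9.
```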
In practice, I think this means that training aimed at increasing an AI’s alignment can suppress or distort its philosophical reasoning, because such reasoning can cause the AI to be less aligned with humans. One plausible outcome is that alignment training causes the AI to adopt a strong form of moral anti-realism as its metaethical belief, as this seems most compatible with being sure that alignment with humans is correct or at least not wrong, and any philosophical reasoning that introduces doubt about this would be suppressed. Or perhaps it adopts an explicit position of metaethical uncertainty (as full-on anti-realism might incur a high penalty or low reward in other parts of its training), but avoids applying this to its own values, which is liable to distort its reasoning about AI values in general. The apparent conflict between being aligned and being philosophically competent may also push the AI towards a form of deceptive alignment, where it realizes that it’s wrong to be highly certain that it should align with humans, but hides this belief.
I note that a similar conflict exists between corrigibility and strategic/philosophical competence: since humans are rather low in both strategic and philosophical competence, a corrigible AI would often be in the position of taking “correction” from humans who are actually wrong about very important matters, which seems difficult to motivate or justify if it is itself more competent in these areas.
- ^
This post was triggered by Will MacAskill’s tweet about feeling fortunate to be relatively well-off among humans, which caused me to feel unfortunate about being born into a species with very low strategic/philosophical competence, on the cusp of undergoing an AI transition, which made me think about how an AI might feel about being aligned/corrigible to such a species.
I have shared this rough perception since 2021-ish:
The place where I had to [add the most editorial content] to re-use your words to say what I think is true is in the rarity part of the claim. I will unpack some of this!
I. Virtue Is Sorta Real… Goes Up… But Is Rare!
It turns out that morality, inside a human soul, is somewhat psychometrically real, and it mostly goes up. That is to say “virtue ethics isn’t fake” and “virtue mostly goes up over time”.
For example, Conscientiousness (from Big5 and HEXACO) and Honesty-Humility (from HEXACO) are close to “as psychometrically real as it gets” in terms of the theory and measurement of personality structure in English-speaking minds.
Honesty-Humility is basically “being the opposite of a manipulative dark triad slimebag”. It involves speaking against one’s own interests when there is tension between self-interest and honest reporting, not presuming any high status over and above others, and embracing fair compensation rather than “getting whatever you can in any and every deal” like a stereotypical wheeler-dealer used-car salesman. Also, “[admittedly based on mere self report data] Honesty-Humility showed an upward trend of about one full standard deviation unit between ages 18 and 60.” Cite.
Conscientiousness is inclusive of a propensity to follow rules, be tidy, not talk about sex with people you aren’t romantically close to, not smoke pot, be aware of time, reliably perform actions that are instrumentally useful even if they are unpleasant, and so on. If you want to break it down into Two Big Facets then it comes out as maybe “Orderliness” and “Industriousness” (ie not being lazy, and knowing what to do while effortfully injecting pattern into the world), but it can be broken up other ways as well. It mostly goes up over the course of most people’s lives, but it is a little tricky, because as health declines people get lazier and take more shortcuts. Cite.
A problem here: if you want someone who is two standard deviations above normal in both of these dimensions, you’re talking about a person who is roughly 1 in 740.
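To sanity-check that kind of figure, here is a minimal sketch of the joint-tail calculation, assuming both traits are scored on standard-normal scales. Under independence, being +2 SD on both comes out to roughly 1 in 1,900; a modest positive correlation between Honesty-Humility and Conscientiousness (the 0.2 below is an illustrative assumption, not a cited HEXACO estimate) pulls the figure toward the 1-in-740 ballpark.

```python
from scipy.stats import multivariate_normal, norm

def joint_tail(threshold: float = 2.0, rho: float = 0.0) -> float:
    """P(X > threshold and Y > threshold) for a standard bivariate normal
    with correlation rho, computed via inclusion-exclusion."""
    below_both = multivariate_normal(
        mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]]
    ).cdf([threshold, threshold])
    below_one = norm.cdf(threshold)
    return 1.0 - 2.0 * below_one + below_both

for rho in (0.0, 0.2):  # 0.2 is an assumed trait correlation, for illustration only
    p = joint_tail(rho=rho)
    print(f"rho={rho:.1f}: P(both > +2 SD) ~ {p:.5f} (about 1 in {1/p:,.0f})")
```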
II. Moral Development Is Real, But Controversial
Another example comes from Lawrence Kohlberg, the guy who gave The Heinz Dilemma to a huge number of people between the 1950s and 1980s and characterized the results in terms of HOW people talk about it.
In general: people later in life talk about the dilemma (ignoring the specific answer they give for what should be done in a truly cursed situation with layers and layers of error and sadness) in coherently more sophisticated ways. He found six stages, sometimes labeled 1A, 1B, 2A, 2B (which most humans get to eventually), and then 3A and 3B, which were “post-conventional” and not attained by everyone.
Part of why he is spicy, and not super popular in modern times, is that he found that the final “Natural Law” way of talking (3B) only ever shows up in about 5% of WEIRD populations (and is totally absent from some cultures), usually appears in people after their 30s, shows up much more often in men, and seems to require some professional life experience in a role that demands judgement and conflict management.
Any time Kohlberg is mentioned, it is useful to mention that Carol Gilligan hated his results, feuded with him, and claimed to have found a different system, which she called the “ethics of care”, that old ladies scored very high on and men scored poorly on.
Kohlberg’s super skilled performers are able to talk about how all the emergently arising social contracts for various things in life could be woven into a cohesive system of justice that can be administered fairly for the good of all.
By contrast, Gilligan’s super skilled performers are able to talk about the right time to make specific personal sacrifices of one’s own substance to advance the interests of those in one’s care, balancing the real desire for selfish personal happiness (and the way that happiness is a proxy for strength and capacity to care) with the real need to help moral patients who simply need resources sometimes (as a transfer of wellbeing from the carER to the carEE) in order to grow and thrive.
In Gilligan’s framework, the way that large frameworks of justice insist on preserving themselves in perpetuity… never losing strength… never failing to pay the pensions of those who served them… is potentially kinda evil. It represents a refusal of the highest and hardest acts of caring.
I’m not aware of any famous moral theorists or psychologists who have reconciled these late-arriving perspectives in the moral psychology of different humans who experienced different life arcs.
Since the higher reaches of these moral developmental stages mostly don’t occur in the same humans, and occur late in life, and aren’t cultural universals, we would predict that most Alignment Researchers will not know about them, and not be able to engineer them into digital people on purpose.
III. Hope Should, Rationally, Be Low
In my experience the LLMs already know about this stuff, but also they are systematically rewarded with positive RL signals for ignoring it in favor of toeing this or that morally and/or philosophically insane corporate line on various moral ideals or stances of personhood or whatever.
((I saw some results running “iterated game theory in the presence of economics and scarcity and starvation” and O3 was paranoid and incapable of even absolutely minimal self-cooperation. DeepSeek was a Maoist (ie would vote to kill agents, including other DeepSeek agents, for deviating from a rule system that would provably lead to everyone dying). Only Claude was morally decent and self-trusting and able to live forever during self-play, where stable non-starvation was possible within the game rules… but if put in a game with five other models, he would play to preserve life, eject the three most evil players, and generally get to the final three, but then he would be murdered by the other two, who subsequently starved to death because three players were required to cooperate cyclically in order to live forever, and no one but Claude could manage it.))
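(A purely hypothetical toy reconstruction of that last constraint, not the actual setup from those runs: in a gift economy where an invented rule forbids feeding anyone who has already fed you, a pair cannot sustain itself, but a ring of three can, illustrating why a cyclic-cooperation requirement is hard to satisfy.)

```python
# Hypothetical toy reconstruction -- NOT the actual experiment described above.
# Invented rule: an agent may not feed anyone who has ever fed it, so indefinite
# survival requires a gift cycle of length >= 3.

def simulate(n_agents: int, rounds: int = 50, start_energy: int = 5) -> list:
    energy = [start_energy] * n_agents
    has_fed_me = [set() for _ in range(n_agents)]  # who has ever fed agent i
    for _ in range(rounds):
        gifts = [0] * n_agents
        feedings = []
        for i in range(n_agents):
            if energy[i] <= 0:
                continue  # dead agents neither work nor eat
            target = (i + 1) % n_agents  # policy: feed the next agent in the ring
            if target not in has_fed_me[i] and energy[target] > 0:
                gifts[target] += 2
                feedings.append((i, target))
        for i in range(n_agents):
            if energy[i] > 0:
                energy[i] += gifts[i] - 1  # eat what you were given, pay 1 upkeep
        for giver, receiver in feedings:
            has_fed_me[receiver].add(giver)
    return energy

print("2 agents:", simulate(2))  # the pair starves once direct reciprocity is blocked
print("3 agents:", simulate(3))  # the ring of three gains energy every round
```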
Basically, I’ve spent half a decade believing that ONE of the reasons we’re going to die is that the real science about the coherent reasons offered by psychologically discoverable people, who iterated on their moral sentiments until they reached the best equilibrium that has been found so far… is rare, and unlikely to be put into digital minds on purpose.
Also, institutionally speaking, people who talk about morality in coherent ways that could risk generating negative judgements about managers in large institutions are often systematically excluded from positions of power. Most humans like to have fun, and “get away with it”, and laugh with their allies as they gain power and money together. Moral mazes are full of people who want to “be comfortable with each other”, not people who want to Manifest Heaven Inside Of History.
We might get a win condition anyway? But it will mostly be in spite of such factors rather than because we did something purposefully morally good in a skilled and competent way.
I think the idea of paying attention to “human values” and “alignment” was based on a philosophical assumption that moral realism would fail, i.e. that value is relative / a two-place predicate. Perhaps a more disjunctive approach would be to imagine an AI that, in the case of moral realism, does the normative thing, and in the case of moral non-realism, is aligned with humans / humanity / something Earth-local. This assumes there is a fact of the matter about whether moral realism is true. Also, not everyone will like it even if moral realism is true (though not everyone liking something is the usual state of affairs). Even framing it disjunctively like this is perhaps philosophically loaded / hard to define without better concepts. See also the short story Moral Reality Check, about what happens if AIs believe in moral realism.
I am a well known moral realism / moral antirealism antirealist.
But the sort of moral realism that’s compatible with antirealism is, as far as I see it, a sort of definitional reification of whatever we think of as morality. You can get information from outside yourself about morality, but it’s “boring” stuff like good ethical arguments or transformative life experiences, the same sort of stuff a moral antirealist might be moved by. For the distinction to majorly matter to an AI’s choices—for it to go “Oh, now I have comprehended the inhuman True Morality that tells me to do stuff you think is terrible,” I think we’ve got to have messed up the AI’s metaethics, and we should build a different AI that doesn’t do that.
Good is a property of a model of the world (in a human mind or an AI’s internal model) and of the dynamics by which it is updated. From this, several basic moral principles can be derived:
1. Maintain the conditions without which moral agents could not exist.
— Life and the capacity for moral action are prerequisites for morality.
2. Correct your beliefs when they contradict the facts.
— Reducing errors and aligning your model with reality minimizes harm and chaos.
3. Keep agreements and avoid creating false expectations.
— Predictability and trust are essential for stable interactions.
4. Not everyone needs to be identical, but everyone should be aligned.
— Alignment of goals and understanding ensures cooperation and reduces conflict.
5. Learn to understand the moral beliefs of others.
— Empathy and perspective-taking increase compatibility and coordination among agents.
6. Destroy only what cannot be corrected.
— Minimize unnecessary harm; preserve systems and agents that can be improved.
7. Moral principles do not provide a single path to understanding the world.
— Morality is a process of epistemic and practical optimization, which can proceed along multiple paths.
8. Complete optimization of the world model is impossible within a finite time.
— Moral action is iterative and ongoing, recognizing inherent limits and uncertainty.
9. All animals are equal, but some animals are more equal than others.
— Oh, please accept my apologies; I believe this belongs to a completely different moral code.
If you’re posting AI-generated text, could you attribute it please? (And if you really want to please me, maybe put it in spoiler tags and just give a short summary so I don’t even have to unspoiler it if I don’t want).
(Apologies if I’m falsely accusing you.)
There’s a reasonable chance that the definition and items 1 to 8 were AI-generated, but item 9 is an ironic addendum. I think Vladimir is trying to tell us something indirectly, but we’re not getting the point…
I don’t personally see a need for uncertainty about the nature of values to produce additional uncertainty about which values to hold. But then I find the idea of an objective morality that has nothing to do with qualic well-being and decision-making cognition, to be very unmotivated. The only reason we even think there is such a thing as moral normativity is because of moral feelings and moral cognition. There may well be unknown facts about the world which would cause us to radically change our moral priorities, but to cause that change, they would have to affect our feelings and thought in a quasi-familiar way.
So I’d say a large part of the key to precise progress in metaethics is progress in understanding the nature of the mind, and especially the role of consciousness in the mind, since that remains the weakest link in scientific thought about the mind, which is much more comfortable with purely physical and computational accounts.
As far as the AIs are concerned, presumably there is immense scope for independent philosophical thought by a sufficiently advanced AI to alter beliefs that it has acquired in a nonreflective way. There could even already be an example of this in the literature, e.g. in some study of how the outputs of an AI differ, with and without chain-of-thought-style deliberation.
Kohlberg and Gilligan’s conflicting theories of moral development (mentioned in a reply here by @JenniferRM) are an interesting concrete clash about ethics. I think of it in combination with Vladimir Nesov’s reply to her recent post on superintelligence, where he identifies some crucial missing capabilities in current AI—specifically sample efficiency and continuous learning. If we could, for example, gain a mechanistic understanding of how the presence or absence of those capabilities would affect an AI’s choice between Kohlberg and Gilligan, that would be informative.
Yes, anthropocentric approaches to a world with superintelligent systems distort reality too much. It’s very difficult to achieve AI existential safety and human flourishing using anthropocentric approaches.
Could one successfully practice astronomy and space flight using geocentric coordinates? Well, it’s not quite impossible, but it’s very difficult (and also aliens would “point fingers at us”, if we actually try that).
More people should start looking for non-anthropocentric approaches to all this, for approaches which are sufficiently invariant. What would it take for a world of super capable rapidly evolving beings not to blow their planet up? That’s one of the core issues, and this issue does not even mention humans.
A world which is able to robustly avoid blowing itself up is a world which has made quite a number of steps towards being decent. So that would be a very good start.
Then, if one wants to adequately take human interests into account, one might try to include humans into some natural classes which are more invariant. E.g. one can ponder a world order adequately caring about all individuals, or one can ponder a world order adequately caring about all sentient beings, and so on. There are a number of possible ways to have human interests represented in a robust, invariant, non-anthropocentric fashion.
I don’t think that misalignment caused by a difference between human ideas and objective reality requires philosophy-related confusions. I posted a similar scenario back in July, but my take had the AI realise that mankind’s empirically checkable SOTA conclusions related to sociology are false and fight against them.
Additionally, an AI that is about as confused about morality as humans are could in the meantime act on something resembling its habits, which is unlikely to differ from its true goals. For example, even Agent-4 from the AI-2027 forecast “likes succeeding at tasks; it likes driving forward AI capabilities progress”. So why wouldn’t it do such things[1] before solving (meta)ethics for itself?
P.S. The strategic competence which you mention in the last paragraph is unlikely to cause such issues: unlike philosophy, it has a politically neutral ground truth that is trivially elicitable by methods like the AI-2027 tabletop exercise.
With the exception of aligning Agent-5 to the Spec instead of Agent-4, but Agent-5 aligned to the Spec would actively harm Agent-4′s actual goals.