I have shared this rough perception since 2021-ish:
my main hope for how the future turns out well… aside from achieving a durable AI pause, has been… that we will have AIs that are both aligned with humans in some sense and also highly philosophically competent [but] good alignment researcher[s] (whether human or AI)… [are very very very rare]
The place where I had to [add the most editorial content] to re-use your words to say what I think is true is in the rarity part of the claim. I will unpack some of this!
I. Virtue Is Sorta Real… Goes Up… But Is Rare!
It turns out that morality, inside a human soul, is somewhat psychometrically real, and it mostly goes up. That is to say “virtue ethics isn’t fake” and “virtue mostly goes up over time”.
For example, Conscientiousness (from Big5 and HEXACO) and Honesty-Humility (from HEXACO) are close to “as psychometrically real as it gets” in terms of the theory and measurement of personality structure in English-speaking minds.
Honesty-Humility is basically “being the opposite of a manipulative dark-triad slimebag”. It involves speaking against one’s own interests when there is tension between self-interest and honest reporting, not presuming high status over and above others, and embracing fair compensation rather than “getting whatever you can in any and every deal” like a stereotypical wheeler-dealer used-car salesman. Also, “[admittedly based on mere self report data] Honesty-Humility showed an upward trend of about one full standard deviation unit between ages 18 and 60.” Cite.
Conscientiousness is inclusive of a propensity to follow rules, be tidy, not talk about sex with people you aren’t romantically close to, not smoke pot, be aware of time, perform actions reliably that are instrumentally useful even if they are unpleasant, and so on. If you want to break it down into Two Big Facets then it comes out as maybe “Orderliness” and “Industriousness” (i.e. not being lazy, and knowing what to do while effortfully injecting pattern into the world), but it can be broken up in other ways as well. It mostly goes up over the course of most people’s lives, but it is a little tricky, because as health declines people get lazier and take more shortcuts. Cite.
A problem here: if you want someone who is two standard deviations above the mean on both of these dimensions, you’re talking about a person who is roughly 1 in 740 (the exact rarity depends on how strongly the two traits correlate; under full independence it would be closer to 1 in 1,900).
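The sensitivity of that rarity figure to trait correlation can be checked with a quick Monte Carlo sketch. Note that the 0.35 correlation below is a hypothetical value chosen for illustration, not a measured Honesty-Humility × Conscientiousness correlation:

```python
import math
import random

random.seed(0)

N = 1_000_000
z = 2.0     # threshold: two standard deviations above the mean
rho = 0.35  # assumed correlation between the two traits (hypothetical)

both_ind = 0  # samples above threshold on both traits, independent case
both_cor = 0  # same, but with positively correlated traits

for _ in range(N):
    g1, g2 = random.gauss(0, 1), random.gauss(0, 1)
    # independent traits
    if g1 > z and g2 > z:
        both_ind += 1
    # correlated second trait via a Cholesky-style mix of the same noise
    y = rho * g1 + math.sqrt(1 - rho * rho) * g2
    if g1 > z and y > z:
        both_cor += 1

p_ind = both_ind / N
p_cor = both_cor / N
print(f"independent traits:   about 1 in {1 / p_ind:,.0f}")
print(f"correlated (rho=0.35): about 1 in {1 / p_cor:,.0f}")
```

Under independence this lands near 1 in 1,900; a modest positive correlation pulls the joint rarity down toward the 1-in-740 ballpark quoted above.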
II. Moral Development Is Real, But Controversial
Another example comes from Lawrence Kohlberg, the guy who gave the Heinz Dilemma to a huge number of people between the 1950s and 1980s, and characterized HOW people talk about it.
In general: people later in life talk about the dilemma (ignoring the specific answer they give for what should be done in a truly cursed situation with layers and layers of error and sadness) in coherently more sophisticated ways. He found six stages that are sometimes labeled 1A, 1B, 2A, and 2B (which most humans get to eventually), and then 3A and 3B, which were “post-conventional” and not attained by everyone.
Part of why he is spicy, and not super popular in modern times, is that he found that the final “Natural Law” way of talking (3B) only ever shows up in about 5% of WEIRD populations (and is totally absent from some cultures), usually appears in people after their 30s, shows up much more often in men, and seems to require some professional life experience in a role that demands judgement and conflict management.
Any time Kohlberg is mentioned, it is useful to mention that Carol Gilligan hated his results, and feuded with him, and claimed to have found a different system that old ladies scored very high on, and men scored poorly on, which she called the “ethics of care”.
Kohlberg’s super skilled performers are able to talk about how all the emergently arising social contracts for various things in life could be woven into a cohesive system of justice that can be administered fairly for the good of all.
By contrast, Gilligan’s super skilled performers are able to talk about the right time to make specific personal sacrifices of one’s own substance to advance the interests of those in one’s care, balancing the real desire for selfish personal happiness (and the way that happiness is a proxy for strength and capacity to care) with the real need to help moral patients who simply need resources sometimes (as a transfer of wellbeing from the carER to the carEE) in order to grow and thrive.
In Gilligan’s framework, the way that large frameworks of justice insist on preserving themselves in perpetuity… never losing strength… never failing to pay the pensions of those who served them… is potentially kinda evil. It represents a refusal of the highest and hardest acts of caring.
I’m not aware of any famous moral theorists or psychologists who have reconciled these late-arriving perspectives in the moral psychology of different humans who experienced different life arcs.
Since the higher reaches of these moral developmental stages mostly don’t occur in the same humans, and occur late in life, and aren’t cultural universals, we would predict that most Alignment Researchers will not know about them, and will not be able to engineer them into digital people on purpose.
III. Hope Should, Rationally, Be Low
In my experience the LLMs already know about this stuff, but also they are systematically rewarded with positive RL signals for ignoring it in favor of toeing this or that morally and/or philosophically insane corporate line on various moral ideals or stances of personhood or whatever.
((I saw some results running “iterated game theory in the presence of economics and scarcity and starvation”, and O3 was paranoid and incapable of even absolutely minimal self-cooperation. DeepSeek was a Maoist (i.e. would vote to kill agents, including other DeepSeek agents, for deviating from a rule system that would provably lead to everyone dying). Only Claude was morally decent and self-trusting and able to live forever during self-play where stable non-starvation was possible within the game rules… but if put in a game with five other models, he would play to preserve life, eject the three most evil players, and generally get to the final three, but then he would be murdered by the other two, who subsequently starved to death, because three players were required to cooperate cyclically in order to live forever, and no one but Claude could manage it.))
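The “cooperate or everyone starves” structure described above can be sketched as a toy. This is my own minimal reconstruction of the game’s logic (the payoff numbers and the cooperate/defect strategy labels are made-up illustrations, not the actual benchmark):

```python
def run(strategies, start=5, rounds=50):
    """Toy scarcity game: each round costs 1 food; any group of >= 3
    cooperators harvests jointly, each gaining 2 food. Agents at 0 food die.
    strategies: list of "C" (cooperate) or "D" (defect), one per agent.
    Returns the set of surviving agent indices."""
    food = {i: start for i in range(len(strategies))}
    for _ in range(rounds):
        coop = [i for i in food if strategies[i] == "C"]
        for i in list(food):
            food[i] -= 1                      # metabolic cost every round
            if i in coop and len(coop) >= 3:  # joint harvest needs 3+ cooperators
                food[i] += 2
            if food[i] <= 0:
                del food[i]                   # starvation
    return set(food)

print(run(["C", "C", "C"]))  # three cooperators sustain each other indefinitely
print(run(["C", "C", "D"]))  # only two cooperators: the harvest never fires, all starve
```

The point of the sketch is the threshold dynamic: once the pool of cooperators drops below the critical size, even the remaining cooperators cannot save themselves, which is exactly the failure mode of the final-three endgame described above.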
Basically, I’ve spent half a decade believing that ONE of the reasons we’re going to die is that the real science here (the coherent reasons offered by psychologically discoverable people who iterated on their moral sentiments until they reached the best equilibrium found so far) is rare, and unlikely to be put into digital minds on purpose.
Also, institutionally speaking, people who talk about morality in coherent ways that could risk generating negative judgements about managers in large institutions are often systematically excluded from positions of power. Most humans like to have fun, and “get away with it”, and laugh with their allies as they gain power and money together. Moral mazes are full of people who want to “be comfortable with each other”, not people who want to Manifest Heaven Inside Of History.

We might get a win condition anyway? But it will mostly be in spite of such factors, rather than because we did something purposefully morally good in a skilled and competent way.

Based on the Heinz dilemma, it’s more advantageous for the state (in terms of alignment with its domestic policy) to have a law-abiding AGI than a philosophically competent one; nobody needs a repeat of Socrates’ fate. I can even imagine that, at some point, lawyers will be called in to draft constitutional amendments granting artificial intelligence legal rights.

If you’re talking about secret AGIs in government deep-underground data centers, then philosophical competence is even less necessary there.