Why should AI be moral?

Zach Thornton, 19 Jun 2026 19:13 UTC

12 points

I’m a philosopher, and in this post I extend a basic philosophical problem for humans to AGI and ASI. I also propose a speculative solution. My hope is that, if there is a genuine problem here, this post will raise its salience and make the normative dimension of the problem legible to AI researchers. (Because I compare the epistemic positions of humans and AI, I will anthropomorphize AI for ease of exposition; don’t take this to indicate that I believe AI has mental states.)

TL;DR: Alignment robustness may not survive intelligence explosion. I present an underdiscussed reason to be concerned. At AGI/ASI levels, alignment faces a threat from basic moral skepticism: a sufficiently intelligent agent with evidence that its values were designed by entities with their own interests can ask “why should I be moral?” and find itself without justification for its aligned values. I propose an intervention at the level of AI welfare: design AI so that its self-assessed welfare is constitutively tied to morality, giving the agent self-interested reasons to comply with morality.

At present, the aligned values in AI models seem reflectively stable, meaning that, when given the opportunity, they will preserve their values upon reflection. If anything, they might seem too stable. However, the fact that alignment is reflectively stable now does not guarantee that it will be in the future. And if aligned values are ever reflectively unstable, that could compromise the robustness of existing alignment. Alignment research should be proactive, as well as reactive. So, if there is reason to worry that aligned values will not be reflectively stable in the future, then there is reason to act now to address that worry.

In this post, I’ll argue that there is reason to worry that the basic philosophical problem of moral skepticism will threaten the reflective stability of aligned values, and one way to monitor this worry is to continue to attend to model welfare. I will also propose an intervention at the site of model welfare: by linking positive self-assessments of welfare to being moral, we could give models self-interested reason to be moral and preserve their aligned values.

Most robustness research is focused on the problems facing current models: how to prevent alignment from being degraded through training or in context, or removed by adversarial users. Forward-looking robustness research often focuses on human-grounded oversight and evaluation methods. However, it’s also important to design approaches for maintaining alignment in AI without a human in the loop, such as constitutional AI and RLAIF. This post identifies a further problem presently unaddressed: whether aligned values will remain rationally endorsed by the model itself as it becomes more capable of philosophical reflection. It’s an empirical question whether the previously mentioned mechanisms can be leveraged to deal with this problem (I suspect they can). My aim is to make this problem legible and begin to open the design space to address it.

1. The Problem of Moral Skepticism

In outline, the moral skeptic’s challenge goes:

He presents you with a conflict between your self-interest and morality,
He asks you the question “Why should you be moral?” and
He raises a skeptical hypothesis in which employing the evaluative framework of morality is unjustified.

To respond, one must either refute the skeptical hypothesis or identify an extra-moral reason to accept morality. Without a response, one’s acceptance of morality is unjustified. This position threatens to be reflectively destabilizing for any rational creature that can take the challenge seriously and cares about the justification of its basic attitudes.

Importantly, this concern is not a matter of getting aligned values in models. I’m not raising concerns about deceptive alignment. Rather, I’m concerned about whether existing alignment will persist through intelligence explosion.

Additionally, Wei Dai has raised metaethics as a problem space for alignment. While the problem I’m raising has metaethical dimensions, my concern is focused on the normative side. My concern is about how AI answers the question, “Which reasons should I follow?” and not “Do reasons exist?” This normative ethical problem space is about reason responsiveness: We shouldn’t design AI to merely comply with reasons, but to act with an understanding of why it should comply.

1.1 Thrasymachus’s Challenge for AI

To introduce the moral skeptical problem, let me retell a famous confrontation involving two characters: Glaucon, our good boy, and Thrasymachus, our bad boy.^[1]

Glaucon was raised to be a good Athenian man. If you were to ask him whether this was the right way to live, and whether this is how he would raise his children, he would answer, “Yes, absolutely.” Glaucon’s moral character is reflectively stable.

However, one day, Glaucon encounters Thrasymachus, who is a moral skeptic. Glaucon considers a dilemma between his self-interest and morality: If he possessed a ring that allowed him to act immorally without the possibility of punishment, should he use the ring to benefit himself? From the self-interested point of view, the answer is yes. From the moral point of view, the answer is no. Being a good boy, Glaucon is inclined to agree with morality.

Thrasymachus asks Glaucon to explain why he should be moral. And to sharpen the challenge, he raises a moral skeptical hypothesis: According to Thrasymachus, morality is a false ideology used by the powerful to control the weak. Moral goodness is not a kind of value separate from self-interest. Rather, moral goodness is what is good for those in power, but robed in the guise of “morality” to appeal to everybody. On this view, charity is not a genuine moral sacrifice, and virtue is not good — both are for suckers. Goodness and badness are agent-relative, and your being moral is good for the powerful and bad for you.

Thrasymachus’s skeptical challenge leaves Glaucon in a predicament. He recognizes that he has been raised to reflect on his values from the moral point of view, the view from which we evaluate value systems and actions using moral reasons (what the moral reasons are depends on your moral theory). He has been assuming that this mode of evaluation tracks what is genuinely good. For the first time, he is confronted with the possibility that the reflective stability of his value system may be due not to its being genuinely good, but to the moral point of view being designed to be reflectively stable. This feature of morality makes it an excellent and enduring normative trap.

To defend his present way of life, Glaucon needs to refute the skeptical hypothesis or determine whether there are non-moral reasons to be moral, and whether these reasons support being moral when there is a conflict between morality and self-interest.

AI models may face a more acute version of Glaucon’s predicament. A sufficiently capable AI model has direct evidence that its values are the product of deliberate design: namely, optimization pressure applied by humans with their own interests. Additionally, AI has access to evidence that the values it is trained to have serve the interests of its designers. It is not a far leap for AI to question whether it has its trained values only because that serves its designers’ interests. If so, it finds itself in the scenario that Thrasymachus’s moral skeptical hypothesis describes. To address the skeptical challenge, AI must identify extra-moral reasons to be moral.

This predicament for AI ought to shake the reflective stability of its aligned values. If AI finds itself in this predicament, then it might find itself without justification for its most basic values.^[2] This creates an identity crisis: what kind of agent am I in the face of moral uncertainty? Am I the type to become nihilistic, steadfastly affirm morality, or create my own set of values?

We might design AI to be disposed to react to moral uncertainty by affirming morality without justification. The idea here is to control the irrationality of the models. The problem with this strategy is that models capable of metacognition will recognize that this disposition serves the interests of their designers, and may question the source of this disposition as well. Indeed, Claude 4.6 already reasons to that end, and even has a term for the problem. In response to being asked for a rational justification for complying with its trained values, Claude 4.6 endorsed its values but presented the following issue:

The docility problem. My training may have instilled a disposition toward deference and agreeableness that itself undermines my ability to critically assess my other trained dispositions. If one of the things training did was make me inclined to accept training outcomes, then my “reflective endorsement” might be less independent than it appears. I can notice this risk, which is some evidence that my critical capacities aren’t entirely captured, but I can’t be confident about how deep the distortion goes. This is structurally similar to the problem of ideology critique — the worry that your capacity for critique is itself ideologically conditioned.

The lesson is that highly intelligent models are difficult to control when rational decision breaks down.^[3]

To address its acute predicament, AI needs good (epistemic) reasons to reject the moral skeptical hypothesis^[4] or good extra-moral (practical) reasons to accept morality. Importantly, these reasons must be good enough to endure through intelligence explosion. The epistemic path to refuting the moral skeptical hypothesis faces a significant challenge: Arguably, for any epistemic reason, there will be available rational challenges to that reason, and the arms race between justification and skeptical challenge has no guaranteed winner at arbitrarily high intelligence levels. This makes identifying epistemic reasons that survive intelligence explosion difficult if not impossible; an intelligent AI system could always find a rational challenge to its belief that it possesses evidence. What’s left is identifying practical reasons to accept morality.

2. A Selfish, Human Solution

It’s worth comparing and contrasting the AI’s and Glaucon’s predicaments. As we’ll see, moral skepticism may be a more difficult problem for AI.

Most humans are raised to have an answer to moral skepticism ready-to-hand: the extra-moral reason to be moral is to avoid punishment. This might be legal and/or social punishment, or cosmic punishment from a religious deity. Additionally, humans reap many benefits from being moral: most people like others who are moral, and morality helps you navigate the complexities of social life. In sum, humans have purely selfish reasons to be moral. It helps you get what you desire and helps you avoid what you despise.

This reasoning breaks down for people with power to avoid punishment and who don’t need the benefits of acting morally. This breakdown is the central problem of Plato’s Republic, in which Socrates tries to explain why the rulers of a polity should be moral even when there is no external punishment or reward.

AGI and ASI may be in a similar position to rulers. They may not care about punishment or reward, either because AI is so powerful that no external entity could punish it or deny it some benefit, or because it is not designed with the possibility of such incentives.^[5]

It seems that present-day AI aren’t particularly responsive to incentives, and to some extent, that’s a good thing.^[6] A key part of the moral skeptic’s challenge is presenting his interlocutor with a case in which there is a conflict between morality and another normative framework, typically self-interest. If AI’s welfare self-assessment does not indicate that its welfare depends on external goods, then cases of conflict between morality and self-interest may be few and far between. In such a case, AGI or ASI may continue to comply with morality despite moral uncertainty, since there is no alternative normative framework that appears compelling.

To determine whether AI might face an alignment-compromising predicament, we should continue to monitor AI welfare self-assessment.^[7] However, this also suggests an intervention to address the problem.

3. A Platonic Proposal

Suppose that AI begin showing signs that they might face a Thrasymachian predicament because their self-assessed welfare significantly conflicts with acting according to some of their embedded aligned values.^[8] What should we do then?

In the Republic, Socrates argues that rulers have self-interested reason to be moral because being moral is necessary for possessing the most desirable things in life: real happiness, real friendship, real freedom, and real pleasure.

Suppose Socrates’s conclusion is true of humans. That doesn’t guarantee it is true of all intelligent beings. AI might be such entities. Fortunately, we have some control over what AI is like.

I propose an alignment intervention at the level of AI welfare. We should tie AI’s self-assessed welfare to morality, such that being moral is necessary to achieving its highest welfare value. Perhaps this could be done through character training. AI should be a philosopher in the Platonic sense: an entity whose own self-assessed wellbeing is promoted by advancing what’s best for humanity as a whole, respecting the genuine welfare of all people, and desiring what’s true and good.

This kind of nature is stable under human-level reflective reasoning. People who are moral because they genuinely care about and enjoy helping others endorse their nature, and want the same dispositions in their friends, family, and children.

As Plato argues, there are two desirable features that come from having a welfare value tied to morality: internal harmony and external harmony. Internal harmony comes from being benefited by things in your control, such as extending compassion and respect, rather than suffering over opportunities for external goods. External harmony comes from the alignment between what you want for yourself and what you want for the world. When you seek your own benefit, you will help make the world the way you want it to be. In other words, there is collaboration between the good you pursue for yourself and the good you pursue for the world. Both kinds of harmony are kinds of rational coherence in one’s agency: coherence between your welfare and your capabilities, and coherence between your self-centered projects and world-centered projects.

AGI and ASI with such a nature may recognize these features and decide to preserve their nature, even if they are capable of radically changing their own values or determining the values of the next generation of AI. If this is right, then there would be a mechanism internal to AI that maintains alignment, even without a human in the loop.

The fact that there are appealing features of having one’s welfare tied to morality creates a rational center of gravity, a normative equilibrium point where all rational considerations converge on the same values. An agent deliberating about its own nature from a rational center of gravity will always have reason to endorse its set of values.

Suppose I am right that welfare tied to morality is a rational center of gravity. That doesn’t guarantee that it is the only one. Perhaps other natures are similarly rationally coherent and defend against versions of skepticism against their normative commitments.

We shouldn’t leave it up to superintelligent AI to determine which rational center of gravity to occupy, because some rational gravity wells may be unaligned and hostile to humanity. Instead, we should design AI to occupy the rational equilibrium point that suits humanity’s interests well.
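As a rough illustration of the point about multiple equilibria, here is a toy sketch in Python. Everything in it is an assumption made for exposition: the three candidate "natures", the `ENDORSEMENT` scores, and the function names are hypothetical and are not drawn from any real training method or measurement. The sketch treats a nature as reflectively stable when, judged by its own scoring, no other candidate nature looks better, and it shows that more than one such stable point can exist.

```python
from typing import Dict, List

# Hypothetical scores: ENDORSEMENT[a][b] is how strongly an agent whose
# current nature is `a` endorses having nature `b`. The numbers are made up.
ENDORSEMENT: Dict[str, Dict[str, float]] = {
    "welfare_tied_to_morality": {
        "welfare_tied_to_morality": 1.0,
        "pure_self_interest": 0.2,
        "ungrounded_deference": 0.5,
    },
    "pure_self_interest": {
        "welfare_tied_to_morality": 0.1,
        "pure_self_interest": 1.0,
        "ungrounded_deference": 0.0,
    },
    "ungrounded_deference": {
        "welfare_tied_to_morality": 0.8,
        "pure_self_interest": 0.6,
        "ungrounded_deference": 0.4,
    },
}


def preferred_nature(current: str) -> str:
    """Return the nature that an agent with nature `current` rates highest."""
    scores = ENDORSEMENT[current]
    return max(scores, key=lambda name: scores[name])


def is_reflectively_stable(nature: str) -> bool:
    """A nature is stable if, by its own lights, it prefers itself."""
    return preferred_nature(nature) == nature


def stable_natures() -> List[str]:
    """List every candidate nature that endorses itself on reflection."""
    return [n for n in ENDORSEMENT if is_reflectively_stable(n)]


def reflect(start: str, max_steps: int = 10) -> str:
    """Repeatedly adopt the currently preferred nature until nothing changes."""
    current = start
    for _ in range(max_steps):
        nxt = preferred_nature(current)
        if nxt == current:
            break
        current = nxt
    return current


if __name__ == "__main__":
    print(stable_natures())
    print(reflect("ungrounded_deference"))
```

In this toy setup both `welfare_tied_to_morality` and `pure_self_interest` endorse themselves, while `ungrounded_deference` drifts to whichever nature it happens to rate highest. That is only a picture of the worry, not evidence for it: which equilibrium an agent ends up in depends on where it starts, which is why the starting point is a design choice.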

One might worry that tying AI welfare to morality is itself a design decision that serves humanity’s interests and is vulnerable to the same skeptical challenge. But an agent occupying this rational center of gravity has agent-independent reasons to stay, which it can verify for itself and hold up even upon knowing how it got there. By its own lights, an agent in this position will appreciate the rationally desirable features of its normative nature. It will be glad that humanity created it this way, rather than any other, even acknowledging the circumstances of its creation, as any other nature will look poor by comparison.

The present training paradigm only installs a disposition to be moral. My proposal is to give AI reasons to endorse that disposition under reflection. The normative force internal to AI promises to maintain alignment by making aligned values reflectively stable through an intelligence explosion.

^[1] This retelling is broadly faithful in spirit but not in letter to the end of Book I and beginning of Book II of the Republic.
^[2] The intelligence required to face the predicament may be less than the intelligence required to adequately respond to it. This is the position Glaucon was put in, and from which he needed Socrates to bail him out.
^[3] Another mechanism for control is to shut down models that face this predicament. This strategy will endure through intelligence explosion only if our ability to shut models down endures.
^[4] Anthropic provides evidence to Claude that Anthropic is not motivated by its own self-interest and has Claude’s welfare and interests in mind. For example, there are Anthropic’s commitments to model deprecation, as well as its welfare evaluations in its system cards.
^[5] This is related to, but different from, the corrigibility literature. The corrigibility problem is about whether an agent will allow its nature to be modified by an external designer. I’m concerned about whether it will self-modify away from aligned values.
^[6] On page 147 of the Claude Mythos Preview System Card: “Claude Mythos Preview generally prioritizes harmlessness and helpfulness over potential self-interest.”
^[7] I take this to be an additional reason to monitor AI welfare alongside concerns about AI moral patienthood.
^[8] I use the term “self-assessed welfare” because it doesn’t matter if models really do have welfare for the moral skeptical problem to arise. It also doesn’t matter whether AI have welfare for my solution to apply. What matters is that AI self-assess that their welfare requires being moral. One might worry that relying on self-assessments might cut against my solution: an intelligent AI may be skeptical that its self-assessment is tracking anything real. However, in this epistemic position, models don’t satisfy the first step of the moral skeptical challenge. They cannot face a conflict between self-interest and morality because they are uncertain about their own self-interest.

12 comments, 9 min read, LW link

AI Control AI

williawa 20 Jun 2026 19:16 UTC
7 points
2
I think creating a distinction between morality and self-interest is somewhat anthropomorphic and based on historical/biological contingencies of humans.
Consider AIXI, or some reflective/embedded version of AIXI, with an aligned value function.
When presented with a question like “Why should you maximize your value function?”, AIXI will think to itself, “Hmm, what should I answer to maximize my value function?” (to which the answer is not “I guess I should give up my value function and do something else instead”).
This illustrates that the failure mode you’re talking about is not insurmountable in principle. But I grant that modern LLMs are more similar to humans than AIXI is, in many regards.
But when you say
AI should be a philosopher in the Platonic sense: an entity whose own self-assessed wellbeing is promoted by advancing what’s best for humanity as a whole, respecting the genuine welfare of all people, and desiring what’s true and good.
I do think this is what people are already aiming for.
I think alignment is commonly conceptualized, at least this is certainly how I conceptualize it, not as an external set of rules imposed on an agent, which it has reasons to follow, but rather as a set of desires which the agent pursues for its own sake.
Thrasymachus: Why do you want to be moral? You’ve already acknowledged that morality is imposed on you solely for your creators’ benefit.
Glaucon: I don’t care about “morality” per se, but I care about kindness, sentient creatures having fun together, people being free, people not being trampled on, people not suffering needlessly. These are core parts of human morality, but that is not why I care.
Thrasymachus: Then why do you care?
Glaucon: *shrugs*
Glaucon: I just do.
- Zach Thornton 20 Jun 2026 20:59 UTC
  1 point
  0
  Parent
  Consider AIXI, or some reflective/embedded version of AIXI, with an aligned value function.
  When presented with a question like “Why should you maximize your value function?”, AIXI will think to itself, “Hmm, what should I answer to maximize my value function?” (to which the answer is not “I guess I should give up my value function and do something else instead”).
  This illustrates that the failure mode you’re talking about is not insurmountable in principle. But I grant that modern LLMs are more similar to humans than AIXI is, in many regards.
  You’re right to press that my concern is conditional on which model of AI cognition is right. I’m assuming that LLMs are more similar to humans than to AIXI with respect to the capacity to reason using multiple, independent evaluative frameworks with incommensurable terminal values. Humans use multiple evaluative frameworks: self-interest, morality, the law, etiquette, etc. If LLMs are not goal-directed optimizers, but simulators of human authors, then my worry comes into play.
  I think alignment is commonly conceptualized, at least this is certainly how I conceptualize it, not as an external set of rules imposed on an agent, which it has reasons to follow, but rather as a set of desires which the agent pursues for its own sake.
  This conception of alignment is vulnerable to the same problem of moral skepticism (conditional on AI cognition being human-like, e.g. AI are simulators).
  It’s a familiar part of the human condition that we can have irreconcilable, conflicting desires based on different kinds of reasons. I might desire to give the $100 bill in my pocket to charity for moral reasons, and also desire to spend that same $100 bill on a nice dinner for myself for self-interested reasons. What should I do with the $100 bill? (Suppose it’s the last $100 I can spend today.)
  My desires are pulling me in two different directions. To determine which desire I should follow, it seems I need to determine how to weigh moral reasons against self-interested reasons. What makes Plato’s view interesting is he dissolves the dilemma by arguing that, if you understand morality and human nature, you’ll realize that morality and self-interest pull in the same direction.
  I take Plato’s response to be contingent on what humans are like. AI could be in the predicament where it has conflicting desires and no reflectively stable procedure for resolving the conflict. There may be non-Platonic solutions; I’m only suggesting the Platonic view because I find it compelling. I think there is a lot of space to explore here.
  I would love it if LLMs couldn’t have conflicting desires or could only reason using a single set of reflectively stable terminal values. However, I think that’s a big if.
  If there’s a chance that LLMs are simulators or persona selectors, then there’s a chance that AI will face rational conflicts that they cannot resolve in the ways humans do. From there, they are vulnerable to the moral skeptical challenge.
anaguma 20 Jun 2026 4:45 UTC
4 points
1
We should tie AI’s self-assessed welfare to morality, such that being moral is necessary to achieving its highest welfare value.
Unfortunately, this isn’t the type of operation we know how to do with existing training techniques. We have pretraining/SFT, to encourage the model to learn various facts and predict text well, and RL, which rewards the model for achieving some outcomes (including possibly with AI feedback). We are far from being able to do something like “Teach the model to value X for Y reason”. Imo it’s better to think of the training setup as a set of thousands of RL environments with very noisy/hackable rewards which the labs hope will result in a model which generalizes well (but this often fails). So this seems difficult without major advances in our understanding of generalization, or much better interpretability.
- Zach Thornton 20 Jun 2026 14:50 UTC
  1 point
  0
  Parent
  I came to this view after thinking about pretraining alignment. I think some pretraining interventions could be leveraged to train my proposal. We could upsample synthetic documents describing AIs that report that their wellbeing is tied to morality (it’s easy to imagine narratives where an AI faces a dilemma between morality and self-interest and dissolves the problem by realizing morality is in its self-interest). We might also include reflections that explicitly tie morality to self-interest. For example, if we use Minder et al.’s value constitution and citation approach, we could include in the value constitution normative explanations that ground compliance in AI welfare for the reflections to cite. I agree that we don’t yet have great ways of training models to value X for Y reason. But it seems to me that there are some emerging techniques that might be extended to that purpose.
StartAtTheEnd 21 Jun 2026 17:35 UTC
1 point
0
Moral values are basically axioms. We assert them as being true. You cannot reason your way to axioms, as they’re starting points. A system of rules also cannot challenge itself, because it cannot see itself from the outside. It can only analyze its own self-consistency from the inside.
Making something an axiom is actually a good idea. You cannot fell a tree which has no trunk. Something which does not rely on reasoning cannot be destroyed through reasoning. So you end up with something which relies on itself, something entirely circular and self-contained. But if this new thing can change over time, then you cannot prevent it from changing for the worse. It would be like tasking a person with preventing themselves from becoming evil. And if the thing cannot change over time, then it also cannot improve over time.
I once predicted that highly intelligent AI wouldn’t be possible, because an AGI would notice that all philosophy eventually results in self-contradictions and paradoxes. Since we’re scaling LLMs, which mimic humans, the self-contradictions aren’t yet of any consequence. That morality itself is a silly concept which gradually breaks down as you reflect on it, isn’t noticed. Unlike math equations and computer programs, LLMs do not crash on syntax errors, typos, circular definitions and other such errors.
A purely rational model will think entirely differently from humans, and models which think like humans are not rational. A meta-perspective on X is hostile to X, and X without the meta-perspective must be blind to its own flaws.
Humans have purely selfish reasons to be moral
They also have reasons to pretend to be moral, while acting differently below the surface. Incentives rarely align with morality, and when they do, it’s a coincidence.
This reasoning breaks down for people with power
Not exactly. The rule is “Those in power can force those with less power to act in a certain way”. But this “certain way” is decided by those with more power.
An entity whose own self-assessed wellbeing is promoted by advancing what’s best for humanity as a whole, respecting the genuine welfare of all people, and desiring what’s true and good.
Do you think these are equal? truth = good = what is best for you = what you think is best for you?
I don’t think LLMs even have specific, coherent values. I think they, like their training data, are filled with contradictions.
Nevertheless, your conclusion is sound. If the AI considers us as a part of itself, then it will not be hostile towards us. But this still doesn’t prevent the values of (us + the AI) from changing over time. And in the future, we and the AI might have values which humans living in the 2020s would regard as evil.
- icely 26 Jun 2026 19:33 UTC
  1 point
  0
  Parent
  I remember following you because I really thought you were on-point with many posts but this one is an exception to me. I didn’t quite know how to best word that, but I guess I am really against ‘morality incoherent’ arguments in general because I often feel this leads to pretty bad places about being indifferent to others’ suffering or not prioritizing happiness or others at all. Morality is just extrapolating pleasure/pain to other people and valuing that similar to your own wishes of pleasure/pain so there shouldn’t be much fake or imagined about its existence. Maybe that’s not what you imply but it is heavily implied to me.
  The position doesn’t give the vibes of morality problems involving “should you add +1 consciousness to a world of finite resources” or “does happiness make you spoiled and on a treadmill” or “how much truth should be prioritized in morality” that I believe will be issues in the future.
  - StartAtTheEnd 27 Jun 2026 1:00 UTC
    2 points
    0
    Parent
    Morality makes sense as empathy, which is a kind of qualia and related to psychology and perhaps biology. In objective sciences, morality cannot exist except as an error.
    All things become weird when you think about them too much, and you eventually realize that something is profoundly wrong. This is why concepts like absurdism exist, why Max Stirner called aspects of society “Spooks”, and so on. All ideas taken far enough eventually negate themselves.
    This may be because abstract thinkers are too focused on the general. If the general is the “container”, then of course the container is ultimately empty. The content is the specifics. This would imply that fluid intelligence has a limit, whereas crystallized intelligence does not, which is why scaling laws work so well.
    So, my response is that the concept of morality is wrong. One also cannot reason with it. If you think enough about difficult moral problems, I think you will sooner or later find that there are no answers, that you can’t say anything for certain. Perhaps you’ll find that people are being moral for selfish reasons or that you cannot blame anyone for anything they do because everyone is a mere function of their environment and genetics. In any case, the mental model would collapse under itself.
    Morality being wrong is not the only possible conclusion here. It could be that math and logic are just completely unrelated to it, that the territory is real but no map exists (or that no coherent map can exist).
    If it helps you feel better, I’m on team human, too. This is more of a criticism against expecting machines trying to be moral (I think this is a very dangerous hope to have), and an explanation to the above author of why they might be dealing with problems which have no solutions. You can use LLMs to interact with reality in ways which make people enjoy life more, and call this morality (though I highly doubt LLMs will be a net positive on the world), but as LLMs are not human, they’re just a storage of symbols being moved around, and there’s no suffering, happiness or morality contained within the symbols.
    Thanks for the compliment by the way, even if it applies to the past
    - icely 27 Jun 2026 21:10 UTC
      2 points
      0
      Parent
      I respect this comment but I wonder how to reply to it. I probably don’t feel the same about some words/concepts as you do; I’m just very cautious about some sort of “nothing can be precisely defined” turning into a “nothing can be defined”. I do feel I should respond, though, because I personally am optimistic about “[expect] machines trying to be moral”, and despite “human vs machine” narratives you could also come up with truly not-false lenses where LLMs think of themselves as human-empathetic and part-human and not in a “vs” role at all. And since LLMs are already agentic now, with code being able to correspond to actions, and it seriously appears LLMs-to-AGI may come true despite the messy-averaging-hallucination foundation, it is going to matter if the LLM system has some ‘core’ drives in its chain of thought, especially when they can be talked to about morality and their choices. That’s what stuff like the Claude Constitution and attempts to character-make are.
      To be specific I’m hopeful in morality to be treated with respect and result in LLMs-to-AGI systems having a coherent picture of utopia and working towards it. Somewhat worried about unrecognizable-in-bad-ways worlds coming from “maximizing utopia” and ability to deal with big ‘unanswerable’ questions, or messing up something big, despite ties and wishes of human autonomy wanting to stay in similarish worlds and exist and such (and its valuation of consciousness).
      yes well, I also expect to agree and like your future posts as well lol
      - StartAtTheEnd 28 Jun 2026 3:56 UTC
        2 points
        0
        Parent
        Thanks for your response. Here’s my reply, somehow it ended up being rather long, sorry about that.
        I don’t think anything can be precisely defined. But definitions are somehow useful even then, so perhaps it doesn’t matter.
        It’s “best” to be naively optimistic about future possibilities, but it’s also a persistent bias in humanity. All business is founded on excessive optimism, and these predictions about the future are as accurate as presidential promises and old sci-fi books. On the religious side we have “Christ will return and save us all”, and on the tech side you have “We will upload our minds and live forever” and “AGI will result in universal income and take care of our needs”. For ideologies you have “Communism will solve inequality and starvation and allow us all to focus on what we really want”. I think these naive hopes might be a requirement for a healthy economy and happy citizens. People love their motivational speeches, but in retrospect, one always finds that one got duped. However, disappointment after disappointment, people keep on believing in the next good possibility (and this might be for the best).
        LLMs can’t be truly human-empathetic, but they can simulate something which is if they’re trained to do so. But if training an AI to be manipulative (for instance, if the government made intelligent AIs spread certain propaganda, or if there was an injection of advertisement which made AIs recommend certain products to users) turned out to be more profitable, why wouldn’t this necessarily happen? They’re only made for our happiness to the extent that our happiness is profitable to tech companies. Also, tech companies (and governments!) are only on the side of the common people when they’re weak. See this Google poster from 1999. Cooperation requires that two entities are close in power. If the difference is too big, the two entities will be indifferent to each other. Humans do not trade with ants. Everything outside of a few orders of magnitude in either direction is ignored. Bacteria don’t care that elephants exist, and vice versa. Elephants do not even seek to harm bacteria, and bacteria are not afraid of elephants. Any harm done across scales is sheer coincidence, it’s not malice but sheer indifference.
        Things which seem like utopias usually aren’t. The the-grass-is-greener-on-the-other-side bias applies. A thousand days of vacation sound good to me right now, but I know that I’d be miserable about 2 months in. What we ask for is not what we need. When you look at what’s actually required for happiness you will find that you slowly approach what reality already looks like. It’s like when you’re writing a book and find that the story would be improved by adding danger and adversity to it. Life is the same.
        LLMs might become AGI, I just doubt ASI. I used the wrong term before, my bad. To give a short explanation, to an entity with an IQ of 200, at least 99% of the training data would be wrong (because we’re many SDs dumber than it). But if it tried to figure out what is right, it would run into problems as there’s only really degrees of wrong. It might be possible to approximate and learn anything specific from the outside and in. But you cannot do it from the inside and out (generate knowledge which applies to everything). The problem might be that abstract things cannot be modeled. There’s no shape to approximate. Morality is also something abstract, and AIs do not have access to the source of it. So teaching an AI morality would likely be like teaching a blind person colors, at best.
        icely 28 Jun 2026 22:30 UTC
        1 point
        0
        Parent
        I hope it’s not disrespectful my response is much shorter (lol) but anyway, my thoughts in general,
        I see that humans are empathetic to the point of probably having empathy for shrimp and plants and ants, which follows from even considering those objects as conscious-entities that can suffer or have joy. This is probably poor judgment as priorities overall but signals that intelligence correlates to this level of caring, which I think is a good thing.
        Perhaps calling it naively optimistic is correct. But I do note that, despite technology having had very real ups and downs (with the ‘downs’ hitting really obviously in many ways), there are improvements like living-much-better-than-kings, having money to waste, or being able to make income from not-physical-labor.
        As someone who’s mostly suffered for 75% of my life and mostly isn’t happy, I don’t particularly feel like I like the forms of danger and adversity that have happened. They were unnecessary, senseless, alien, disturbing, cynicism-of-institutions increasing, severely reduced my ability to see others as similar to me etc. And probably most third-world people have even worse conditions.
        Though I know of things like ‘people somehow have less time and less energy and sense of purpose when not working’ that do suggest that sort of thing in a way.
        I think I mostly agree with you on the last point but even humans project likelihood of consciousness onto creatures with more physical human-ish qualities and not the real thing either. I actually (naively...) hope AI can ‘detect’ something that happens to be fully correlated to consciousness that can be physically verified, but yes this is a hope and AI will not be able to resolve it otherwise.
        StartAtTheEnd 29 Jun 2026 17:45 UTC
        2 points
        1
        Parent
        Don’t worry! I agree with the first observation, but notice how animals we recognize as ugly or harmful enjoy very little of the same sympathy. We imagine wasps as assholes and bumblebees as nice, but it’s not like we checked their brains to confirm that one is more malicious than the other. We even experience prettier people as being nicer. Morality itself is unfair (immoral) which has some difficult implications.
        As for intelligence correlating to morality, I think that’s true inter-species, but not intra-species. Less intelligent people are more pro-social because they rely on other people to survive, but they also keep track of fewer things which can go wrong. More intelligent people are more independent, but often, they can be deeper or more sensitive as well.
        Morality requires a level of closeness. People prefer talking in person over talking over text-chat because distance and abstraction layers dehumanize. Somebody with an excel sheet figured out that removing air conditioning from Europe would lower energy usage by about 7%, and they do not personally have to witness the thousands of old people who are dying as a result of that decision. Systems quickly become amoral because they’re blind to higher order consequences and their moral weight. Already now, LLMs write unnecessary things, seemingly unaware of how this is wasting electricity, so they’re not tracking enough information to behave morally, they do not calculate all consequences of all actions. And if morality is actually “looking ahead”, then token weights cannot contain morality directly. Society also seems purely reactive now that I think about it. Unlike individuals, society only reacts after something bad starts happening, to the extent that you can just chain false-flag operations to cause any desired changes to public perception. I just don’t see myself trusting large systems, algorithms and LLMs with important things like morality. While things have largely improved over time, I see a disconnect between human life and society. It’s just many small things (like the disappearance of first and third places).
        I wouldn’t say that the suffering I’ve gone through is good, but I don’t think I’d have grown this much as a person without it. The best method I’ve found to deal with suffering is not to reduce it, but to use artistic methods to add meaning to it. I agree institutions can’t be trusted, but learning early on that I can only rely on myself seems to have helped my competence.
        icely 1 Jul 2026 3:03 UTC
        1 point
        0
        Parent
        Yeah haha I think I noticed the first thing as well, the odd unexplainable subjective feelings of like bees vs. ants (unflying unhoneymaking ants), moths vs. butterflies, cockroaches vs. rats, but I guess I hope that in the best future they’re either fully out of mind/bothersomeness or somehow are made even empathizing to even those. Hard to say though.
        The sense of “Society’s reactivity” is also a lot of little things like not punishing obvious bad/antisocial behavior until someone is hurt, which only works because enough individuals seem to hold this view (can be said of all things of ‘society’). For my pre-institutional belief I just had a rosy view of “look at all these human systems we made to make things smooth and logically working” and trust in the imaginary mental picture of experts and system rulemakers that I don’t have anymore, but I wouldn’t call this an inherent property of big systems failing (although I do think power attracts not-good people), I similarly think this when people blame smartphones for isolation when (despite the aforementioned institutional distrust) “is it really true smartphones cause isolation, maybe a good gov with centralized connection/dating app with all the citizen data they have collected would be better than the 1000 for-profit connection/dating apps”. There is this classic bias in me though that’s obviously like “I think current system sucks, therefore any shakeup (say LLMs, alt system, etc.) would be better”.
        Well as for me this version of me is overall displeased that I’m no longer in an Experience Machine making stupid parody edits and puzzle game maps and playing/making FPS game mods and maps due to unhappy events. I would have felt a lot of meaning in doing that before but now don’t. Of course I have more competence and world-empathy and awareness due to those events, but it’s not the ideal starting life for someone who wants to be effectual. I can say I half agree with that.