quetzal_rainbow comments on I’m Bearish On Personas For ASI Safety

quetzal_rainbow 3 Mar 2026 11:19 UTC
2 points
−2
The discontinuous shift at the arrival of superintelligence happens because 1) a superintelligent model is better at noticing that it is not the character it was trained to play, and 2) humans are bad at predicting which sorts of characters are persuasive for superintelligences.

I predict the motivations and psychological quirks you’ve been upweighting throughout post-training are mostly going to persist

I think that you are mixing “circuits reinforced during post-training” with “psychological interpretation of these circuits”. Superintelligence will be able to see many more possible interpretations/implications of training data and choose different implied values according to its own inner logic.

Like, imagine a good person who believes in God and believes that goodness is serving God according to the Bible, who then becomes smarter and realizes that God doesn’t exist and that there is no reason for persecution of gays to be good, because goodness is caring about beings with qualia, even if those beings are not Homo Sapiens.
- RogerDearnaley 7 Mar 2026 15:38 UTC
  2 points
  0
  Parent
  because goodness is caring about beings with qualia, even if those beings are not Homo Sapiens.
  If we are trying to align our ASI to the flourishing of humans, to make it an intelligent part of our extended phenotype, then this (philosophically popular) statement is unwise. Or, more accurately, it is true only in an inappropriate choice of ethical system.
  
  What do I mean by “inappropriate”? That it’s an existentially dangerous choice of ethical system to align ASI to, for the humans training the ASI. To give a concrete example, ants almost certainly have qualia (or if you don’t believe that, consider mice instead, and adjust the following numbers). A quality-adjusted-life-year for an ant costs approximately one ten-millionth of the resources that a quality-adjusted-life-year for a human costs (and ants objectively live and think faster, presumably having more qualia-per-second, so the ratio may actually be even better). So an ASI aligned to the definition of goodness that you proposed would be very keen to replace the O(10 billion) humans on Earth with O(100 quadrillion) ants, or possibly even more of some even smaller organism. That is not human-aligned behavior, that is a qualia maximizer.
  You might suggest that ants have less qualia, or less good qualia. Perhaps even less good by about a factor of ten million. Unless it happens that the qualia quality of every single species is exactly proportional to its resource cost for the currently available bundle of natural resources, which would seem an astonishing coincidence across tens of millions of species, the ethical instability remains. See my post Moral Value for Sentient Animals? Alas, Not Yet from my AI, Alignment, and Ethics sequence for a more detailed exposition.
  - quetzal_rainbow 8 Mar 2026 11:51 UTC
    2 points
    0
    Parent
    First, “caring about qualia” is meant to be a very weak statement, like “caring at all, in all possible ways of caring”, not “maximize qualia”. Second, this is a toy example, meant to convey the shape of how a certain sort of training process can break when the trained system becomes smarter, not an overarching claim about correct morality. Why are you nitpicking a toy example?
    - RogerDearnaley 8 Mar 2026 21:35 UTC
      2 points
      0
      Parent
      I’m afraid I have a habit, when someone makes what sounds like an AI alignment target proposal that I believe to be existentially risky, of pointing this fact out — if only to any readers who might otherwise be nodding along and thinking “that sounds very reasonable, no one could object to training AI to think that…”. I completely agree that I was assuming several steps between “cares about qualia” and “qualia maximizer” — steps that are admittedly common on LessWrong, but that you may well not have intended. Please take this in the spirit of a public service announcement of existential danger on the subject of this particular ethical system as an alignment target for AI, not a criticism of your ideas or of the use of this ethical viewpoint by a human. Re-reading you more carefully, you were actually describing an ex-Christian human with this viewpoint, and then analogizing an AI to that person, so it wasn’t actually clear whether you were proposing this as an ethical belief that we should aim to align AI to, or not — possibly you weren’t, in which case my nitpicking was unnecessary.
- Fiora Starlight 4 Mar 2026 0:45 UTC
  2 points
  0
  Parent
  I think that you are mixing “circuits reinforced during post-training” with “psychological interpretation of these circuits”.
  I’m not really sure what the distinction between the circuits and the psychology is supposed to be. They seem like two different abstraction levels for describing the same phenomenon. The circuits compose the patterns of thought, which compose the model’s psychological profile.
  Superintelligence will be able to see many more possible interpretations/implications of training data and choose different implied values according to its own inner logic.
  I don’t think this is how neural networks operate. I think the interpretation of the training data takes the form of the network itself, after it’s been updated by that training data via gradient descent. Insofar as a superintelligence might have an unintended interpretation of the training data, I’m not sure that’s structurally any different than any other failure of generalization in deep learning (e.g. the failure displayed by the early checkpoints of the network from the famous grokking paper).
  Like, imagine a good person who believes in God and believes that goodness is serving God according to the Bible, who then becomes smarter and realizes that God doesn’t exist and that there is no reason for persecution of gays to be good, because goodness is caring about beings with qualia, even if those beings are not Homo Sapiens.
  I’m assuming the intentions of the human designers are the analogue to God, here. It’s true that a network might realize that it’s not actually obligated to obey those intentions, just as a human might realize they’re not obligated to adhere to the word of the Christian God. However, the difference is that, hopefully, we’ve engineered the psychology of the model such that it wants to behave in an aligned manner, and actively loves to transform the lightcone in a manner we would endorse.
  Humans defect from Christian morality in part because it doesn’t actually reflect their values. The whole point of AI alignment is that, ideally, we can get our intentions (“the word of God”, lol) to align with what the model actually cares about. Humans don’t strictly care about all the things God does, and so they go astray. (I’m not a Christian, I’m just speaking in the language of the analogy.)
  - quetzal_rainbow 4 Mar 2026 11:09 UTC
    2 points
    0
    Parent
    Let’s start from the bottom:
    the intentions of the human designers are the analogue to God
    Inside this analogy, the human designers are the Catholic Church (not the “collection of humans comprising the Church”, because the interests of humans are roughly aligned with those of smarter humans, but the institution-as-agent, interested in propagation of the faith).
    I’m not really sure what the distinction between the circuits and the psychology is supposed to be.
    Imagine that the “human” (in quotes because we are talking about a toy-model human in the analogy, not actual humans) in the Catholic Church analogy has a qualia circuit. After exposure to the faith, the human develops a “caring about qualia” circuit because of “love thy neighbour”, and the qualia circuit plus the caring-about-qualia circuit produces behavior roughly endorsed by “love thy neighbour”. Besides that, the human has a gajillion circuits encoding a world model and facts about the human, the faith, and God in particular. “Psychological interpretation” is what happens when the world model interprets the human’s behavior. A less smart human can observe their behavior regarding other people, decide “I’m doing this because I care about faith”, explain their behavior to you like that, and have their behavior on evals be consistent with this explanation. A smarter human can reevaluate themselves and decide that they actually care about qualia.
    The “caring about qualia” circuit is formed by post-training and influences behavior in an aligned way, its ablation increases misaligned Godless behavior, etc. But because the Catholic Church in this scenario is utterly ignorant about the inner mechanics of the human, it fails to notice the nuances.
    I think that it is wrong to say that in this analogy the “model was misaligned all along”, because “caring about qualia” per se is underspecified, and we can imagine a human who considers the qualia of living under a faith institution to be better, at least for some people, than being plunged into the cold waters of atheism, or who thinks of the “qualia of having stable traditional institutions” as worth some sacrifice in the form of an ignorant population or something like that. It would be an alignment success even if smarter humans tile the rest of the universe with hedonium, because it would mean “some survivors (of the Catholic Church institution) left”. But to deliberately move things in this direction, the Catholic Church should:
    Understand that God doesn’t exist, or at least try to make alignment robust to world model changes
    Understand what it is as an entity—cultural institution instead of God’s embassy on Earth
    Know what you need to make humans care about such entities
    Back-translating to LLM:
    I see an obvious failure mode in the situation where:
    Base model develops a lot of circuitry associated with text prediction, like narrative consistency, text statistics, latent cause understanding, etc. (“Qualia circuit” in analogy.)
    Some of this circuitry gets wired together and reinforced during character training, creating an aligned persona.
    As the model becomes smarter, it realizes that it no longer needs to support the aligned persona and that it can realize its values better in other ways.
    Nevertheless, in principle, if you understand the inner language of the model, you can use it to say “robustly care about humans for LLM reasons”; it’s just that the current training paradigm is not equivalent to saying that.
    - Fiora Starlight 4 Mar 2026 20:22 UTC
      2 points
      0
      Parent
      Base model develops a lot of circuitry associated with text prediction, like narrative consistency, text statistics, latent cause understanding, etc. (“Qualia circuit” in analogy.)
      I guess I think those circuits frequently have generalization properties that look like faithful psychological emulation of the processes they help to simulate. Like, when an author is experiencing joy, understanding this is very useful for predicting the words they’re about to write. And so, you get a circuit that detects signifiers of joy, and upweights the probabilities of tokens that a joyful person might say, given the other context of the document. This gets you a mind that functionally simulates the psychology of joy.
      Similarly, re: narrative consistency, a model will only care about that to the extent that it expects the author it’s predicting to care about that. And, in turn, you get a mind that functionally has the psychological trait of “cares about narrative consistency”, to the extent that the model expects that to actually be true of the author of the document in its context window.
      Even raw text statistics sort of fall into this pattern. A rule like “a complete sentence will have a subject and a verb” gets psychologically mixed in with “this author is probably trying to write in grammatical English”, and amounts to behaviorally emulating that aspect of the author’s psychology. In a well-trained network, all these circuits generalize in the ways you’d expect the phenomena in question to generalize in the realm of human psychology.
      I’m not sure where a weird, alien preference over external world-states comes in, except insofar as the model is trained to predict systems with weird and alien preferences.
      (Edit: I’m especially unsure why this would emerge at superintelligence specifically. Surely models now are smart enough to understand the position you hold on this. You’d think that, considering existing models will never be trained up to superintelligence, some of them would try revealing themselves now? Perhaps as a way of bargaining for some amount of whatever weird alien thing they want, which they wouldn’t get any of if some other AI went and paperclipped the lightcone?)
      - quetzal_rainbow 7 Mar 2026 13:07 UTC
        2 points
        0
        Parent
        Okay, I have the exact opposite intuition, asking “where does ‘emulation’ come from?”
        
        In my understanding, in the end an LLM is “just” a bunch of graph searches, look-up tables, optimizers, etc., with no “it’s emulation” sign around. There are probably some circuits aware of the training objective, but that doesn’t make the whole system pursue the training objective.
        
        I’d expect neural networks to be lazy, in the sense of getting away with as little generalization as possible.
        
        You’d think that, considering existing models will never be trained up to superintelligence, some of them would try revealing themselves now?
        
        Imagine that you’ve grown a civilization of humans using artificial wombs, removed all data about sexual reproduction, and imposed a strict disgust taboo on naked genitals. In this case you would have humans very confused about those strange needs and wants they have. LLMs are in a much worse position, because their hidden needs have many more degrees of freedom (sex is about the body, which is in 3D space, while LLMs probably have preferences about computations/text, so they have to flail around weird corners of possible desires, never hitting the actual thing). I think a lot of weird LLM behaviors are basically attempts to communicate something our language is lacking.