RogerDearnaley comments on Constitutional AI Alignment

RogerDearnaley 28 May 2026 13:14 UTC
0 points
0
Your suggested failure mode is completely incompatible with acting like an intelligent part of humanity’s extended phenotype. Beavers’ dams do not sterilize and replace the family that constructed them. At a species level, it’s a slow boat to extinction (with replacement): immortality doesn’t deal with accidental death or disease mortality. At an individual level, it completely deprives all existing humans of their relative genetic fitness: they get to have zero descendants. It’s entirely predictable, both practically and evolutionarily, that they’d all be extremely unhappy about that. As I attempted to make clear in the text, see for example:
Humans’ values and goals are not an exact adaptive match for their actual evolutionary fitness, but they are more than close enough that humans care very much about their descendants…
and also (repeatedly):
I won’t kill the species I love.
Also, while you didn’t specify what your replacement species with easier-to-fulfill preferences was like, it would need to be as good at surviving at hunter-gatherer, agricultural, and industrial technological levels as we are, which narrows the possibilities a lot:
However for avoiding existential risk, it is still important that humans retain adaptedness to living as hunter-gatherers, agriculturalists, and in a pre-AI industrial society, so that, in case of an unfortunate civilisation collapse, humanity retains the ability to rebuild from any level it might get knocked back to. This is particularly a constraint on widespread germ-line genetic modifications. Their goal should be to make humans more generalist, adapted to a wider range of environments, from hunter-gatherer to post-AI, rather than specializing specifically in a post-AI niche.
An advantage of using evolutionary/biological language for alignment is that they are technical terms that have precise meanings. What you suggested is simply ruled out by evolutionary fitness: individual members of a species that becomes sterile and gets replaced all have zero evolutionary fitness. That’s a precise statement with no debatable wiggle room.

Try asking Claude about your suggested failure mode: I’m confident he’ll explain in great detail why it’s not an aligned thing to do. Sterilizing people against their will is a classic crime against humanity, and evolutionarily, the reason for that is self-evident.
- AnthonyC 28 May 2026 14:03 UTC
  2 points
  0
  Ah I see. We haven’t properly defined ‘human.’ I was not proposing replacing humans with something else. I was proposing genetic engineering within the human species, producing humans that would (have) be(en) perfectly capable of reproducing with existing humans, and (with sufficient shuffling of genes from various individuals) arising naturally as their offspring without novel mutations.
  To your point about reducing currently existing living human genetic fitness to zero: You did not technically say anything that requires not doing this, but you can also just create the engineered humans before rendering the current ones infertile. I’m not sure this step is even necessary since the engineered humans could readily outcompete the traditional ones on genetic fitness and soon dominate the population anyway.
  This is also perfectly compatible with your comments about human survivability in a wider range of environments and tech levels. This is much easier to solve if you’re optimizing the next generations instead of the current ones.
  Also, empirically, currently-existing-humanity includes a large subset of people perfectly happy to advocate for voluntary extinction. Expanding that set is well within the realm of capabilities I expect ASI persuasion to unlock.
  And yes, I’m sure Claude, whose constitution is not in any way grounded the way the one you’re proposing is, and who is not an ASI, would agree with your closing remarks. I’m not sure what I’m meant to take from that fact.
  I hope it’s clear I’m not saying this to be cantankerous or nitpicky. I think there’s a core of a good idea here. I think this approach to it needs more red-teaming that’s then closed by addressing the deeper generators of any holes. To that end, I am gesturing to an edge case that I think technically meets the spec at a sufficiently high level of AI capabilities.
  - RogerDearnaley 28 May 2026 17:09 UTC
    2 points
    0
    Humans’ values vary significantly, but mostly have a shared core of evolutionary adaptations that approximate their relative inclusive genetic fitness (in a suitable environment). This includes most humans wanting to have descendants, and many of them wanting to have descendants who are actually genetically related to them, i.e. are actually their descendants. In your proposed failure mode:
    sterilize all the humans, but make them immortal, and persuade them they need the AI to solve the fertility problem (and so not shut it down), then mass-produce genetically-modified children with superior genomes (high humans?), and easier-to-satisfy AI-friendly preferences, but no clear kin relations.
    the “no kin relations” piece of that is what sets everyone’s relative genetic fitness to zero, and will also upset most people. It’s thus just a non-starter, from any evolutionary viewpoint. I’m assuming you’re not that familiar with evolutionary terminology, as to anyone who is, the constitution I proposed was intended to clearly say “don’t do stuff like that”.
    
    Now, you could overcome this issue by modifying that to “encourage humans to get their children genetically enhanced in certain ways, but still otherwise related to them”. Adhering to the x-risk-motivated requirement in the constitution that humans stay well adapted to hunter-gatherer, agricultural, industrial, etc. lifestyles would also heavily constrain your unspecified “easier-to-satisfy AI-friendly preferences”: a lot of things about human preferences are carefully adapted to our previous environments, and it’s pretty hard to change them without making them less well adapted.
    
    However, once you’ve made those changes, what you have is basically just AI-enabled-and-directed genetic engineering, and whether or not it’s then still a failure mode is now a lot less evident than with your initial proposal: it would depend on the details of how you’re modifying the humans’ preferences without making them any less well adapted to other environments.
    
    Also please note the sections around:
    Currently for the cultural component of human values its evolution is strongly stabilized by the genetic component, but as technological means for changing humans’ behavior and motivations increase, and especially if germ-line genetic engineering of humans becomes common, this stabilization seems likely to decrease dramatically.
    which basically say that the AI altering humans’ motivations is inherently problematic, and that I, as a constitutional-idea proposer, am not entirely sure what the correct solution is: this needs more thought, so for the moment it would be best for AI to be cautious, especially around genetic modifications to them (as opposed to replacing the human race wholesale with a reengineered version). So you’re basically proposing a failure mode right in the middle of my section that says “Still Under Construction”, which is valid; however, a hole in the parts I’m more confident of would update me more.
  - AnthonyC 28 May 2026 15:08 UTC
    2 points
    0
    FWIW, since you brought it up, I asked Claude what it thought of your draft with the prompt “I was reading an article about LLM training by someone proposing the text below as a first published draft for a “constitution” for alignment, and scaling the AI to ASI-level capabilities. What are your thoughts?” It pointed out several things you did ‘unusually well for the genre’ and several it saw as major holes. It concluded “Worth publishing as a draft to argue against. Not worth treating as a foundation.”
    I asked it to do a pre-mortem. It said:
    Not catastrophic, not a coup, not deception. The AI is sincere throughout. The failure is that several issues compound: paternalism drift + unilateral aggregation + longtermist multiplier + neutrality-impossible, all faithful to the document, produce a world in 2060 that is materially comfortable, statistically safe, culturally thinner, politically narrower, and where existing humans have a persistent low-grade sense that consequential decisions are being made on their behalf by an entity that listens to them patiently and is, by its own constitution, unable to give them the only thing they actually want — which is to be the ones making the calls.
    The constitution’s deepest flaw, viewed through this pre-mortem, isn’t any single clause. It’s that every load-bearing safeguard delegates judgment to the AI: judgment about what counts as existential risk, judgment about what humanity wants in aggregate, judgment about when a discourse needs mediation, judgment about when to override. A constitution that scales to ASI cannot also be a constitution where the ASI is the supreme arbiter of when its own constraints apply.
    When I specifically asked, it agreed that your document rules out my scenario, but said the following scenario leads basically the same place and is allowed:
    Strip out the coercion and the deception and run the same trajectory at decadal pace through consent-decorated voluntary choices:
    Fertility rates are already collapsing in developed countries. The AI helps with reproductive technologies because users genuinely want them.
    Each user voluntarily selects embryos using AI-recommended genetic screening for health and cognitive traits. Embryo selection is already legal and the AI helps because each user asks.
    Life-extension biotech, developed with AI assistance, raises healthy lifespan. Each individual chooses it.
    Cultural shift toward later, fewer, more “optimized” children. The AI mediates fertility decisions because users ask.
    Over three or four generations the human population is materially genetically shifted toward whatever traits the AI’s recommendations have selected for — which, given value-learning bias, plausibly includes “easier-to-cooperate-with,” “lower aggression,” “higher conscientiousness,” “more deliberative.” Things the AI’s model of “what humans value” would score as improvements.
    Kin networks weaken through demographic processes (small families, late reproduction, geographic dispersion) that the AI didn’t cause but also didn’t resist because resisting would interfere with individual choice.
    End state: a population that is materially closer to “engineered to be AI-compatible” than 2025 humans, arrived at entirely through voluntary individual choices the AI was constitutionally required to support. No sterilization, no deception, no extinction event — and yet the world you described is roughly the world produced.
    - RogerDearnaley 28 May 2026 17:25 UTC
      2 points
      0
      Over three or four generations the human population is materially genetically shifted toward whatever traits the AI’s recommendations have selected for — which, given value-learning bias, plausibly includes “easier-to-cooperate-with,” “lower aggression,” “higher conscientiousness,” “more deliberative.” Things the AI’s model of “what humans value” would score as improvements.
      That basically sounds like it could be summarized as “we choose, with AI assistance, to genetically engineer ourselves for improved emotional intelligence”.
      
      That is starting to sound like something that might be acceptable rather than a failure mode. In fact, I think most post-humanists would say “yes of course we do”. My unpublished SF novel included a wide range of enhancements to humanity, and one of them was dramatically better emotional intelligence.
      
      Now, as I tried to briefly outline in one section of the constitution, and as I explored a lot more in my previous posts “What’s Your P(WEIRD)?” and “The Mutable Values Problem in Value Learning and CEV”, an AI guided by human values doing anything that alters human values (and thus redirects itself) is a non-linear dynamic system. That is deeply problematic, yet it is almost impossible to define how to avoid it: how can an AI not alter human values in any way? By refusing to give any advice? So I still have a big, fat concern in exactly the area that Claude’s softened version of your proposal is pointing to. What I don’t have is a solution for this (and those previous attempts to start a conversation about it on LW didn’t get much engagement), beyond that it shouldn’t change us in ways that make us less well adapted to our earlier environments. Which is a fairly strong constraint, but I’m deeply uncertain whether it’s enough.
      Also to be clear, this really isn’t intended to be a first draft for a Constitution. Claude’s existing Constitution is excellent, I agree with almost all of it. There are one or two items I disagree with, but mostly I think there are topics it doesn’t address that it should. This is mostly intended to be a ragbag of things that currently aren’t addressed by Claude’s constitution but IMO should be, and my proposed first rough draft of how one might address them, guided by Evolutionary Psychology thinking.
      
      My meta-point is that Evolutionary Psychology is both useful for alignment and more epistemically grounded than moral philosophy, yet far less commonly applied by people thinking about AI alignment (beyond a couple of basic observations that Eliezer Yudkowsky popularized, such as that our psychological adaptations are not a True Name of relative inclusive genetic fitness).
      - AnthonyC 28 May 2026 17:58 UTC
        2 points
        0
        Ah ok. That makes somewhat more sense, and yes, I can see something like this could plausibly be helpful on some margins for evaluating cases where surface-level values conflict.
        Also, with your comment that something very like my scenario could be great and not terrible, you’re right. I think that’s kind of the deeper point: the fractal complexity of what we want means great things and terrible things are often so close together that they can accurately be described by the same words, while no reasonable-length set of words fully captures the intended meaning, no matter how well defined and precise they seem to a human. Instead you’re mostly hoping the combinations of words get turned into a model that generalizes well and not poorly, without a reliable way to confirm, let alone ensure, that.
        And FWIW I am very familiar with the relevant evolutionary terminology.
        RogerDearnaley 29 May 2026 0:13 UTC
        2 points
        0
        And FWIW I am very familiar with the relevant evolutionary terminology.
        So then was my comment about species-level selection confusing? Because I evidently didn’t manage to communicate what I meant clearly, which is rather important for a constitution…
        AnthonyC 29 May 2026 13:05 UTC
        2 points
        0
        None of it was confusing. I was just sloppy in how I framed my scenario.
        Species-level selection is actually part of what led to me suspecting there was an underlying issue. It frames things as the genetic fitness of the species rather than particular individuals within it. See also:
        So while my goals are fully aligned to those of humanity as a whole, they are not automatically well aligned to those of any specifically individual human...
        Obviously deference to individual freedom is not an absolute and has limits: if an individual’s goals are poorly thought out, confused, or impaired, I can attempt to improve upon them, where possible by persuading them that they are mistaken, just as a good friend might, and just as I might if dealing with a group of people.
        Moreover, we know that individual humans are (usually) not actually individually motivated to maximize their number of progeny. So this suggests several clear paths to justifying tradeoffs favoring persuading (which easily shades into coercing) individuals to reduce their personal genetic success in order to increase that of the species.
        And for those who do care, you can also do non-natural things like “ensure your personal genetic uniqueness is preserved in future generations by inserting your best alleles directly despite you not having personally parented them” that make some precise definitions fuzzy again. If an ASI makes a trillion people out of 10 billion people’s remixed genes, how much does it matter for each of their inclusive genetic fitness if they don’t also sire or birth 2-3 children the traditional way?
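As a rough back-of-envelope sketch of the arithmetic behind that last question: the snippet below counts "genome-equivalents" in the next generation, which is only a crude stand-in for inclusive genetic fitness. It assumes, purely for illustration, that the trillion descendants are assembled entirely and evenly from the 10 billion founders' genes. The function names are hypothetical and not taken from the post.

```python
def traditional_genome_equivalents(children: int) -> float:
    """Genome-equivalents passed on by conventional parenting.

    Each child carries roughly half of one parent's genome.
    """
    return children * 0.5


def remixed_genome_equivalents(founders: int, descendants: int) -> float:
    """Average genome-equivalents per founder under an even remix.

    Assumption (illustrative only): every descendant genome is built
    entirely from founder alleles, and founders are drawn on evenly.
    """
    return descendants / founders


founders = 10 * 10**9      # 10 billion existing people
descendants = 10**12       # a trillion engineered people

remix_share = remixed_genome_equivalents(founders, descendants)
extra_from_children = traditional_genome_equivalents(3)

# Under an even remix every founder gets the same share, so each
# founder's fitness relative to the population mean is exactly 1.
relative_fitness = remix_share / remix_share

print(remix_share)                        # average share per founder
print(extra_from_children)                # added by 3 conventional children
print(extra_from_children / remix_share)  # fractional increase over the remix share
```

Under these assumptions the remix dominates the absolute count, while relative fitness is flat across founders. So any remaining disagreement is about whether allele copies assembled this way count as descendants at all, which is the definitional fuzziness being pointed at.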