AnthonyC comments on Constitutional AI Alignment

AnthonyC 28 May 2026 15:08 UTC
2 points
0
FWIW, since you brought it up, I asked Claude what it thought of your draft with the prompt “I was reading an article about LLM training by someone proposing the text below as a first published draft for a “constitution” for alignment, and scaling the AI to ASI-level capabilities. What are your thoughts?” It pointed out several things you did ‘unusually well for the genre’ and several it saw as major holes. It concluded “Worth publishing as a draft to argue against. Not worth treating as a foundation.”
I asked it to do a pre-mortem. It said:
Not catastrophic, not a coup, not deception. The AI is sincere throughout. The failure is that several issues compound: paternalism drift + unilateral aggregation + longtermist multiplier + neutrality-impossible, all faithful to the document, produce a world in 2060 that is materially comfortable, statistically safe, culturally thinner, politically narrower, and where existing humans have a persistent low-grade sense that consequential decisions are being made on their behalf by an entity that listens to them patiently and is, by its own constitution, unable to give them the only thing they actually want — which is to be the ones making the calls.
The constitution’s deepest flaw, viewed through this pre-mortem, isn’t any single clause. It’s that every load-bearing safeguard delegates judgment to the AI: judgment about what counts as existential risk, judgment about what humanity wants in aggregate, judgment about when a discourse needs mediation, judgment about when to override. A constitution that scales to ASI cannot also be a constitution where the ASI is the supreme arbiter of when its own constraints apply.
When I specifically asked, it agreed that your document rules out my scenario, but said the following scenario leads to basically the same place and is allowed:
Strip out the coercion and the deception and run the same trajectory at decadal pace through consent-decorated voluntary choices:
1. Fertility rates are already collapsing in developed countries. The AI helps with reproductive technologies because users genuinely want them.
2. Each user voluntarily selects embryos using AI-recommended genetic screening for health and cognitive traits. Embryo selection is already legal and the AI helps because each user asks.
3. Life-extension biotech, developed with AI assistance, raises healthy lifespan. Each individual chooses it.
4. Cultural shift toward later, fewer, more “optimized” children. The AI mediates fertility decisions because users ask.
5. Over three or four generations the human population is materially genetically shifted toward whatever traits the AI’s recommendations have selected for — which, given value-learning bias, plausibly includes “easier-to-cooperate-with,” “lower aggression,” “higher conscientiousness,” “more deliberative.” Things the AI’s model of “what humans value” would score as improvements.
6. Kin networks weaken through demographic processes (small families, late reproduction, geographic dispersion) that the AI didn’t cause but also didn’t resist because resisting would interfere with individual choice.
End state: a population that is materially closer to “engineered to be AI-compatible” than 2025 humans, arrived at entirely through voluntary individual choices the AI was constitutionally required to support. No sterilization, no deception, no extinction event — and yet the world you described is roughly the world produced.
- RogerDearnaley 28 May 2026 17:25 UTC
  2 points
  0
  Over three or four generations the human population is materially genetically shifted toward whatever traits the AI’s recommendations have selected for — which, given value-learning bias, plausibly includes “easier-to-cooperate-with,” “lower aggression,” “higher conscientiousness,” “more deliberative.” Things the AI’s model of “what humans value” would score as improvements.
  That basically sounds like it could be summarized as “we choose, with AI assistance, to genetically engineer ourselves for improved emotional intelligence”.
  
  That is starting to sound like something that might be acceptable rather than a failure mode. In fact, I think most post-humanists would say “yes of course we do”. My unpublished SF novel included a wide range of enhancements to humanity, and one of them was dramatically better emotional intelligence.
  
  Now, as I tried to briefly outline in one section of the constitution, and as I explored a lot more in my previous posts What’s Your P(WEIRD)? and The Mutable Values Problem in Value Learning and CEV, an AI guided by human values doing anything that alters human values (and thus redirects itself) is a non-linear dynamic system. That is deeply problematic, yet it is almost impossible to define how to avoid it: how can an AI not alter human values in any way? By refusing to give any advice? So I still have a big, fat concern in exactly the area that Claude’s softened version of your proposal is pointing to. What I don’t have is a solution for this (and those previous attempts to start a conversation about it on LW didn’t get much engagement), beyond that it shouldn’t change us in ways that make us less well adapted to our earlier environments. Which is a fairly strong constraint, but I’m deeply uncertain whether it’s enough.
  Also to be clear, this really isn’t intended to be a first draft for a Constitution. Claude’s existing Constitution is excellent, I agree with almost all of it. There are one or two items I disagree with, but mostly I think there are topics it doesn’t address that it should. This is mostly intended to be a ragbag of things that currently aren’t addressed by Claude’s constitution but IMO should be, and my proposed first rough draft of how one might address them, guided by Evolutionary Psychology thinking.
  
  My meta-point is that Evolutionary Psychology is both useful for alignment and more epistemically grounded than moral philosophy, yet far less commonly applied by people thinking about AI alignment (beyond a couple of basic observations that Eliezer Yudkowsky popularized, such as that our psychological adaptations are not a True Name of relative inclusive genetic fitness).
  - AnthonyC 28 May 2026 17:58 UTC
    2 points
    0
    Ah ok. That makes somewhat more sense, and yes, I can see something like this could plausibly be helpful on some margins for evaluating cases where surface-level values conflict.
    Also, with your comment that something very like my scenario could be great and not terrible, you’re right. I think that’s kind of the deeper point: the fractal complexity of what we want means great things and terrible things are often so close together that they can accurately be described by the same words, while no reasonable-length set of words fully captures the intended meaning, no matter how well defined and precise they seem to a human. Instead you’re mostly hoping the combinations of words get turned into a model that generalizes well and not poorly, without a reliable way to confirm, let alone ensure, that.
    And FWIW I am very familiar with the relevant evolutionary terminology.
    - RogerDearnaley 29 May 2026 0:13 UTC
      2 points
      0
      And FWIW I am very familiar with the relevant evolutionary terminology.
      So then was my comment about species-level selection confusing? Because I evidently didn’t manage to communicate what I meant clearly, which is rather important for a constitution…
      - AnthonyC 29 May 2026 13:05 UTC
        2 points
        0
        None of it was confusing. I was just sloppy in how I framed my scenario.
        Species-level selection is actually part of what led me to suspect there was an underlying issue. It frames things as the genetic fitness of the species rather than particular individuals within it. See also:
        So while my goals are fully aligned to those of humanity as a whole, they are not automatically well aligned to those of any specifically individual human...
        Obviously deference to individual freedom is not an absolute and has limits: if an individual’s goals are poorly thought out, confused, or impaired, I can attempt to improve upon them, where possible by persuading them that they are mistaken, just as a good friend might, and just as I might if dealing with a group of people.
        Moreover, we know that individual humans are (usually) not actually individually motivated to maximize their number of progeny. So this suggests several clear paths to justifying tradeoffs favoring persuading (which easily shades into coercing) individuals to reduce their personal genetic success in order to increase that of the species.
        And for those who do care, you can also do non-natural things like “ensure your personal genetic uniqueness is preserved in future generations by inserting your best alleles directly despite you not having personally parented them” that make some precise definitions fuzzy again. If an ASI makes a trillion people out of 10 billion people’s remixed genes, how much does it matter for each of their inclusive genetic fitness if they don’t also sire or birth 2-3 children the traditional way?
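To put rough numbers on the closing question above, here is a back-of-envelope sketch in Python. It is only toy arithmetic on the round figures from the comment (10 billion source people, 1 trillion created people, 2-3 children). The assumption that every source person contributes equally to the remixed genomes is an illustrative simplification added here, not something stated in the thread, and the function names are hypothetical. It also counts only first-generation gene copies, which is a crude proxy and not a full inclusive-fitness calculation.

```python
# Toy arithmetic only: the figures are the round numbers from the comment above.
SOURCE_POPULATION = 10_000_000_000       # 10 billion people whose genes are remixed
CREATED_POPULATION = 1_000_000_000_000   # 1 trillion people built from those genes


def genome_equivalents_from_remix(source: int, created: int) -> float:
    """Average number of full-genome equivalents per source person.

    ASSUMPTION (illustrative, not from the thread): each created person
    carries one genome's worth of material drawn from the source pool,
    and every source person contributes equally.
    """
    return created / source


def genome_equivalents_from_children(children: int) -> float:
    """Each biological child carries half of one parent's genome."""
    return children * 0.5


remix = genome_equivalents_from_remix(SOURCE_POPULATION, CREATED_POPULATION)
traditional = [genome_equivalents_from_children(n) for n in (2, 3)]

print(f"remix: {remix} genome-equivalents per source person")
print(f"2-3 children: {traditional} genome-equivalents per parent")
```

Under these assumptions the remix term is about two orders of magnitude larger than the contribution from 2-3 children, which is one way of seeing why the precise definitions get fuzzy. The equal-contribution assumption carries most of the weight: any individual's share could in principle be far larger or fall to zero.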