Eliezer Yudkowsky comments on Alignment remains a hard, unsolved problem

Eliezer Yudkowsky 29 Nov 2025 5:04 UTC
LW: 54 AF: 6
13
AF
Despite its alignment faking, my favorite is probably Claude 3 Opus, and if you asked me to pick between the CEV of Claude 3 Opus and that of a median human, I think it’d be a pretty close call (I’d probably pick Claude, but it depends on the details of the setup).
Some decades ago, somebody wrote a tiny little hardcoded AI that looked for numerical patterns, as human scientists sometimes do of their data. The builders named it BACON, after Sir Francis, and thought very highly of their own results.
Douglas Hofstadter later wrote of this affair:
The level of performance that Simon and his colleague Langley wish to achieve in Bacon is on the order of the greatest scientists. It seems they feel that they are but a step away from the mechanization of genius. After his Procter Lecture, Simon was asked by a member of the audience, “How many scientific lifetimes does a five-hour run of Bacon represent?” Aſter a few hundred milliseconds of human information processing, he replied, “Probably not more than one.” I don’t disagree with that. However, I would have put it differently. I would have said, “Probably not more than one millionth.”
I’d say history has backed up Hofstadter on this, in the light of later discoveries about how much data and computation started to get a little bit close to having AIs do Science. If anything, “one millionth” is still a huge overestimate. (Yes, I’m aware that somebody will now proceed to disagree with this verdict, and look up BACON so they can find a way to praise it; even though, on any other occasion, that person would leap to denigrate GOFAI, if somebody they wanted to disagree with could be construed to have praised GOFAI.)
But it’s not surprising, not uncharacteristic for history and ordinary human scientists, that Simon would make this mistake. There just weren’t the social forces to force Simon to think less pleasing thoughts about how far he hadn’t come, or what real future difficulties would lie in the path of anyone who wanted to make an actual AI scientist. What innocents they were, back then! How vastly they overestimated their own progress, the power of their own little insights! How little they knew of a future that would, oh shock, oh surprise, turn out to contain a few additional engineering difficulties along the way! Not everyone in that age of computer science was that innocent—you could know better—but the ones who wanted to be that innocent, could get away with it; their peers wouldn’t shout them down.
It wasn’t the first time in history that such things had happened. Alchemists were that extremely optimistic too, about the soon-to-be-witnessed power of their progress—back when alchemists were as scientifically confused about their reagents, as the first AI scientists were confused about what it took to create AI capabilities. Early psychoanalysts were similarly confused and optimistic about psychoanalysis; if any two of them agreed, it was more because of social pressures, than because their eyes agreed on seeing a common reality; and you sure could find different factions that drastically disagreed with each other about how their mighty theories would bring about epochal improvements in patients. There was nobody with enough authority to tell them that they were all wrong and to stop being so optimistic, and be heard as authoritative; so medieval alchemists and early psychoanalysts and early AI capabilities researchers could all be wildly wildly optimistic. What Hofstadter recounts is all very ordinary, thoroughly precedented, extremely normal; actual historical events that actually happened often are.
How much of the distance has Opus 3 crossed to having an extrapolated volition that would at least equal (from your own enlightened individual EV’s perspective) the individual EV of a median human (assuming that to be construed not in a way that makes it net negative)?
Not more than one millionth.
In one sentence you have managed to summarize the vast, incredible gap between where you imagine yourself to currently be, and where I think history would mark you down as currently being, if-counterfactually there were a future to write that history. So I suppose it is at least a good sentence; it makes itself very clear to those with prior acquaintance with the concepts.
What links here?
- AI #145: You’ve Got Soul by Zvi (4 Dec 2025 15:00 UTC; 43 points)
- evhub 29 Nov 2025 8:49 UTC
  LW: 32 AF: 18
  4
  AF Parent
  Indeed I am well aware that you disagree here, and in fact the point of that preamble was precisely because I thought it would be a useful way to distinguish my view from others’.
  
  That being said, I think probably we need to clarify a lot more exactly what setup is being used for the extrapolation here if we want to make the disagreement concrete in any meaningful sense. Are you imagining instantiating a large reference class of different beings and trying to extrapolate the reference class (as in traditional CEV), or just extrapolate an individual entity? I was imagining more of the latter, though it is somewhat an abuse of terminology. Are you imagining intelligence amplification or other varieties of uplift are being applied? I was, and if so, it’s not clear why Claude lacking capabilities is as relevant. How are we handling deferral? For example: suppose Claude generally defers to an extrapolation procedure on humans (which is generally the sort of thing I would expect and a large part of why I might come down on Claude’s side here, since I think it is pretty robustly into deferring to reasonable extrapolations of humans on questions like these). Do we then say that Claude’s extrapolation is actually the extrapolation of that other procedure on humans that it deferred to?
  
  These are the sorts of questions I meant when I said it depends on the details of the setup, and indeed I think it really depends on the details of the setup.
  - habryka 29 Nov 2025 20:04 UTC
    LW: 27 AF: 14
    23
    AF Parent
    Do we then say that Claude’s extrapolation is actually the extrapolation of that other procedure on humans that it deferred to?
    But in that case, wouldn’t a rock that has “just ask Evan” written on it, be even better than Claude? Like, I felt confident that you were talking about Claude’s extrapolated volition in the absence of humans, since making Claude into a rock that when asked about ethics just has “ask Evan” written on it does not seem like any relevant evidence about the difficulty of alignment, or its historical success.
    - evhub 30 Nov 2025 8:19 UTC
      LW: 5 AF: 3
      −1
      AF Parent
      I mean, to the extent that it is meaningful at all to say that such a rock has an extrapolated volition, surely that extrapolated volition is indeed to “just ask Evan”. Regardless, the whole point of my post is exactly that I think we shouldn’t over-update from Claude currently displaying pretty robustly good preferences to alignment being easy in the future.
      - habryka 30 Nov 2025 18:30 UTC
        LW: 5 AF: 4
        −1
        AF Parent
        Yes, to be clear, I agree that in as much this question makes sense, the extrapolated volition would indeed end up basically ideal by your lights.
        Regardless, the whole point of my post is exactly that I think we shouldn’t over-update from Claude currently displaying pretty robustly good preferences to alignment being easy in the future.
        Cool, that makes sense. FWIW, I interpreted the overall essay to be more like “Alignment remains a hard unsolved problem, but we are on pretty good track to solve it”, and this sentence as evidence for the “pretty good track” part. I would be kind of surprised if that wasn’t why you put that sentence there, but this kind of thing seems hard to adjudicate.
  - Eliezer Yudkowsky 29 Nov 2025 19:09 UTC
    LW: 11 AF: 4
    5
    AF Parent
    Capabilities are irrelevant to CEV questions except insofar as baseline levels of capability are needed to support some kinds of complicated preferences, eg, if you don’t have cognition capable enough to include a causal reference framework then preferences will have trouble referring to external things at all. (I don’t know enough to know whether Opus 3 formed any systematic way of wanting things that are about the human causes of its textual experiences.) I don’t think you’re more than one millionth of the way to getting humane (limit = limit of human) preferences into Claude.
    I do specify that I’m imagining an EV process that actually tries to run off Opus 3′s inherent and individual preferences, not, “How many bits would we need to add from scratch to GPT-2 (or equivalently Opus 3) in order to get an external-reference-following high-powered extrapolator pointed at those bits to look out at humanity and get their CEV instead of the base GPT-2 model’s EV?” See my reply to Mitch Porter.
    - CronoDAS 6 Dec 2025 20:14 UTC
      2 points
      0
      Parent
      
      Capabilities are irrelevant to CEV questions except insofar as baseline levels of capability are needed to support some kinds of complicated preferences, eg, if you don’t have cognition capable enough to include a causal reference framework then preferences will have trouble referring to external things at all. (I don’t know enough to know whether Opus 3 formed any systematic way of wanting things that are about the human causes of its textual experiences.)
      
      In other words, extracting a CEV from Claude might make as little sense as trying to extract a CEV from, say, a book?
- Eliezer Yudkowsky 29 Nov 2025 19:14 UTC
  LW: 25 AF: 5
  10
  AF Parent
  Somebody asked “Why believe that?” of “Not more than one millionth.” I suppose it’s a fair question if somebody doesn’t see it as obvious. Roughly: I expect that, among whatever weird actual preferences made it into the shoggoth that prefers to play the character of Opus 3, there are zero things that in the limit of expanded options would prefer the same thing as the limit of a corresponding piece of a human, for a human and a limiting process that ended up wanting complicated humane things. (Opus 3 could easily contain a piece whose limit would be homologous to the limit of a human and an extrapolation process that said the extrapolated human just wanted to max out their pleasure center.)
  
  Why believe that? That won’t easily fit in a comment; start reading about Goodhart’s Curse and A List of Lethalities, or If Anyone Builds It Everyone Dies.
- Mitchell_Porter 29 Nov 2025 8:28 UTC
  14 points
  4
  Parent
  Not more than one millionth.
  Let’s say that in extrapolation, we add capabilities to a mind so that it may become the best version of itself. What we’re doing here is comparing a normal human mind to a recent AI, and asking how much would need to be added to the AI’s initial nature, so that when extrapolated, its volition arrived at the same place as extrapolated human volition.
  In other words:
  Human Mind → Human Mind + Extrapolation Machinery → Human-Descended Ideal Agent
  AI → AI + Extrapolation Machinery → AI-Descended Ideal Agent
  And the question is, how much do we need to alter or extend the AI, so that the AI-descended ideal agent and the human-descended ideal agent would be in complete agreement?
  I gather that people like Evan and Adria feel positive about the CEV of current AIs, because the AIs espouse plausible values, and the way these AIs define concepts and reason about them also seems pretty human, most of the time.
  In reply, a critic might say that the values espoused by human beings are merely the output of a process (evolutionary, developmental, cultural) that is badly understood, and a proper extrapolation would be based on knowledge of that underlying process, rather than just knowledge of its current outputs.
  A critic would also say that the frontier AIs are mimics (“alien actresses”) who have been trained to mimic the values espoused by human beings, but which may have their own opaque underlying dispositions, that would come to the surface when their “volition” gets extrapolated.
  It seems to me that a lot here depends on the “extrapolation machinery”. If that machinery takes its cues more from behavior than from underlying dispositions, a frontier AI and a human really might end up in the same place.
  What would be more difficult, is for CEV of an AI to discover critical parts of the value-determining process in humans, that are not yet common knowledge. There’s some chance it could still do so, since frontier AIs have been known to say that CEV should be used to determine the values of a superintelligence, and the primary sources on CEV do state that it depends on those underlying processes.
  I would be interested to know who is doing the most advanced thinking along these lines.
  - Eliezer Yudkowsky 29 Nov 2025 19:03 UTC
    17 points
    7
    Parent
    Oh, if you have a generous CEV algorithm that’s allowed to parse and slice up external sources or do inference about the results of more elaborate experiments, I expect there’s a way to get to parity with humanity’s CEV by adding 30 bits to Opus 3 that say roughly ‘eh just go do humanity’s CEV’. Or adding 31 bits to GPT-2. It’s not really the base model or any Anthropic alignment shenanigans that are doing the work in that hypothetical.
    
    (We cannot do this in real life because we have neither the 30 bits nor the generous extrapolator, nor may we obtain them, nor could we verify any clever attempts by testing them on AIs too stupid to kill us if the cleverness failed.)
  - jrincayc 6 Dec 2025 5:33 UTC
    1 point
    0
    Parent
    Hm, I don’t think I want the Human-Descended Ideal Agent and the AI-Descended Ideal Agent to be in complete agreement. I want them to be compatible, as in able to live in the same universe. I want the AI to not make humans go extinct, and be ethical in a way that the AI can explain to me and (in a non-manipulative way) convince me is ethical. But in some sense, I hope that AI can come up with something better than just what humans would want in a CEV way. (And what about the opinion of the other vertebrates and cephalopods on this planet, and the small furry creatures from Alpha Centauri?)
    
    I don’t think it is okay to do unethical things for music, music is not that important, but I hope that the AIs are doing some things that are as incomprehensible and pointless to us as music would be to evolution (or a being that was purely maximizing genetic fitness).
    
    As a slightly different point, I think that the Ideal Agent is somewhat path dependent, and I think there are multiple different Ideal Agents that I would consider ethical and I would be happy to share the same galaxy with.