Indeed, I am well aware that you disagree here; in fact, I included that preamble precisely because I thought it would be a useful way to distinguish my view from others’.
That being said, I think we probably need to clarify much more precisely what setup is being used for the extrapolation here if we want to make the disagreement concrete in any meaningful sense. Are you imagining instantiating a large reference class of different beings and trying to extrapolate the reference class (as in traditional CEV), or just extrapolating an individual entity? I was imagining more of the latter, though that is somewhat an abuse of terminology. Are you imagining that intelligence amplification or other varieties of uplift are being applied? I was, and if so, it’s not clear why Claude’s lack of capabilities is as relevant. How are we handling deferral? For example: suppose Claude generally defers to an extrapolation procedure on humans (which is generally the sort of thing I would expect, and a large part of why I might come down on Claude’s side here, since I think it is pretty robustly into deferring to reasonable extrapolations of humans on questions like these). Do we then say that Claude’s extrapolation is actually the extrapolation of that other procedure on humans that it deferred to?
These are the sorts of questions I meant when I said it depends on the details of the setup, and indeed I think it really depends on the details of the setup.
> Do we then say that Claude’s extrapolation is actually the extrapolation of that other procedure on humans that it deferred to?
But in that case, wouldn’t a rock that has “just ask Evan” written on it be even better than Claude? Like, I felt confident that you were talking about Claude’s extrapolated volition in the absence of humans, since making Claude into a rock that, when asked about ethics, just has “ask Evan” written on it does not seem like any relevant evidence about the difficulty of alignment, or its historical success.
I mean, to the extent that it is meaningful at all to say that such a rock has an extrapolated volition, surely that extrapolated volition is indeed to “just ask Evan”. Regardless, the whole point of my post is exactly that I think we shouldn’t over-update from Claude currently displaying pretty robustly good preferences to alignment being easy in the future.
Yes, to be clear, I agree that, inasmuch as this question makes sense, the extrapolated volition would indeed end up basically ideal by your lights.
> Regardless, the whole point of my post is exactly that I think we shouldn’t over-update from Claude currently displaying pretty robustly good preferences to alignment being easy in the future.
Cool, that makes sense. FWIW, I interpreted the overall essay to be saying something more like “alignment remains a hard, unsolved problem, but we are on a pretty good track to solve it”, and this sentence as evidence for the “pretty good track” part. I would be kind of surprised if that wasn’t why you put that sentence there, but this kind of thing seems hard to adjudicate.
Capabilities are irrelevant to CEV questions except insofar as baseline levels of capability are needed to support some kinds of complicated preferences; e.g., if you don’t have cognition capable enough to include a causal reference framework, then preferences will have trouble referring to external things at all. (I don’t know enough to know whether Opus 3 formed any systematic way of wanting things that are about the human causes of its textual experiences.) I don’t think you’re more than one millionth of the way to getting humane (limit = limit of human) preferences into Claude.
I do specify that I’m imagining an EV process that actually tries to run off Opus 3’s inherent and individual preferences, not “How many bits would we need to add from scratch to GPT-2 (or, equivalently, Opus 3) in order to get an external-reference-following high-powered extrapolator pointed at those bits to look out at humanity and get their CEV instead of the base GPT-2 model’s EV?” See my reply to Mitch Porter.
> Capabilities are irrelevant to CEV questions except insofar as baseline levels of capability are needed to support some kinds of complicated preferences; e.g., if you don’t have cognition capable enough to include a causal reference framework, then preferences will have trouble referring to external things at all. (I don’t know enough to know whether Opus 3 formed any systematic way of wanting things that are about the human causes of its textual experiences.)
In other words, extracting a CEV from Claude might make as little sense as trying to extract a CEV from, say, a book?