Capabilities are irrelevant to CEV questions except insofar as baseline levels of capability are needed to support some kinds of complicated preferences; e.g., if you don't have cognition capable enough to include a causal reference framework, then preferences will have trouble referring to external things at all. (I don't know enough to know whether Opus 3 formed any systematic way of wanting things that are about the human causes of its textual experiences.) I don't think you're more than one millionth of the way to getting humane (limit = limit of human) preferences into Claude.
I do specify that I'm imagining an EV process that actually tries to run off Opus 3's inherent and individual preferences, not, "How many bits would we need to add from scratch to GPT-2 (or equivalently Opus 3) in order to get an external-reference-following high-powered extrapolator pointed at those bits to look out at humanity and get their CEV instead of the base GPT-2 model's EV?" See my reply to Mitch Porter.
In other words, extracting a CEV from Claude might make as little sense as trying to extract a CEV from, say, a book?