habryka comments on [Advanced Intro to AI Alignment] 2. What Values May an AI Learn? — 4 Key Problems

habryka 3 Jan 2026 2:33 UTC
11 points
11
I did want to note that any mention of CEV should include a reference to how its creator almost immediately disavowed it. It gets tossed around as an alignment idea, but nobody really defends it. This is right in the CEV LW wikitag.
What… are you talking about? Eliezer has definitely not “disavowed” CEV. This is not “right in the CEV LW wikitag” and many people, including me, would be happy to defend CEV as a useful thing to think about.
Mostly as a valuable societal Schelling point that serves as a baseline pointer for how we can eventually distribute gains from AI fairly, and what a good outcome from successfully aligning superintelligent AI systems looks like.
- Ben Pace 3 Jan 2026 2:48 UTC
  10 points
  2
  Parent
  My guess is that this is a misreading of Eliezer’s stance that you should not aim for CEV as your first goal with a superintelligence that you think is plausibly aligned. Quoting the wiki:
  CEV is rather complicated and meta and hence not intended as something you’d do with the first AI you ever tried to build.
  - TAG 3 Jan 2026 3:43 UTC
    3 points
    0
    Parent
    Also:-
    
    CEV seems much much more difficult than strawberry alignment and I have written it off as a potential option for a baby’s first try at constructing superintelligence.
    
    To be clear, I also expect that strawberry alignment is too hard for these babies and we’ll just die. But things can always be even more difficult, and with targeting CEV on a first try, it sure would be.
    
    There’s zero room, there is negative room, to give away to luxury targets like CEV. They’re not even going to be able to do strawberry alignment, and if by some miracle we were able to do strawberry alignment and so humanity survived, that miracle would not suffice to get CEV right on the first try.
- Towards_Keeperhood 3 Jan 2026 8:42 UTC
  6 points
  0
  Parent
  There is another LW wikitag here, which includes:
  Secondly, the possibility that human values may not converge. Yudkowsky considered CEV obsolete almost immediately after its publication in 2004. He states that there’s a “principled distinction between discussing CEV as an initial dynamic of Friendliness, and discussing CEV as a Nice Place to Live” and his essay was essentially conflating the two definitions.
  But I totally agree that CEV is a useful concept to have. Also Yudkowsky’s later writing (like the Arbital post presumably around 2016) should trump his earlier take in 2004. Or maybe the meaning of CEV shifted a bit over the years from sth more specific to a very indirect pointer. Idk, I don’t remember the original CEV paper well.
  - habryka 4 Jan 2026 1:54 UTC
    3 points
    0
    Parent
    Hrmm, what a weird description. The full quote of what it’s quoting is:
    Once we have something that approximates a volition of the human species, that volition then has the chance to write its own superintelligence, optimization process, legislative procedure, god, or constitution. I try not to get caught up on CEV as a model of the actual future, even though it seems like a Nice Place To Live. The purpose of CEV as an initial dynamic is not to be the solution, but to ask what solution we want.
    I really don’t know where that “and his essay was essentially conflating the two definitions” part comes from.
- Seth Herd 3 Jan 2026 20:34 UTC
  2 points
  0
  Parent
  Whoops! You are right that I’m mis-stating the status of the CEV concept and terminology. I hadn’t seen full-blown defenses. I thought that because EY doesn’t defend it, and I share his views, others would largely agree. I was wrong, and I stand corrected. And it seems fine, if not ideal, to use that term as a common shorthand for a class of alignment targets.
  What I was talking about was this quote, from the other CEV wikitag. Apparently, there are two. The one I’d found on a quick search and linked above was not the one I’d previously read and was trying to cite. The one I’d read matched my memory of EY’s later writings on the topic.
  
  From that LW wikitag article, called coherent extrapolated volition (without the alignment target extension in the name):
  Yudkowsky considered CEV obsolete almost immediately after its publication in 2004. He states that there’s a “principled distinction between discussing CEV as an initial dynamic of Friendliness, and discussing CEV as a Nice Place to Live” and his essay was essentially conflating the two definitions.
  But this tag doesn’t provide a source, and I don’t remember where I read it, nor exactly what he’d said. So “disavowed” is probably the wrong term.
  CEV is defended at length in the longer of the two wikitags that share that name, Coherent extrapolated volition (alignment target), and probably in other places I haven’t happened across. I don’t think that defense really addresses the fundamental flaw, that you’d get different answers if you extrapolated in different directions, and “idealized” the humans in different ways.
  But I don’t think that’s really worth discussing at this point in the alignment challenge. If I thought developers would try for value alignment, I’d think this was worth ironing out.
  So for now it is probably a perfectly good way to gesture at the general idea of letting an AGI figure out what we collectively want or would like. I just don’t like the terminology, and I think it’s accurate to say that Yudkowsky doesn’t either. But that’s probably a nitpick. Few people are claiming it’s all worked out, anyway.
  I prefer just “human values” which is more intuitive as a direction and more obvious that there’s a lot left to solve. That also points intuitively to why a lot of us think it would be safer to shoot for corrigibility or instruction-following or task AGI instead of full value alignment as a first target.
  
  Anyway, sorry, I stand corrected.