Seth Herd comments on [Advanced Intro to AI Alignment] 2. What Values May an AI Learn? — 4 Key Problems

Seth Herd 2 Jan 2026 20:53 UTC
2 points
−2
This is great! I will use it as a reference for clearly and quickly explaining some basics of why alignment is hard. Big upvote.
I have quibbles with a few things here and there, but that’s minor.
I did want to note that any mention of CEV should include a reference to how its creator almost immediately disavowed it. It gets tossed around as an alignment idea, but nobody really defends it. This is right in the CEV LW wikitag.
I think the suggestion of a “long reflection” is a more accurate statement of the modern best-idea-for-alignment-target. It’s related in spirit to the CEV ideal, but it leaves more power in human hands.
Relatedly, you implicitly equate alignment with value alignment. A lot of the field does this a lot of the time. But it’s arguably not just incomplete, but not what developers will actually shoot for. Arguably, Instruction-following AGI is easier and more likely than value aligned AGI. The same applies to corrigibility-first alignment targets. Developers both want and will have an easier time marketing “AI that does what we tell it” than “AI that will protect humanity’s interests until the heat-death of the universe”.
And “do what this guy means by what he says” is not trivial, but simpler than “figure out what this guy wants over all possible futures”, let alone applying that to all of humanity as in CEV. This does create literal genie problems, but not giving commands as vague and optimistic as “green the desert” seems like a bar that even a lemur-based intelligence can handle. There are more problems with instruction-following as an alignment target, but IMO they are fewer than those with value alignment.

Sorry for The Charge of the Hobby Horse :).

The current “make it nice and obedient sometimes but not others” is a very poorly thought out alignment target; even with realistic pessimism, we should expect some refinement before takeover-capable AGI.
- habryka 3 Jan 2026 2:33 UTC
  11 points
  11
  I did want to note that any mention of CEV should include a reference to how its creator almost immediately disavowed it. It gets tossed around as an alignment idea, but nobody really defends it. This is right in the CEV LW wikitag.
  What… are you talking about? Eliezer has definitely not “disavowed” CEV. This is not “right in the CEV LW wikitag” and many people, including me, would be happy to defend CEV as a useful thing to think about.
  Mostly as a valuable societal Schelling point that serves as a baseline pointer for how we can eventually distribute gains from AI fairly, and what a good outcome from successfully aligning superintelligent AI systems looks like.
  - Ben Pace 3 Jan 2026 2:48 UTC
    10 points
    2
    My guess is that this is a misreading of Eliezer’s stance that you should not aim for CEV as your first goal with a superintelligence that you think is plausibly aligned. Quoting the wiki:
    CEV is rather complicated and meta and hence not intended as something you’d do with the first AI you ever tried to build.
    - TAG 3 Jan 2026 3:43 UTC
      3 points
      0
      Also:
      
      CEV seems much much more difficult than strawberry alignment and I have written it off as a potential option for a baby’s first try at constructing superintelligence.
      
      To be clear, I also expect that strawberry alignment is too hard for these babies and we’ll just die. But things can always be even more difficult, and with targeting CEV on a first try, it sure would be.
      
      There’s zero room, there is negative room, to give away to luxury targets like CEV. They’re not even going to be able to do strawberry alignment, and if by some miracle we were able to do strawberry alignment and so humanity survived, that miracle would not suffice to get CEV right on the first try.
  - Towards_Keeperhood 3 Jan 2026 8:42 UTC
    6 points
    0
    There is another LW wikitag here, which includes:
    Secondly, the possibility that human values may not converge. Yudkowsky considered CEV obsolete almost immediately after its publication in 2004. He states that there’s a “principled distinction between discussing CEV as an initial dynamic of Friendliness, and discussing CEV as a Nice Place to Live” and his essay was essentially conflating the two definitions.
    But I totally agree that CEV is a useful concept to have. Also Yudkowsky’s later writing (like the Arbital post presumably around 2016) should trump his earlier take in 2004. Or maybe the meaning of CEV shifted a bit over the years from sth more specific to a very indirect pointer. Idk, I don’t remember the original CEV paper well.
    - habryka 4 Jan 2026 1:54 UTC
      3 points
      0
      Hrmm, what a weird description. The full quote of what it’s quoting is:
      Once we have something that approximates a volition of the human species, that volition then has the chance to write its own superintelligence, optimization process, legislative procedure, god, or constitution. I try not to get caught up on CEV as a model of the actual future, even though it seems like a Nice Place To Live. The purpose of CEV as an initial dynamic is not to be the solution, but to ask what solution we want.
      I really don’t know where that “and his essay was essentially conflating the two definitions” part comes from.
  - Seth Herd 3 Jan 2026 20:34 UTC
    2 points
    0
    Whoops! You are right that I’m mis-stating the status of the CEV concept and terminology. I hadn’t seen full-blown defenses. I thought that because EY doesn’t defend it, and I share his views, others would largely agree. I was wrong, and I stand corrected. And it seems fine, if not ideal, to use that term as a common shorthand for a class of alignment targets.
    What I was talking about was this quote, from the other CEV wikitag. Apparently, there are two. The one I’d found on a quick search and linked above was not the one I’d previously read and was trying to cite. The one I’d read matched my memory of EY’s later writings on the topic.
    
    From that LW wikitag article, called coherent extrapolated volition (without the alignment target extension in the name):
    Yudkowsky considered CEV obsolete almost immediately after its publication in 2004. He states that there’s a “principled distinction between discussing CEV as an initial dynamic of Friendliness, and discussing CEV as a Nice Place to Live” and his essay was essentially conflating the two definitions.
    But this tag doesn’t provide a source, and I don’t remember where I read it, nor exactly what he’d said. So “disavowed” is probably the wrong term.
    CEV is defended at length in the longer of the two wikitags that share that name, Coherent extrapolated volition (alignment target), and probably in other places I haven’t happened across. I don’t think that defense really addresses the fundamental flaw, that you’d get different answers if you extrapolated in different directions, and “idealized” the humans in different ways.
    But I don’t think that’s really worth discussing at this point in the alignment challenge. If I thought developers would try for value alignment, I’d think this was worth ironing out.
    So for now it is probably a perfectly good way to gesture at the general idea of letting an AGI figure out what we collectively want or would like. I just don’t like the terminology, and I think it’s accurate to say that Yudkowsky doesn’t either. But that’s probably a nitpick. Few people are claiming it’s all worked out, anyway.
    I prefer just “human values”, which is more intuitive as a direction and makes it more obvious that there’s a lot left to solve. That also points intuitively to why a lot of us think it would be safer to shoot for corrigibility or instruction-following or task AGI instead of full value alignment as a first target.
    
    Anyway, sorry, I stand corrected.
- Towards_Keeperhood 3 Jan 2026 9:04 UTC
  1 point
  0
  Relatedly, you implicitly equate alignment with value alignment.
  No, the first 3 difficulties I explain were mainly written with sth like helpfulness/instruction-following/DWIM in mind. I think corrigibility would be an even better target for RL based AI, although I didn’t want to need to explain it in this post. I wrote:
  Maybe “do what the human wants” seems simple to you? But what does this actually mean on a level that’s a bit closer to math: what might a critic evaluating this look like?
  The way I think of it, “what the human wants” refers to what the human would like if they knew all the consequences of the AI’s actions. The model will surely be able to make good predictions here, but the concept seems more complex than predicting whether the human will like some text. And predicting whether the human will like some text predicts reward even better!
  Maybe “follow instructions as intended” seems simple to you? Try to unpack it—how could the critic be constructed to evaluate how instruction-following a plan is, and how complex is this?
  Only the last problem was specifically about value alignment, because it looks like something like CEV might be needed for an AI whose intelligence can increase arbitrarily. Or at least it’s unclear whether helpfulness/instruction-following would generalize if you crank up intelligence very high.
  I totally agree that we currently shouldn’t aim for CEV.