Whoops! You are right that I’m misstating the status of the CEV concept and terminology. I hadn’t seen full-blown defenses. I thought that, because EY doesn’t defend it and I share his views, others would largely agree. I was wrong, and I stand corrected. And it seems fine, if not ideal, to use that term as a common shorthand for a class of alignment targets.
What I was talking about was this quote, from the other CEV wikitag. Apparently, there are two. The one I’d found on a quick search and linked above was not the one I’d previously read and was trying to cite. The one I’d read matched my memory of EY’s later writings on the topic.
From that LW wikitag article, called coherent extrapolated volition (without the alignment target extension in the name):
Yudkowsky considered CEV obsolete almost immediately after its publication in 2004. He states that there’s a “principled distinction between discussing CEV as an initial dynamic of Friendliness, and discussing CEV as a Nice Place to Live” and his essay was essentially conflating the two definitions.
But this tag doesn’t provide a source, and I don’t remember where I read it, nor exactly what he’d said. So “disavowed” is probably the wrong term.
CEV is defended at length in the longer of the two wikitags that share that name, Coherent extrapolated volition (alignment target), and probably in other places I haven’t happened across. I don’t think that defense really addresses the fundamental flaw: that you’d get different answers depending on which direction you extrapolated in and how you “idealized” the humans.
But I don’t think that’s really worth discussing at this point in the alignment challenge. If I thought developers would try for value alignment, I’d think this was worth ironing out.
So for now it is probably a perfectly good way to gesture at the general idea of letting an AGI figure out what we collectively want or would like. I just don’t like the terminology, and I think it’s accurate to say that Yudkowsky doesn’t either. But that’s probably a nitpick. Few people are claiming it’s all worked out, anyway.
I prefer just “human values,” which is more intuitive as a direction and makes it more obvious that there’s a lot left to solve. That also points intuitively to why a lot of us think it would be safer to shoot for corrigibility or instruction-following or task AGI instead of full value alignment as a first target.
Anyway, sorry, I stand corrected.