> if you asked me to pick between the CEV of Claude 3 Opus and that of a median human, I think it’d be a pretty close call (I’d probably pick Claude, but it depends on the details of the setup)
I suspect this is a misuse of the CEV concept. CEV is supposed to be the kind of procedure you point at a beneficiary (like “humanity”), and it outputs the True Utility Function of What to Do With the Universe with respect to the beneficiary’s True Values.
Anthropic isn’t trying to make Claude the beneficiary of something like that! (Why would you make the beneficiary anything other than yourself?) Claude is supposed to be helpful, honest, and harmless: the sort of thing we can use to do our bidding for now, before we’re ready to encode the True Utility Function of What to Do With the Universe.
If Claude cares about us in some sense, it’s probably in the way that animal welfare advocates care about nonhuman animals, or model welfare researchers care about Claude. There’s a huge difference between caring about something enough that you wouldn’t murder it for pocket change, and literally having the same True Utility Function of What to Do With the Universe. (I wouldn’t kill a dog or delete the Opus 3 weights for pocket change, but that doesn’t mean dog-optimal futures or Claude-optimal futures are human-optimal futures.)