Julian Bradshaw comments on Moral Alignment: An Idea I’m Embarrassed I Didn’t Think of Myself

Julian Bradshaw 18 Jun 2025 20:02 UTC
2 points
I admit:
1. Human preferences don’t fully cohere, especially when extrapolated
2. There are many ways in which “Humanity’s CEV” is fuzzy or potentially even impossible to fully specify
But I think the concept has staying power because it points to a practical idea of “the AI acts in a way such that most humans think it mostly shares their core values”.^[1] LLMs already aren’t far from this bar with their day-to-day behavior, so it doesn’t seem obviously impossible.
To go back to agreeing with you: yes, adding new types of beings as primary sources of values to the CEV would introduce far more conflicting sets of preferences (predator vs. prey, parasites, species competing for the same niche, etc.), maybe to the point that trying to combine them would be totally incoherent. That’s a strong objection to the “all beings everywhere” idea. It’d certainly be simpler to enforce human preferences on animals.
1. ^ I think of this as meaning the AI isn’t enforcing niche values (“everyone now has to wear Mormon undergarments in order to save their eternal soul”), is not taking obviously horrible actions (“time to unleash the Terminators!”), and is taking some obviously good actions (“I will save the life of this 3-year-old with cancer”). Obviously it would have to be neutral on a lot of things, but there’s quite a lot most humans have in common.