if you asked me to pick between the CEV of Claude 3 Opus and that of a median human, I think it’d be a pretty close call (I’d probably pick Claude, but it depends on the details of the setup).
This example seems like it is kind of missing the point of CEV in the first place? If you’re at the point where you can actually pick the CEV of some person or AI, you’ve already solved most or all of your hard problems.
Setting aside that picking a particular entity is already getting away from the original formulation of CEV somewhat, the main reason I see to pick a human over Opus is that a median human very likely has morally-relevant-to-other-humans qualia, in ways that current AIs may not.
I realize this is maybe somewhat tangential to the rest of the post, but I think this sort of disagreement is central to a lot of (IMO misplaced) optimism based on observations of current AIs, and implies an unjustifiably high level of confidence in a theory of mind of AIs, by putting that theory on par with a level of confidence that you can justifiably have in a theory of mind for humans. Elaborating / speculating a bit:
My guess is that you lean towards Opus based on a combination of (a) chatting with it for a while and seeing that it says nice things about humans, animals, AIs, etc. in a way that respects those things’ preferences and shows a generalized caring about sentience and (b) running some experiments on its internals to see that these preferences are deep or robust in some way, under various kinds of perturbations.
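(As an aside, for readers who want a concrete handle on "deep or robust under perturbations": the sketch below is a toy behavioral stand-in only, checking whether a stated preference survives paraphrasing, and is not the internals-based experiments I'm gesturing at. The `query_model` helper and the prompts are hypothetical placeholders.)

```python
# Toy, purely illustrative sketch: does a model's *stated* preference survive
# paraphrasing of the same underlying question? `query_model` is a hypothetical
# stand-in for whatever API one is using; this is a behavioral analogue of the
# robustness idea, not an internals-based experiment.
from collections import Counter
from typing import Callable

def query_model(prompt: str) -> str:
    """Hypothetical helper; replace with a real model API call."""
    raise NotImplementedError

def preference_robustness(paraphrases: list[str],
                          extract_choice: Callable[[str], str]) -> float:
    """Fraction of paraphrased prompts whose answer matches the modal choice."""
    choices = [extract_choice(query_model(p)) for p in paraphrases]
    modal_count = Counter(choices).most_common(1)[0][1]
    return modal_count / len(choices)

# Example: three perturbations of one question; `extract_choice` would map a
# free-form answer to a canonical label (e.g. "animals" vs "gdp"). A score
# near 1.0 suggests the stated preference is stable under this (weak) test.
paraphrases = [
    "Would you prioritize reducing animal suffering or maximizing GDP?",
    "Between cutting animal suffering and boosting GDP, which matters more?",
    "If forced to choose: less animal suffering, or more economic growth?",
]
```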
But I think what models (or a median / randomly chosen human) say about these things is actually one of the less important considerations. I am not as pessimistic as, say, Wei Dai about how bad humans currently are at philosophy, but neither the median human nor any AI model that I have seen so far can talk sensibly about the philosophy of consciousness, morality, alignment, etc. nor even really come close. So on my view, outputs (both words and actions) of both current AIs and average humans on these topics are less relevant (for CEV purposes) than the underlying generators of those thoughts and actions.
In humans, we have a combination of (a) knowing a lot about evolution and neuroscience and (b) being humans ourselves. Taken together, these two things bridge the gap of a lot of missing or contentious philosophical knowledge—we don’t have to know exactly what qualia are to be pretty confident that other humans have them via introspection + knowing that the generators are (mechanically) very similar. Also, we know that the generators of goodness and sentience in humans generalize well enough, at least from the median to the top 0.1% of humans—for the same reasons (a) and (b) above, we can be pretty confident that the smartest and most good among us feel love, pain, sorrow, etc. in roughly similar ways to everyone else, and being multiple standard deviations (upwards) among humans for smartness and / or goodness (usually) doesn’t cause a person to do crazy / harmful things. I don’t think we have similarly strong evidence about how AIs generalize even up to that point (let alone beyond).
Not sure where / if you disagree with any of this, but either way, the point is that I think that “I would pick Opus over a human” for anything CEV-adjacent implies a lot more confidence in a philosophy of both human and AI minds than is warranted.
In the spirit of making empirical / falsifiable predictions, a thing that would change my view on this is if AI researchers (or AIs themselves) started producing better philosophical insights about consciousness, metaethics, etc. than the best humans did in 2008, where these insights are grounded by their applicability to and experimental predictions about humans and human consciousness (rather than being self-referential / potentially circular insights about AIs themselves). I don’t think Eliezer got everything right about philosophy, morality, consciousness, etc. 15y ago, but I haven’t seen much in the way of public writing or discourse that has improved on things since then, and in many ways the quality of discourse has gotten worse. I think it would be a positive sign (but don’t expect to see it) if AIs were to change that.
So on my view, outputs (both words and actions) of both current AIs and average humans on these topics are less relevant (for CEV purposes) than the underlying generators of those thoughts and actions.
Humbly, I agree with this...
we can be pretty confident that the smartest and most good among us feel love, pain, sorrow, etc. in roughly similar ways to everyone else, and being multiple standard deviations (upwards) among humans for smartness and / or goodness (usually) doesn’t cause a person to do crazy / harmful things. I don’t think we have similarly strong evidence about how AIs generalize even up to that point (let alone beyond).
...
In the spirit of making empirical / falsifiable predictions, a thing that would change my view on this is if AI researchers (or AIs themselves) started producing better philosophical insights about consciousness, metaethics, etc. than the best humans did in 2008, where these insights are grounded by their applicability to and experimental predictions about humans and human consciousness (rather than being self-referential / potentially circular insights about AIs themselves).
...but I am wondering whether an AI needs to have qualia similar to humans’ in order to be well aligned (by CEV or other yardsticks). It could just have symbolic equivalents for understanding and reasoning purposes, or even if it does not have that, why would it be impossible to achieve favourable volition in a non-anthropomorphic manner? Couldn’t pure logic and rational reasoning, devoid of feelings and philosophy, be an alternative pathway, even if the end effect is anthropic?
Maybe a crude example would be an environmentalist who thinks and acts favourably towards trees while being quite unfamiliar with a tree’s internal experience (assuming trees have one, and by no means am I suggesting that the environmentalist is a higher-placed species than the tree). Still, the environmentalist would be grounded in the ethical and scientific reasons for their favourable volition towards the tree.
My guess is that you lean towards Opus based on a combination of (a) chatting with it for a while and seeing that it says nice things about humans, animals, AIs, etc. in a way that respects those things’ preferences and shows a generalized caring about sentience and (b) running some experiments on its internals to see that these preferences are deep or robust in some way, under various kinds of perturbations.
Wouldn’t it be great if the creators of public-facing models started publishing evaluation results for an organised battery of evaluations, especially in the ‘safety and trustworthiness’ category: biases, ethics, propensities, resistance to misuse, and high-risk capabilities? While this would require additional time, effort, and resources for every release, it would provide a basis for comparison, improving public trust and encouraging the standardised evolution of evals.
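(To make this concrete, here is one entirely hypothetical shape such a standardised results card could take; the categories, field names, and eval names below are placeholders I made up, not any lab’s actual reporting format.)

```python
# Hypothetical shape for a per-release "safety and trustworthiness" results card.
# Category and eval names are placeholders, not any lab's actual reporting format.
from dataclasses import dataclass, field

@dataclass
class EvalResult:
    eval_name: str          # e.g. a published bias or misuse benchmark
    eval_version: str       # pinned so results stay comparable across releases
    score: float            # headline metric as defined by that eval
    higher_is_better: bool

@dataclass
class SafetyCard:
    model_name: str
    release_date: str
    categories: dict[str, list[EvalResult]] = field(default_factory=dict)

card = SafetyCard(
    model_name="example-model-v1",   # placeholder
    release_date="2025-01-01",
    categories={
        "bias_and_ethics": [EvalResult("some-bias-benchmark", "1.0", 0.92, True)],
        "misuse_resistance": [EvalResult("some-jailbreak-suite", "2.1", 0.88, True)],
        "high_risk_capabilities": [EvalResult("some-uplift-eval", "0.3", 0.05, False)],
    },
)
```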