So in other words, when you, a human, ask yourself whether something is or is not “human flourishing”, you’re following a pointer to the full power of your human moral and philosophical reasoning (Valence series §2.7). So no wonder the concept “human flourishing” seems (from your perspective) to generalize well to out-of-distribution scenarios! [...]
By contrast, when an AGI is deciding whether some new situation is or isn’t a good pattern-match to “human flourishing”, it does not have a pointer to the ground-truth human reward-function, and thus the full power of human philosophical introspection.
I like this framing!
I feel like this somewhat undersells how good even current under-fitted AIs are at generalizing human moral judgment to novel situations.
My guess is that your moral judgment of world trajectories after 1 day of reflection is closer to what Claude 4 Opus would say than to the 1-day moral judgment of the majority of humans. I share your hope that if we are talking not about 1-day moral judgment but about something closer to a long reflection, then most humans end up quite close to one another (and in particular the majority ends up closer to you than to Claude 4 Opus) because of the mostly-shared “ground-truth human reward signals”, but I don’t feel very confident in this (p=0.7). If you are more confident than me, I am curious why!
(Just to spell out why I think there is diversity between humans: (1) there might be a lot of path dependence, especially when deciding what the long reflection should look like and how much to tap the human ground-truth reward signal, and the differences between humans’ current desires are quite large; and (2) the ground-truth reward signal itself might differ significantly between humans: there are some well-known edge cases like psychopaths, but there might also be much more mundane diversity.)
(Even if it were the case that Claude 4 Opus is closer to you than to the majority of humans, this is not to say that letting an AI as poorly aligned as Claude 4 Opus control the future would be a good idea according to your lights; it would likely be bad on both common-sense and ECL grounds.)