This issue with PCEV runs into a general problem with alignment targets: should you aim for what’s objectively good, or for what agents can agree on?
(I’m going to go out on a limb and say that the fanatics’ preferences in your thought experiment are objectively bad.)
You can make claims like “PCEV could result in an objectively bad(ish) outcome due to fanatics’ preferences”, but then why not say fanatics are excluded from PCEV? Why not just lean into doing the thing that’s objectively good?
Pragmatically, the problem with excluding people from CEV (or baking in a morality/axiology) is that it makes it harder for people to agree on an alignment target; you might end up with people warring over the target, and the war could be catastrophically bad. But fanatics are sufficiently unpopular that excluding them seems fine in this case. In fact, I would guess that zero humans’ CEV[1] is fanatical in this way: people who advocate for eternal torture are confused, and would not endorse eternal torture upon reflection.
But this introduces the more complicated question of “what pragmatics/special cases should be considered?” That question makes theoretical work hard, although in practice, e.g., Claude’s Constitution is basically a giant list of (often internally contradictory) special cases. I don’t think the way Claude’s Constitution specifies an alignment target will scale to ASI, because the contradictions become untenable.
Separately, I think the fanatics problem is unlikely to matter in practice, because (1) true fanatics are rare, so their vote counts for little; and (2) Hedonistic Imperative-style welfare-optimized minds probably have symmetric-ish capacities for happiness and suffering, unlike evolved beings, where max suffering >>> max happiness, and this dampens how bad it is to introduce suffering as the result of a negotiation. I’m not overwhelmingly confident about either of those points, though.
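To gesture at why (1) might hold, here is a minimal sketch under an assumption of my own: a simple additive aggregation over normalized, bounded utilities (not PCEV’s actual mechanism; if your thought experiment is right, PCEV specifically fails to satisfy a bound like this):

$$U(x) = \frac{1}{N}\sum_{i=1}^{N} u_i(x), \qquad u_i(x) \in [0, 1]$$

If a fraction $\varepsilon$ of the $N$ voters are fanatics, their total contribution to $U(x)$ lies in $[0, \varepsilon]$, so they can shift any outcome’s aggregate score by at most $\varepsilon$. With few fanatics, the aggregate ranking is essentially set by everyone else.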
[1] Insofar as individual humans have a CEV, which actually I don’t think most people do; or at least you need some method of resolving internal contradictions. Resolving contradictions is impossible in theory, but in practice it still happens somehow (sometimes). But that’s a whole other issue.