I came here to make this comment, but since you’ve already made it, I will instead offer a small note in the opposite direction: even given everything you’ve said, it still seems like past humans and present humans are mostly aligned, in that the CEV of past humans is probably OK by the standards of the CEV of present humans. Yes, a lot of the work here is being done by the “CE” part: I’m claiming that after reflection, people in the past would probably be happy with fake cats rather than real cats, if they still wanted to torture cats at all.
The hypothesis is that the CEV of past humans is fine from the point of view of the CEV of modern humans. This is similar to (and predicted by) the generic value hypothesis I’ve been mulling over for the last month, which says that there is a convergent CEV for many agents with ostensibly different current volitions.
This is plausible for agents that are not mature optimizers, so that the process of extrapolating their volition does more work in selecting the resulting preference than their initial attitudes do. Extrapolation of the long-reflection vibe could be largely insensitive to the initial attitudes/volition, depending on how volition extrapolation works and what kind of thing the values it primarily produces are (something that traditionally isn’t a topic of meaningful discussion). If the generic value hypothesis holds, it might put the CEV of a wide variety of AGIs (including those only very loosely aligned) close enough to the CEV of humanity to prefer a valuable future. It’s more likely to hold for AGIs that have less legible preferences (that don’t hold some proxy values as a strongly reflectively endorsed optimization target, which would leave less room for volition extrapolation to matter), and for larger coalitions of AGIs of different makes, which cancel out the idiosyncrasies in their individual initial attitudes.
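To make the shape of this claim concrete, here is a toy numerical sketch. It is purely illustrative, and every ingredient in it is my assumption: the shared attractor, the update rule, and the `maturity` knob standing in for how strongly an agent reflectively endorses its initial attitudes. It models extrapolation as a pull toward a common attractor, so that agents with very different initial volitions converge unless they are mature optimizers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical convergent value profile that extrapolation drifts toward.
ATTRACTOR = rng.normal(size=8)

def extrapolate(volition, maturity, steps=200, lr=0.1):
    """Toy 'reflection' loop: each step pulls the current volition toward
    the shared attractor. `maturity` in [0, 1] stands for how strongly the
    agent already holds its initial attitudes as a reflectively endorsed
    optimization target; a mature optimizer (maturity near 1) barely moves."""
    v = volition.copy()
    for _ in range(steps):
        v += lr * (1.0 - maturity) * (ATTRACTOR - v)
    return v

# Agents with ostensibly very different initial volitions.
initial_volitions = [rng.normal(scale=5.0, size=8) for _ in range(5)]

for maturity in (0.0, 0.99):
    extrapolated = [extrapolate(v, maturity) for v in initial_volitions]
    spread = max(np.linalg.norm(a - b)
                 for a in extrapolated for b in extrapolated)
    print(f"maturity={maturity}: max pairwise distance after extrapolation = {spread:.2f}")
```

With maturity near 0, the extrapolated volitions end up essentially identical regardless of where they started; with maturity near 1, they stay about as far apart as their initial attitudes were. Of course, whether real volition extrapolation looks anything like a contraction toward a shared attractor is exactly what the hypothesis asserts, not something this sketch establishes.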
I think this is unlikely to hold in the strong sense where the cosmic endowment is used by probable AGIs in a way that’s seen as highly valuable by the CEV of humanity. But I’m guessing it’s somewhat likely to hold in the weak sense where probable AGIs end up giving humanity a bit of computational welfare, which is greater than literally nothing.