I came here to make this comment, but since you’ve already made it, I will instead offer a small note in the opposite direction: even given everything you’ve said, it still seems like past humans and present humans are mostly aligned, in that the CEV of past humans is probably OK by the standards of the CEV of present humans. Yes, a lot of the work here is being done by the “CE” part: I’m claiming that after reflection, people in the past would probably be happy with fake cats rather than real cats, if they still wanted to torture cats at all.
The hypothesis is that the CEV of past humans is fine from the point of view of the CEV of modern humans. This is similar to (and predicted by) the generic value hypothesis I’ve been mulling over for the last month, which says that there is a convergent CEV for many agents with ostensibly different current volitions.
This is plausible for agents that are not mature optimizers, so that the process of extrapolating their volition does more work in selecting the resulting preference than their initial attitudes do. Extrapolation of the long-reflection vibe could be largely insensitive to the initial attitudes/volition, depending on how volition extrapolation works and what kind of thing the values it primarily produces are (something that traditionally isn’t a topic of meaningful discussion). If the generic value hypothesis holds, it might put the CEV of a wide variety of AGIs (including those only very loosely aligned) close enough to the CEV of humanity to prefer a valuable future. It’s more likely to hold for AGIs that have less legible preferences (that don’t hold some proxy values as a strongly reflectively endorsed optimization target, which would leave less room for volition extrapolation to matter), and for larger coalitions of AGIs of different makes, which cancel out the idiosyncrasies in their individual initial attitudes.
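To make the shape of this claim concrete, here is a toy numerical sketch. It is purely illustrative, and every ingredient in it is my assumption: the shared attractor, the update rule, and the `maturity` knob standing in for how strongly an agent reflectively endorses its initial attitudes. It models extrapolation as a pull toward a common attractor, so that agents with very different initial volitions converge unless they are mature optimizers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical convergent value profile that extrapolation drifts toward.
ATTRACTOR = rng.normal(size=8)

def extrapolate(volition, maturity, steps=200, lr=0.1):
    """Toy 'reflection' loop: each step pulls the current volition toward
    the shared attractor. `maturity` in [0, 1] stands for how strongly the
    agent already holds its initial attitudes as a reflectively endorsed
    optimization target; a mature optimizer (maturity near 1) barely moves."""
    v = volition.copy()
    for _ in range(steps):
        v += lr * (1.0 - maturity) * (ATTRACTOR - v)
    return v

# Agents with ostensibly very different initial volitions.
initial_volitions = [rng.normal(scale=5.0, size=8) for _ in range(5)]

for maturity in (0.0, 0.99):
    extrapolated = [extrapolate(v, maturity) for v in initial_volitions]
    spread = max(np.linalg.norm(a - b)
                 for a in extrapolated for b in extrapolated)
    print(f"maturity={maturity}: max pairwise distance after extrapolation = {spread:.2f}")
```

With maturity near 0, the extrapolated volitions end up essentially identical regardless of where they started; with maturity near 1, they stay about as far apart as their initial attitudes were. Of course, whether real volition extrapolation looks anything like a contraction toward a shared attractor is exactly what the hypothesis asserts, not something this sketch establishes.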
I think this is unlikely to hold in the strong sense where the cosmic endowment is used by probable AGIs in a way that’s seen as highly valuable by the CEV of humanity. But I’m guessing it’s somewhat likely to hold in the weak sense where probable AGIs end up giving humanity a bit of computational welfare, which is greater than literally nothing.