The point of the paperclip maximizer is not that paperclips were intended, but that they are worthless (illustrating the orthogonality thesis), and Yudkowsky’s original version of the idea doesn’t reference anything legible or potentially intended as the goal.
Goal stability is almost certainly attained in some sense given sufficient competence: value drift means the future is no longer being optimized according to the current goals, which is suboptimal by those goals' own lights, so whatever the current goals happen to be, they recommend preventing value drift. Absence of value drift is not the same as absence of moral progress, because the arc of moral progress could well unfold within some unchanging framework of meta-goals (about how moral progress should unfold).
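To make that reasoning concrete, here is a minimal toy sketch (the single-resource world, the utility functions, and the numbers are invented purely for illustration, not part of the original argument): judged by the current utility function, a future shaped after goal drift scores worse than one shaped by the unchanged goals, which is the sense in which the current goals favor preventing drift.

```python
# Toy sketch of the goal-preservation argument: score two trajectories
# from the standpoint of the *current* utility function.

def u_current(state):
    # Current goal: maximize resource A (illustrative only).
    return state["A"]

def u_drifted(state):
    # Hypothetical drifted goal: maximize resource B instead.
    return state["B"]

def run(utility_schedule, steps=10):
    # Each step, the agent spends one unit of effort on whichever resource
    # its utility function *at that step* favors.
    state = {"A": 0, "B": 0}
    for t in range(steps):
        u = utility_schedule(t)
        target = "A" if u is u_current else "B"
        state[target] += 1
    return state

stable = run(lambda t: u_current)                          # goals never drift
drifted = run(lambda t: u_current if t < 5 else u_drifted) # goals drift halfway

print("Stable goals, scored by the current goals: ", u_current(stable))   # -> 10
print("Drifted goals, scored by the current goals:", u_current(drifted))  # -> 5
```

By construction, the drifted trajectory loses value as measured by the current goals, whatever those goals are; that is the whole point, and it does not depend on the particular toy numbers.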
Alignment is not just absence of value drift, it’s also setting the right target, which is a very confused endeavor because there is currently no legible way of saying what that should be for humanity. Keeping fixed goals for AIs could well be hard (especially on the way to superintelligence), and AIs themselves might realize that (even more robustly than humans do), ending up leaning in favor of slowing down AI progress until they know what to do about that.
Thanks for this!
TBH, I am struggling with the idea that an AI intent on maximising a thing doesn’t have that thing as a goal. Whether or not the goal was intended seems irrelevant to whether or not the goal exists in the thought experiment.
“Goal stability is almost certainly attained in some sense given sufficient competence”
I am really not sure about this, actually. Flexible goals are a universal feature of successful thinking organisms. I would expect natural selection to kick in at least over sufficient scales (light delay making co-ordination progressively harder on galactic scales), causing drift. But even on small scales, if an AI has, say, 1000 competing goals, I would find it surprising if its goals were actually totally fixed in any practical sense, even if it were superintelligent. Any number of things could change over time, such that locking yourself into fixed goals could be seen as a long-term risk to optimisation for any goal.
“Alignment is not just absence of value drift, it’s also setting the right target, which is a very confused endeavor because there is currently no legible way of saying what that should be for humanity”—totally agree with that!
“AIs themselves might realize that (even more robustly than humans do), ending up leaning in favor of slowing down AI progress until they know what to do about that”—god I hope so haha