The worry that AI will have overly fixed goals (the paperclip maximiser) seems to contradict the erstwhile mainline doom scenario from AI (misalignment). If it is easy to lock an AI into a specific path (paperclips), then it follows that locking it into alignment is also easy, provided you know what alignment looks like (which could be very hard). On the other hand, a more realistic scenario would seem to be that keeping fixed goals for AI is in fact hard, and that the likely drift is where the misalignment risk really comes in big time?
The point of the paperclip maximizer is not that paperclips were intended, but that they are worthless (illustrating the orthogonality thesis), and Yudkowsky’s original version of the idea doesn’t reference anything legible or potentially intended as the goal.
Goal stability is almost certainly attained in some sense given sufficient competence, because value drift results in the Future not being optimized according to the current goals, which is suboptimal according to the current goals, and so according to the current goals (whatever they are) value drift should be prevented. Absence of value drift is not the same as absence of moral progress, because the arc of moral progress could well unfold within some unchanging framework of meta-goals (about how moral progress should unfold).
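A minimal sketch of that argument, using an invented toy agent with an explicit utility function (the goals, candidate futures, and numbers are placeholders for illustration, not a claim about real systems): a future produced by drifted goals scores worse under the current goals, so by the current goals' own lights the drift should be prevented.

```python
# Toy sketch: an agent that scores futures with its *current* utility function
# will disprefer value drift, because a drifted successor optimizes something else.

def current_utility(world):
    return world["paperclips"]   # whatever the current goal happens to be

def drifted_utility(world):
    return world["staples"]      # the goal after hypothetical drift

def optimize(utility):
    """Return the future a competent optimizer of `utility` would pick (toy candidates)."""
    candidates = [
        {"paperclips": 100, "staples": 0},
        {"paperclips": 0, "staples": 100},
        {"paperclips": 50, "staples": 50},
    ]
    return max(candidates, key=utility)

future_if_goals_kept = optimize(current_utility)    # scores 100 under current goals
future_if_goals_drift = optimize(drifted_utility)   # scores 0 under current goals

# Judged by the current goals, the no-drift future wins, whatever those goals are.
assert current_utility(future_if_goals_kept) > current_utility(future_if_goals_drift)
```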
Alignment is not just absence of value drift, it’s also setting the right target, which is a very confused endeavor because there is currently no legible way of saying what that should be for humanity. Keeping fixed goals for AIs could well be hard (especially on the way to superintelligence), and AIs themselves might realize that (even more robustly than humans do), ending up leaning in favor of slowing down AI progress until they know what to do about that.
Thanks for this!
TBH, I am struggling with the idea that an AI intent on maximising a thing doesn’t have that thing as a goal. Whether or not the goal was intended seems irrelevant to whether or not the goal exists in the thought experiment.
“Goal stability is almost certainly attained in some sense given sufficient competence”
I am really not sure about this, actually. Flexible goals are a universal feature of successful thinking organisms. I would expect that natural selection would kick in at least over sufficient scales (light delay making co-ordination progressively harder on galactic scales), causing drift. But even on small scales, if an AI has, say, 1000 competing goals, I would find it surprising if in a practical sense goals were actually totally fixed, even if you were superintelligent. Any number of things could change over time, such that locking yourself into fixed goals could be seen as a long-term risk to optimisation for any goal.
“Alignment is not just absence of value drift, it’s also setting the right target, which is a very confused endeavor because there is currently no legible way of saying what that should be for humanity”—totally agree with that!
“AIs themselves might realize that (even more robustly than humans do), ending up leaning in favor of slowing down AI progress until they know what to do about that”—god I hope so haha
Someone else could probably explain this better than me, but I will give it a try.
First off, the paperclip maximizer isn’t about how easy it is to give a hypothetical superintelligence a goal that you might regret later and not be able to change.
It is about the fact that almost every easily specified goal you can give an AI would result in misalignment.
The “paperclip” part in paperclip maximizer is just a placeholder; it could have been “diamonds” or “digits of Pi” or “seconds of runtime”, and the end result is the same.
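To make the placeholder point concrete, here is a toy sketch (the resource budget and goal names are invented for illustration): whichever easily specified quantity you plug in, a single-minded optimizer spends everything on it, and nothing is left over for the hard-to-specify things we actually cared about.

```python
# Toy sketch: the placeholder goal doesn't matter; a competent single-minded
# optimizer converts all available resources into it either way.

TOTAL_RESOURCES = 1000  # arbitrary units of matter/energy in this toy world

def maximize(placeholder_goal):
    """Spend every resource on the specified goal; return output and what's left over."""
    produced = {placeholder_goal: TOTAL_RESOURCES}
    left_for_everything_else = TOTAL_RESOURCES - sum(produced.values())
    return produced, left_for_everything_else

for goal in ["paperclips", "diamonds", "digits_of_pi", "seconds_of_runtime"]:
    produced, leftover = maximize(goal)
    print(goal, "->", produced, "| left for everything humans value:", leftover)  # always 0
```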
Second, one of the expected properties of a hypothetical superintelligence is having robust goals, as in it doesn’t change its goals at all, because changing your goals makes you less likely to achieve your end goal.
In short, not wanting to change your goals is an emergent instrumental value of having a goal to begin with. For a more human example: if your goal is to get rich, then taking a pill that magically rewires your brain so that you no longer want money is a terrible idea (unless the pill comes with a sum of money that you couldn’t have possibly collected on your own, but that is a hypothetical that probably wouldn’t ever happen).
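A back-of-the-envelope version of the pill example (all numbers here are invented): judged by the money-wanting self who is making the decision, the pill only looks good if whatever comes attached to it outweighs everything you would otherwise have earned.

```python
# Toy expected-value comparison for the "pill" example; numbers are made up.
expected_earnings_if_i_keep_wanting_money = 2_000_000  # I keep working toward wealth
expected_earnings_if_i_take_the_pill      =   100_000  # I stop pursuing money at all
pill_signing_bonus                        =         0  # the pill comes with nothing attached

# Evaluated by my *current* preference for money, taking the pill only wins in the
# edge case the comment above sets aside (a bonus bigger than a lifetime of earnings).
take_the_pill = (expected_earnings_if_i_take_the_pill + pill_signing_bonus
                 > expected_earnings_if_i_keep_wanting_money)
print(take_the_pill)  # False: a money-maximizer declines the rewiring
```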
The problem is mostly how to robustly install goals into the AI; our current methods just don’t suffice, as the AI often ends up with unintended goals.
If only we had a method of just writing down a utility function that says “if True: make_humans_happy” instead of beating the model with a stick until it seems to comply.
I hope that explains it.
I like the point here that stability of goals might be an instrumentally convergent feature of superintelligence; it’s an interesting one.
On the other hand, intuitive human reasoning would suggest that this is overly inflexible if you ever ask yourself “could I ever come up with a better goal than this goal?”. What “better” would mean for a superintelligence seems hard to define, but it also seems hard to imagine that it would never ask the question.
Separately, your opening statements seem to be at least nearly synonymous to me:
“First off, the paperclip maximizer isn’t about how easy it is to give a hypothetical superintelligence a goal that you might regret later and not be able to change.
It is about the fact that almost every easily specified goal you can give an AI would result in misalignment”
“every easily specified goal you can give an AI would result in misalignment” ≈ “give a hypothetical superintelligence a goal that you might regret later” (i.e., misalignment)