I’m a little confused about what this hopes to accomplish. To start, I’m confused by your example of “preferences not about future states” (i.e. ‘the pizza shop employee is running around frantically, and I am laughing’ is a future state).
Beyond that, I’m not sure what the mixing of “paperclips” vs “humans remain in control” accomplishes. On the one hand, I think if you can specify “humans remain in control” safely, you’ve solved the alignment problem already. On another, I wouldn’t want that to seize the future: There are potentially much better futures where humans are not in control, but still alive/free/whatever. (e.g. the Sophotechs in the Golden Oecumene are very much in control). On a third, I would definitely, a lot, very much, prefer a 3 star ‘paperclips’ and 5 star ‘humans in control’ to a 5 star ‘paperclips’ and a 3 star ‘humans in control’, even though both would average 4 stars.
‘the pizza shop employee is running around frantically, and I am laughing’ is a future state
In my post I wrote: “To be more concrete, if I’m deciding between two possible courses of action, A and B, “preference over future states” would make the decision based on the state of the world after I finish the course of action—or more centrally, long after I finish the course of action. By contrast, “other kinds of preferences” would allow the decision to depend on anything, even including what happens during the course-of-action.”
So “the humans will ultimately wind up in control” would be a preference-over-future-states, and this preference would allow (indeed encourage) the AGI to disempower and later re-empower humans. By contrast, “the humans will remain in control” is not a pure preference-over-future-states, and relatedly does not encourage the AGI to disempower and later re-empower humans.
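To make the distinction concrete, here is a toy sketch (my own construction, not from the post; the plan representations are placeholders): two plans that end in the same final state, where an end-state preference cannot tell them apart but a trajectory preference can.

```python
# Toy illustration of end-state vs. trajectory preferences.
# Each plan is a sequence of world states; "humans_in_control" is a
# property of each state along the way.

# Plan A: the humans stay in control throughout.
plan_a = [{"humans_in_control": True}] * 4

# Plan B: the AGI disempowers the humans mid-plan, then re-empowers them.
plan_b = [
    {"humans_in_control": True},
    {"humans_in_control": False},  # disempowered...
    {"humans_in_control": False},
    {"humans_in_control": True},   # ...then re-empowered
]

def end_state_preference(plan):
    """'The humans will ultimately wind up in control': only the final state matters."""
    return plan[-1]["humans_in_control"]

def trajectory_preference(plan):
    """'The humans will remain in control': every state along the way matters."""
    return all(state["humans_in_control"] for state in plan)

# The end-state preference can't distinguish the two plans; the
# trajectory preference penalizes the disempowerment in plan B.
print(end_state_preference(plan_a), end_state_preference(plan_b))    # True True
print(trajectory_preference(plan_a), trajectory_preference(plan_b))  # True False
```

The point is only that the second scoring rule depends on intermediate states, so “disempower, then re-empower” no longer looks just as good as “never disempower”.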
There are potentially much better futures where humans are not in control
If we knew exactly what long-term future we wanted, and we knew how to build an AGI that definitely also wanted that exact same long-term future, then we should certainly do that, instead of making a corrigible AGI. Unfortunately, we don’t know those things right now, so under the circumstances, knowing how to make a corrigible AGI would be a useful thing to know how to do.
Also, this is not a hyper-specific corrigibility proposal; it’s really a general AGI-motivation-sculpting proposal, applied to corrigibility. So even if you’re totally opposed to corrigibility, you can still take an interest in the question of whether or not my proposal is fundamentally doomed, because I think everyone agrees that AGI-motivation-sculpting is necessary.
I would definitely, a lot, very much, prefer a 3 star ‘paperclips’ and 5 star ‘humans in control’ to a 5 star ‘paperclips’ and a 3 star ‘humans in control’, even though both would average 4 stars?
It could be a weighted average. It could be a weighted average plus a nonlinear acceptability threshold on “humans in control”. It could be other things. I don’t know; this is one of many important open questions.
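Those first two options can be sketched in a few lines (the weights and the threshold value below are placeholders I made up, not part of any proposal); the sketch also shows how a nonlinear acceptability threshold handles the 3-star/5-star example from the comment above.

```python
# Sketch of two ways to combine the "paperclips" and "humans in control"
# star ratings. Weights and floor are illustrative assumptions.

def weighted_average(paperclips, humans_in_control, w=0.5):
    # Plain weighted average of the two ratings.
    return w * paperclips + (1 - w) * humans_in_control

def with_threshold(paperclips, humans_in_control, w=0.5, floor=4):
    # Weighted average plus a nonlinear acceptability threshold:
    # any outcome where "humans in control" falls below the floor
    # is treated as unacceptable, regardless of the paperclip score.
    if humans_in_control < floor:
        return float("-inf")
    return weighted_average(paperclips, humans_in_control, w)

# Both outcomes average 4 stars under the plain weighted average...
print(weighted_average(3, 5), weighted_average(5, 3))  # 4.0 4.0
# ...but the threshold version prefers 3-star paperclips / 5-star control.
print(with_threshold(3, 5))  # 4.0
print(with_threshold(5, 3))  # -inf
```

So a pure weighted average can’t express that preference, but a weighted average plus a threshold can; which aggregation is actually right is the open question.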
I think if you can specify “humans remain in control” safely, you’ve solved the alignment problem already
Am I correct, after reading this, that this post is heavily related to embedded agency? I may have misunderstood the general attitudes, but I thought of “future states” as “future to now”, not “future to my action”. It seems like you couldn’t possibly create a thing that works on the latter, unless you intend it to set everything in motion and then terminate. In the embedded agency sequence, they point out that embedded agents don’t have well-defined i/o channels. One way this shows up is that “action” is not a well-defined term, and is often not atomic.
I’m not sure I interpret corrigibility as exactly the same as “preferring the humans remain in control” (I see you suggest this yourself in Objection 1; I wrote this before I reread that, but I’m going to leave it as is), and if you programmed that preference into a non-corrigible AI, it would still seize the future into states where the humans have to remain in control. Better than doom, but not ideal if we can avoid it with actual corrigibility.
But I think I miscommunicated, because, besides the above, I agree with everything else in those two paragraphs.
I think I maintain that this feels like it doesn’t solve much. Much of the discussion in the Yudkowsky conversations was about the concern of how to point powerful systems in any direction at all. Your response to Objection 1 admits you don’t claim to solve that, but that’s most of the problem. If we do solve the problem of how to point a system at some abstract concept, why would we choose “the humans remain in control” and not “pursue humanity’s CEV”? Do you expect “the humans remain in control” (or the combination of concepts you propose as an alternative) to be easier to define? Enough easier that it’s worth pursuing even if we might choose the wrong combination of concepts? Do you expect something else?
See discussion under “Objection 1” in my post.
It also sounds like you’re trying to suggest that we should be judging trajectories, not states? I just want to note that this is, as far as I can tell, the plan: https://www.lesswrong.com/posts/K4aGvLnHvYgX9pZHS/the-fun-theory-sequence
From the synopsis of High Challenge: