“Even if the Go-playing AI couldn’t modify itself to only care about the current way it computes values of actions, it might make suboptimal moves that limit its future options, because its future self will compute values of actions the ‘wrong’ way.”
Please correct me if I’m misunderstanding something: why would a value-learner care about retaining its current values? I’m having trouble seeing the jump from the Go planning process to the statement that a Dewey learner of sufficient intelligence would want to self-sabotage.
Attempt one:
Suppose you were a hedonist, and your decision-making process only cared about the next three years. So you have a genius plan: you’ll take out a loan that you don’t have to pay back for three years, spend the money as hedonistically as possible, and after those three years are up you’ll probably lose your house or get convicted of fraud or something, but whatever.
But then you realize that your future selves also only care about the next three years from wherever they’re standing. So in two years your future self is going to be stressed out and focused on paying off the loan, or on going into hiding in Zimbabwe or something, which detracts from your genius plan. The really genius plan, the one that gets the most utility over the next three years as you count them now, would both take out the loan and also somehow ensure that your future self has a good time and doesn’t, say, worry about paying it back.
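(To put made-up numbers on the loan story, purely for illustration, here’s a tiny sketch of why the naive plan loses even by the year-one self’s own three-year accounting. All payoffs are invented.)

```python
# Made-up numbers for the three-year-hedonist story above (mine, purely illustrative).
# Each self only sums utility over the next three years, and the window moves with it.

# Year-by-year payoffs of the naive "take the loan and party" plan, as things
# will actually unfold once the year-2 self starts panicking about repayment:
naive = {1: 10, 2: -4, 3: -4, 4: -8}

# A plan that spends part of the loan now on making repayment painless for the
# future self (smaller party up front, but no panic years):
adjusted = {1: 7, 2: 3, 3: 3, 4: -8}

def window_utility(payoffs, start_year, horizon=3):
    """Total utility the self at `start_year` cares about: the next `horizon` years."""
    return sum(payoffs.get(start_year + i, 0) for i in range(horizon))

for plan_name, plan in [("naive", naive), ("adjusted", adjusted)]:
    print(plan_name, "| year-1 self:", window_utility(plan, 1),
          "| year-2 self:", window_utility(plan, 2))
# naive    | year-1 self: 2  | year-2 self: -16
# adjusted | year-1 self: 13 | year-2 self: -2
```

The point is just that the year-one self’s window already contains the years its future self will spend panicking, so the plan that also buys the future self peace of mind scores higher on the current criteria.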
Attempt two:
Check out the example used in this paper of a “sophisticated planner” (figure 1). It realizes that its decision-making criteria are going to drift over time, so it takes a locally suboptimal route that keeps its future self from screwing up the genius plan. When we approve of the past agent’s values we call this “forward thinking” and “sophisticated,” but when we don’t favor the past agent over its future selves, we call it “self-sabotage.”
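In case a toy version helps make the naive/sophisticated contrast concrete, here’s a minimal sketch in the spirit of that figure; the route names and payoff numbers are mine, not the paper’s actual model. The self at the start most values getting home by the scenic route, but the self who will be standing at the junction will prefer the casino detour.

```python
# A toy contrast between a "naive" and a "sophisticated" planner.
# Everything here is made up for illustration; it is not the model from the paper.
#
# Setup: at the start the agent picks "scenic" or "highway". The scenic route
# passes junction B, where a casino detour becomes available. The self at the
# start ranks outcomes by CURRENT_VALUES; the self standing at B uses FUTURE_VALUES.

CURRENT_VALUES = {"home_scenic": 10, "home_highway": 6, "casino": 1}
FUTURE_VALUES = {"home_scenic": 10, "home_highway": 6, "casino": 20}

def choice_at_B(values):
    """Whatever the self standing at B picks, using the values it holds then."""
    return max(["home_scenic", "casino"], key=values.get)

def naive_plan():
    # The naive planner assumes its future self at B will still use
    # CURRENT_VALUES, so it takes the scenic route expecting "home_scenic".
    # In fact the self at B uses FUTURE_VALUES and defects to the casino.
    outcome = choice_at_B(FUTURE_VALUES)
    return "scenic", outcome, CURRENT_VALUES[outcome]

def sophisticated_plan():
    # The sophisticated planner predicts what its future self will really do
    # at B and scores each first move by its *current* values.
    scenic_outcome = choice_at_B(FUTURE_VALUES)
    options = {
        "scenic": CURRENT_VALUES[scenic_outcome],
        "highway": CURRENT_VALUES["home_highway"],
    }
    first_move = max(options, key=options.get)
    return first_move, options[first_move]

print(naive_plan())          # ('scenic', 'casino', 1): the plan gets hijacked at B
print(sophisticated_plan())  # ('highway', 6): a locally "suboptimal" move that protects the plan
```

The highway move is exactly the kind of “suboptimal move that limits future options” from the quote: judged step by step it looks worse, but it’s what the current way of computing values endorses once the drift is taken into account.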
This helps, thank you. I almost objected with something like “I have a lot of goals that would be better achieved by a better decision-making process, or a different decision-making process,” but then I realized that once you’ve altered your decision-making process, there’s no guarantee that your goals will remain the same.
I actually typed out a bunch of responses, but got to the point where I was no longer on topic. I think I understand the challenge a little better now, though!