Ah, but I don’t trust humans to be a trusted source when it comes to what an AI should do with the future lightcone.
First, let’s acknowledge that this is a new objection you are raising which we haven’t discussed yet, eh? I’m tempted to say “moving the goalposts”, but I want to hear your best objections wherever they come from; I just want you to acknowledge that this is in fact a new objection :)
I expect you’d run into something like Scott talks about in The Tails Coming Apart As Metaphor For Life, where humans are making unprincipled and contradictory statements, with not at all enough time spent thinking about the problem.
Scott is describing distributional shift in that essay. Here’s a quote:
The further we go toward the tails, the more extreme the divergences become. Utilitarianism agrees that we should give to charity and shouldn’t steal from the poor, because Utility, but take it far enough to the tails and we should tile the universe with rats on heroin. Religious morality agrees that we should give to charity and shouldn’t steal from the poor, because God, but take it far enough to the tails and we should spend all our time in giant cubes made of semiprecious stones singing songs of praise. Deontology agrees that we should give to charity and shouldn’t steal from the poor, because Rules, but take it far enough to the tails and we all have to be libertarians.
The “distribution” is the set of moral questions that we find ourselves pondering in our everyday lives. Each moral theory (Utilitarianism, religious morality, etc.) is an attempt to make sense of our moral intuitions in a variety of different situations and “fit a curve” through them somehow. The trouble comes when we start considering unusual “off-distribution” moral situations and asking what our moral intuitions say in those situations.
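To make the curve-fitting picture concrete, here is a toy sketch. Everything in it is an assumption for illustration: the "intuition" function, the two one-parameter "theories", and the choice of x = 10 as a far-off situation are all made up. It only shows that two fits can nearly agree on everyday cases and still diverge badly in the tails.

```python
import math

# Hypothetical "everyday" moral situations, encoded as numbers in [-1, 1],
# with a made-up function standing in for our intuitions about them.
def everyday_cases():
    xs = [k / 10 for k in range(-10, 11)]
    ys = [x - 0.1 * x**3 for x in xs]
    return xs, ys

def fit_scale(xs, ys, feature):
    """Least-squares fit of y ~ c * feature(x) for a single coefficient c."""
    num = sum(y * feature(x) for x, y in zip(xs, ys))
    den = sum(feature(x) ** 2 for x in xs)
    return num / den

xs, ys = everyday_cases()
a = fit_scale(xs, ys, lambda x: x)   # "theory A": a straight line
b = fit_scale(xs, ys, math.tanh)     # "theory B": a saturating curve

def theory_a(x):
    return a * x

def theory_b(x):
    return b * math.tanh(x)

def disagreement(x):
    """How far apart the two fitted theories are at situation x."""
    return abs(theory_a(x) - theory_b(x))

# Largest disagreement on the everyday cases vs. at one far-off case.
in_dist_gap = max(disagreement(x) for x in xs)
tail_gap = disagreement(10.0)
print(f"in-distribution gap: {in_dist_gap:.3f}, tail gap: {tail_gap:.3f}")
```

Both theories were fit to the same data, so neither is "wrong" on the everyday cases. The disagreement only shows up once you ask about situations neither was fit on.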
So this isn’t actually a different problem. As Shannon said, once you pare away the extraneous data, you get a simplified problem which represents the core of what needs to be accomplished.
“humans are making unprincipled and contradictory statements, with not at all enough time spent thinking about the problem.”
Yep. I address this in this comment; search for “The problem is that the overseer has insufficient time to reflect on their true values.”
I somewhat agree, but you could equally well call them “learning human behavior at categorizing images,” “learning human behavior at categorizing sentences,” etc.
Sure, so we just have to learn human behavior at categorizing desired/undesired behavior from our AGI. Approval-direction, essentially.
If we build an AI that does exactly what a human would do in that situation (or what action they would choose as correct when assembling a training set), I would consider that a failure.
Eliezer Yudkowsky wrote: “If the subject is Paul Christiano, or Carl Shulman, I for one am willing to say these humans are reasonably aligned; and I’m pretty much okay with somebody giving them the keys to the universe in expectation that the keys will later be handed back.”
Do you agree or disagree with Eliezer? (In other words, do you think a high-fidelity upload of a benevolent person will result in a good outcome?)
If you disagree, it seems that we have no hope at success whatsoever. If no human can be trusted to act, and AGI is going to arise through our actions, then we can’t be trusted to build it right. So we might as well just give up now.
It also sounds silly to say that one can divide the field into cases where you’re doing model-based reinforcement learning, and cases where you aren’t. The point isn’t the division, it’s that model-based reinforcement learning is solving a specific type of problem.
Sure. So my point is, so far, it hasn’t really proven useful to develop methods to generalize specifically in the case where we are learning human preferences. We haven’t really needed to develop special methods to solve this specific type of problem. (Correct me if I’m wrong.) So this all suggests that it isn’t actually a different problem, fundamentally speaking.
By the way, everything I’ve been saying is about supervised learning, not RL.
I agree with the rest of your comment. I’m focused on the second kind of generalization. As you say, work on the first kind may or may not be useful. I think you can get from the second kind (correctly replicating human labels) to the third kind (“superhuman” labels that the overseer wishes they had thought of themselves) based on active learning, as I described earlier.
“I don’t trust humans to be a trusted source when it comes to what an AI should do with the future lightcone.”
First, let’s acknowledge that this is a new objection you are raising which we haven’t discussed yet, eh? I’m tempted to say “moving the goalposts”, but I want to hear your best objections wherever they come from; I just want you to acknowledge that this is in fact a new objection :)
Sure :) I’ve said similar things elsewhere, but I suppose one must sometimes talk to people who haven’t read one’s every word :P
We’re being pretty vague in describing the human-AI interaction here, but I agree that one reason why the AI shouldn’t just do what it would predict humans would tell it to do (or, if below some threshold of certainty, ask a human) is that humans are not immune to distributional shift.
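As a rough sketch of the scheme in question (predict what the human would say, and ask when confidence falls below a threshold), assuming a toy nearest-neighbor predictor over one-dimensional situations. The function names, the threshold value, and the use of neighbor agreement as "certainty" are all hypothetical choices for illustration, not anyone's actual proposal.

```python
from collections import Counter

def predict_with_confidence(labeled, x, k=3):
    """Guess the human's label for x from the k nearest labeled examples.

    Confidence is just the fraction of those neighbors that agree.
    """
    nearest = sorted(labeled, key=lambda ex: abs(ex[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    label, count = votes.most_common(1)[0]
    return label, count / len(nearest)

def act_or_ask(labeled, x, ask_human, threshold=0.9, k=3):
    """Use the predicted label if confident enough, otherwise query the human.

    Returns (label, asked). Answers from the human are added to the labeled
    set, so later predictions can use them.
    """
    label, confidence = predict_with_confidence(labeled, x, k)
    if confidence >= threshold:
        return label, False
    label = ask_human(x)
    labeled.append((x, label))
    return label, True
```

Note what this sketch does on an input far outside the labeled range: its nearest neighbors all sit at one edge of the data and agree with each other, so confidence comes out high and the human is never asked. And even when the human is asked, the answer is simply taken as ground truth, which is where the human's own distributional shift comes in.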
There are also systematic factors, like preserving your self-image, that sometimes make humans say really dumb things about far-off situations because of more immediate concerns.
Lastly, figuring out what the AI should do with its resources is really hard, and deciding which of two complicated choices to call “better” can be hard too, so humans will sometimes do badly at it. In the worst case, the humans appear to answer hard questions with certainty, or conversely, handling the questions the AI is most uncertain about slowly devolves into giving humans hard questions and treating their answers as strong information.
I think the AI should actively take this stuff into account rather than trying to stay in some context where it can unshakeably trust humans. And by “take this into account,” I’m pretty sure I mean modeling the human and treating preferences as objects in the model.
Skipping over the intervening stuff I agree with, here’s that Eliezer quote:
Eliezer Yudkowsky wrote: “If the subject is Paul Christiano, or Carl Shulman, I for one am willing to say these humans are reasonably aligned; and I’m pretty much okay with somebody giving them the keys to the universe in expectation that the keys will later be handed back.”
Do you agree or disagree with Eliezer? (In other words, do you think a high-fidelity upload of a benevolent person will result in a good outcome?)
If you disagree, it seems that we have no hope at success whatsoever. If no human can be trusted to act, and AGI is going to arise through our actions, then we can’t be trusted to build it right. So we might as well just give up now.
I think Upload Paul Christiano would just go on to work on the alignment problem, which might be useful but is definitely passing the buck.
Though I’m not sure. Maybe Upload Paul Christiano would be capable of taking over the world and handling existential threats before swiftly solving the alignment problem. Then it doesn’t really matter if it’s passing the buck or not.
But my original thought wasn’t about uploads (though that’s definitely a reasonable way to interpret my sentence), it was about copying human decision-making behavior in the same sense that an image classifier copies human image-classifying behavior.
Though maybe you went in the right direction anyhow, and if all you had was supervised learning, the right thing to do would be to try to copy the decision-making of a single person (not an upload, a sideload). What was that Greg Egan book? Zendegi?
so far, it hasn’t really proven useful to develop methods to generalize specifically in the case where we are learning human preferences. We haven’t really needed to develop special methods to solve this specific type of problem. (Correct me if I’m wrong.)
There are some cases where the AI specifically has a model of the human, and I’d call those “special methods.” It’s not just inverse reinforcement learning (IRL): imitation learning in general often uses specific methods to model humans, like “value iteration networks.” This is the sort of development I’m thinking of that helps AI do a better job of generalizing human values. I’m not sure if you meant things at a lower level, like using a different gradient descent optimization algorithm.