“I don’t trust humans to be a trusted source when it comes to what an AI should do with the future lightcone.”
First, let’s acknowledge that this is a new objection you are raising which we haven’t discussed yet, eh? I’m tempted to say “moving the goalposts”, but I want to hear your best objections wherever they come from; I just want you to acknowledge that this is in fact a new objection :)
Sure :) I’ve said similar things elsewhere, but I suppose one must sometimes talk to people who haven’t read one’s every word :P
We’re being pretty vague in describing the human-AI interaction here, but I agree that one reason why the AI shouldn’t just do what it would predict humans would tell it to do (or, if below some threshold of certainty, ask a human) is that humans are not immune to distributional shift.
There are also systematic factors, like preserving your self-image, that sometimes make humans say really dumb things about far-off situations because of more immediate concerns.
Lastly, figuring out what the AI should do with its resources is really hard, and figuring out which of two complicated choices to call “better” can be hard too, and humans will sometimes do badly at it. Worst case, the humans answer hard questions with apparent certainty, or, approaching it from the other side, resolving the questions the AI is most uncertain about slowly devolves into handing humans hard questions and treating their answers as strong evidence.
I think the AI should actively take this stuff into account rather than trying to stay in some context where it can unshakeably trust humans. And by “take this into account,” I’m pretty sure that means model the human and treat preferences as objects in the model.
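To make “preferences as objects in the model” concrete, here’s a minimal sketch of the kind of thing I mean; everything in it (the hypothesis names, the reliability number, the ask threshold) is invented for illustration, not a proposal:

```python
import math

# Toy version of "treat preferences as objects in the model": the AI keeps an
# explicit posterior over which preference hypothesis the human holds, treats
# human answers as noisy evidence rather than commands, and only asks a human
# when its uncertainty is high. All names here are invented for illustration.

hypotheses = ["values_A", "values_B", "values_C"]
posterior = {h: 1.0 / len(hypotheses) for h in hypotheses}

# Chance that a human's stated answer matches what their true values imply;
# keeping this below 1.0 encodes "humans are noisy, especially off-distribution".
ANSWER_RELIABILITY = 0.8

def predicted_answer(hypothesis, question):
    """Stand-in for a real model of what a human with these values would say."""
    return hash((hypothesis, question)) % 2  # dummy yes/no answer

def update_on_answer(question, observed_answer):
    """Bayesian update: the human's answer is evidence about their values."""
    for h in hypotheses:
        if predicted_answer(h, question) == observed_answer:
            posterior[h] *= ANSWER_RELIABILITY
        else:
            posterior[h] *= 1.0 - ANSWER_RELIABILITY
    total = sum(posterior.values())
    for h in hypotheses:
        posterior[h] /= total

def should_ask_human():
    """Query the human only while the posterior over preferences is spread out."""
    entropy = -sum(p * math.log(p) for p in posterior.values() if p > 0)
    return entropy > 0.5  # arbitrary threshold for this sketch
```

The point of the toy is just that the human shows up inside the model, as a noisy source of evidence about a latent thing, rather than as an oracle whose outputs are followed directly.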
Skipping over the intervening stuff I agree with, here’s that Eliezer quote:
Eliezer Yudkowsky wrote: “If the subject is Paul Christiano, or Carl Shulman, I for one am willing to say these humans are reasonably aligned; and I’m pretty much okay with somebody giving them the keys to the universe in expectation that the keys will later be handed back.”
Do you agree or disagree with Eliezer? (In other words, do you think a high-fidelity upload of a benevolent person will result in a good outcome?)
If you disagree, it seems that we have no hope of success whatsoever. If no human can be trusted to act, and AGI is going to arise through our actions, then we can’t be trusted to build it right. So we might as well just give up now.
I think Upload Paul Christiano would just go on to work on the alignment problem, which might be useful but is definitely passing the buck.
Though I’m not sure. Maybe Upload Paul Christiano would be capable of taking over the world and handling existential threats before swiftly solving the alignment problem. Then it doesn’t really matter if it’s passing the buck or not.
But my original thought wasn’t about uploads (though that’s definitely a reasonable way to interpret my sentence), it was about copying human decision-making behavior in the same sense that an image classifier copies human image-classifying behavior.
Though maybe you went in the right direction anyhow, and if all you had was supervised learning, the right thing to do would be to try to copy the decision-making of a single person (not an upload, a sideload). What was that Greg Egan book, Zendegi?
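To be concrete about the analogy: the kind of copying I have in mind is plain behavior cloning, i.e. fit a model to (situation, human decision) pairs the same way you’d fit an image classifier to (image, label) pairs. A minimal sketch, assuming scikit-learn, with all data and features invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Behavior cloning as ordinary supervised learning: each row of X is a
# featurized situation, each entry of y is the decision a human actually made
# in it. The data here is an invented placeholder.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))             # 500 logged situations, 8 features each
y = (X[:, 0] + X[:, 3] > 0).astype(int)   # stand-in for recorded human decisions

clone = LogisticRegression().fit(X, y)

# The "sideload" then answers new situations the way an image classifier labels
# new images: by generalizing from the labeled examples, with no guarantee it
# generalizes the way the original person would off-distribution.
new_situation = rng.normal(size=(1, 8))
print(clone.predict(new_situation))
```

The off-distribution caveat in the last comment is exactly the worry from earlier: the clone inherits the training distribution, not the person.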
So far, it hasn’t really proven useful to develop generalization methods specific to the case where we are learning human preferences. We haven’t really needed special methods for this particular type of problem. (Correct me if I’m wrong.)
There are some cases where the AI specifically has a model of the human, and I’d call those “special methods.” It’s not just IRL; imitation learning as a whole often uses methods built specifically around modeling humans, like value iteration networks. This is the sort of development I’m thinking of that helps AI do a better job of generalizing human values. I’m not sure if you meant things at a lower level, like using a different gradient descent optimization algorithm.
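By “special methods” I mean approaches where the learner carries an explicit model of the human. The simplest example I can give is a toy IRL setup: assume the demonstrator chooses actions softmax-rationally with respect to some unknown reward, then infer that reward from observed choices. Everything below is invented for illustration, not a real system:

```python
import numpy as np

# Toy inverse reinforcement learning: instead of copying actions directly,
# assume the human chooses actions softmax-rationally w.r.t. an unknown reward
# vector, and infer that vector from observed choices.

rng = np.random.default_rng(0)
n_actions = 4
true_reward = np.array([1.0, 0.2, -0.5, 0.0])   # hidden "human values"
temperature = 1.0

def choice_probs(reward):
    z = np.exp(reward / temperature)
    return z / z.sum()

# Simulated human demonstrations: noisy but reward-driven choices.
demos = rng.choice(n_actions, size=200, p=choice_probs(true_reward))

def log_likelihood(reward):
    p = choice_probs(reward)
    return np.log(p[demos]).sum()

# Crude maximum-likelihood search over random candidate reward vectors.
candidates = [rng.normal(size=n_actions) for _ in range(2000)]
best = max(candidates, key=log_likelihood)

# The learner now has a model of the human's *values*, reusable in new
# situations, rather than just a copy of old action frequencies.
print("inferred reward (up to an additive constant):", np.round(best, 2))
```

The design difference from behavior cloning is what gets carried forward: a reward function that can be optimized in new situations, versus a policy that only reproduces the old choice distribution.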