The power that it has comes from our knowledge about the world that we have encoded into it.
This knowledge could also come from other sources, e.g. transfer learning.
We know that human children are capable of generalizing from many fewer examples than ML algorithms. That suggests human brains are fundamentally better at learning in some sense. I think we’ll be able to replicate this capability before we get to AGI.
For value learning, we want the AI to have a very specific sort of generalization skill when it comes to humans. It must not only predict human actions; it must also form a very particular generalization (“human values”) and single out part of that generalization to make plans with.
As Ian Goodfellow puts it, machine learning people have already been working on alignment for decades. If alignment is “learning and respecting human preferences”, object recognition is “learning human preferences about how to categorize images”, and sentiment analysis is “learning human preferences about how to categorize sentences”.
I’ve never heard anyone in machine learning divide the field into cases where we’re trying to generalize about human values and cases where we aren’t. It seems like the same set of algorithms, tricks, etc. work either way.
Claude Shannon once wrote:

Suppose that you are given a problem to solve, I don’t care what kind of a problem — a machine to design, or a physical theory to develop, or a mathematical theorem to prove, or something of that kind — probably a very powerful approach to this is to attempt to eliminate everything from the problem except the essentials; that is, cut it down to size. Almost every problem that you come across is befuddled with all kinds of extraneous data of one sort or another; and if you can bring this problem down into the main issues, you can see more clearly what you’re trying to do and perhaps find a solution. Now, in so doing, you may have stripped away the problem that you’re after. You may have simplified it to a point that it doesn’t even resemble the problem that you started with; but very often if you can solve this simple problem, you can add refinements to the solution of this until you get back to the solution of the one you started with.
In other words, I think trying to find the “essential core” of a problem is a good problem-solving strategy, including for a problem like friendliness. I have yet to see a non-handwavey argument against the idea that generalization is the “essential core” of friendliness.
The information to pick out one particular generalization rather than another has to come from humans doing hard, complicated work, even if it gets encoded into the algorithm.
I actually think the work humans do can be straightforward and easy. Something like: have the system find every possible generalization which seems reasonable, then synthesize examples those generalizations disagree on. Keep asking the humans about those synthesized examples until you’ve narrowed the set of generalizations the human plausibly wants down to the point where you can be reasonably confident about the human’s desired behavior in a particular circumstance.
I think this sort of approach is typically referred to as “active learning” or “machine teaching” by ML practitioners. But it’s not too different from the procedure that you would use to learn about someone’s values if you were visiting a foreign country.
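To make that concrete, here’s a minimal sketch of the query-by-committee flavor of this idea. Everything in it, the synthetic data, the stand-in human oracle, the committee of small models, is invented for illustration; a real system would synthesize novel examples rather than pick from a fixed pool:

```python
# A minimal query-by-committee sketch. The data, the human_label oracle,
# and the committee of small trees are all stand-ins for illustration.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def human_label(x):
    """Stand-in for querying the human about one synthesized example."""
    return int(x[0] + x[1] > 1.0)

X = rng.random((5, 2))                      # a handful of initial examples
y = np.array([human_label(x) for x in X])
pool = rng.random((500, 2))                 # candidate questions to ask

for _ in range(10):
    # "Find every possible generalization which seems reasonable":
    # approximated here by a committee trained on bootstrap resamples.
    committee = [
        DecisionTreeClassifier().fit(X[idx], y[idx])
        for idx in (rng.integers(0, len(X), len(X)) for _ in range(7))
    ]
    # "Synthesize examples those generalizations disagree on": pick the
    # candidate where the committee's votes are most evenly split.
    votes = np.array([m.predict(pool) for m in committee]).mean(axis=0)
    query = np.abs(votes - 0.5).argmin()
    # "Keep asking the humans about those synthesized examples":
    X = np.vstack([X, pool[query]])
    y = np.append(y, human_label(pool[query]))

print(f"examples labeled so far: {len(y)}")
```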
Ah, but I don’t trust humans to be a reliable source when it comes to what an AI should do with the future lightcone. I expect you’d run into something like what Scott describes in The Tails Coming Apart As Metaphor For Life, where humans make unprincipled and contradictory statements without nearly enough time spent thinking about the problem.
As Ian Goodfellow puts it, machine learning people have already been working on alignment for decades. If alignment is “learning and respecting human preferences”, object recognition is “learning human preferences about how to categorize images”, and sentiment analysis is “learning human preferences about how to categorize sentences”.
I somewhat agree, but you could equally well call them “learning human behavior at categorizing images,” “learning human behavior at categorizing sentences,” etc. I don’t think that’s enough. If we build an AI that does exactly what a human would do in that situation (or what action they would choose as correct when assembling a training set), I would consider that a failure.
So these are two separate problems: one, I think humans can’t reliably tell an AI what they value through a text channel, even with prompting; and two, I think that mimicking human behavior, even human behavior on moral questions, is insufficient to deal with the possibilities of the future.
I’ve never heard anyone in machine learning divide the field into cases where we’re trying to generalize about human values and cases where we aren’t. It seems like the same set of algorithms, tricks, etc. work either way.
It also sounds silly to say that one can divide the field into cases where you’re doing model-based reinforcement learning, and cases where you aren’t. The point isn’t the division, it’s that model-based reinforcement learning is solving a specific type of problem.
Let me take another go at the distinction: Suppose you have a big training set of human answers to moral questions. There are several different things you could mean by “generalize well” in this case, which correspond to solving different problems.
The first kind of “generalize well” is where the task is to predict moral answers drawn from the same distribution as the training set. This is what most of the field is doing right now for Ian Goodfellow’s examples of categorizing images or categorizing sentences. The better we get at generalizing in this sense, the more reproducing the training set corresponds to reproducing the test set.
Another sort of “generalize well” might be inferring a larger “real world” distribution even when the training set is limited. For example, if you’re given handwritten digits 0–20 labeled with their binary representations, can you give the correct binary output for 21? How about 33? In our moral-questions example, this would be like predicting answers to moral questions spawned by novel situations not seen in training. The better we get at generalizing in this sense, the more reproducing the training set corresponds to reproducing examples later drawn from the real world.
Let’s stop here for a moment and point out that if we want generalization in the second sense, algorithmic advances in the first sense might be useful, but they aren’t sufficient. For the classifier to output the binary for 33, it probably has to be deliberately designed to learn flexible representations, and probably get fed some additional information (e.g. by transfer learning). When the training distribution and the “real world” distribution are different, you’re solving a different problem than when they’re the same.
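To make the digits example concrete, here’s a tiny illustration (with integers standing in for handwritten digits) of why no amount of type-1 progress lets a fixed-label classifier answer for 21 or 33:

```python
# Illustrative only: integers stand in for handwritten digits. A model
# whose outputs are drawn from the labels seen in training cannot answer
# for 21 or 33, however well it fits the training distribution.
def to_binary(n: int) -> str:
    return format(n, "b")

train = {n: to_binary(n) for n in range(21)}   # digits 0-20, as above

seen_labels = set(train.values())
assert to_binary(21) not in seen_labels   # "10101" was never a training label
assert to_binary(33) not in seen_labels   # "100001" is longer than any label

# A structured model that emits one bit at a time could represent these
# answers, but that is a different architecture solving a different
# problem, not a better-tuned version of the fixed-label classifier.
```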
A third sort of “generalize well” is to learn superhumanly skilled answers even if the training data is flawed or limited. Think of an agent that learns to play Atari games at a superhuman level, from human demonstrations. This generalization task often involves filling in a complex model of the human “expert,” along with learning about the environment—for current examples, the model of the human is usually hand-written. The better we get at generalizing in this way, the more the AI’s answers will be like “what we meant” (either by some metric we kept hidden from the AI, or in some vague intuitive sense) even if they diverge from what humans would answer.
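For concreteness, the usual hand-written model of the human “expert” is something like Boltzmann (noisily-rational) choice; here’s a sketch with made-up action values:

```python
# A common hand-written model of the human "expert": Boltzmann
# (noisily-rational) choice. The Q-values below are made up; in IRL-style
# setups they would come from the reward function being inferred.
import numpy as np

def human_action_probs(q_values: np.ndarray, beta: float = 2.0) -> np.ndarray:
    """P(action) proportional to exp(beta * Q); larger beta means a more
    reliably optimal human."""
    logits = beta * (q_values - q_values.max())   # subtract max for stability
    p = np.exp(logits)
    return p / p.sum()

q = np.array([1.0, 0.5, -1.0])   # hypothetical action values in one state
print(human_action_probs(q))      # the "expert" usually, not always, picks a0
```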
(I’m sure there are more tasks that fall under the umbrella of “generalization,” but you’ll have to suggest them yourself :) )
So while I’d say that value learning involves generalization, I think that generalization can mean a lot of different tasks—a rising tide of type 1 generalization (which is the mathematically simple kind) won’t lift all boats.
Ah, but I don’t trust humans to be a reliable source when it comes to what an AI should do with the future lightcone.
First, let’s acknowledge that this is a new objection you are raising which we haven’t discussed yet, eh? I’m tempted to say “moving the goalposts”, but I want to hear your best objections wherever they come from; I just want you to acknowledge that this is in fact a new objection :)
I expect you’d run into something like what Scott describes in The Tails Coming Apart As Metaphor For Life, where humans make unprincipled and contradictory statements without nearly enough time spent thinking about the problem.
Scott is describing distributional shift in that essay. Here’s a quote:
The further we go toward the tails, the more extreme the divergences become. Utilitarianism agrees that we should give to charity and shouldn’t steal from the poor, because Utility, but take it far enough to the tails and we should tile the universe with rats on heroin. Religious morality agrees that we should give to charity and shouldn’t steal from the poor, because God, but take it far enough to the tails and we should spend all our time in giant cubes made of semiprecious stones singing songs of praise. Deontology agrees that we should give to charity and shouldn’t steal from the poor, because Rules, but take it far enough to the tails and we all have to be libertarians.
The “distribution” is the set of moral questions that we find ourselves pondering in our everyday lives. Each moral theory (Utilitarianism, religious morality, etc.) is an attempt to make sense of our moral intuitions in a variety of different situations and “fit a curve” through them somehow. The trouble comes when we start considering unusual “off-distribution” moral situations and asking what our moral intuitions say in those situations.
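The curve-fitting metaphor can be made literal. In this toy sketch (arbitrary data, arbitrary model classes), two models agree almost perfectly on the everyday range and then diverge in the tails:

```python
# Arbitrary data and model classes, purely to illustrate the metaphor.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 50)                   # the "everyday" range of situations
y = x + 0.05 * rng.standard_normal(50)      # shared intuitions, plus noise

linear = np.polynomial.Polynomial.fit(x, y, deg=1)   # one "moral theory"
cubic = np.polynomial.Polynomial.fit(x, y, deg=3)    # another "moral theory"

print(abs(linear(0.5) - cubic(0.5)))    # tiny: the theories agree in-distribution
print(abs(linear(10.0) - cubic(10.0)))  # typically large: the tails come apart
```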
So this isn’t actually a different problem. As Shannon said, once you pare away the extraneous data, you get a simplified problem which represents the core of what needs to be accomplished.
humans make unprincipled and contradictory statements without nearly enough time spent thinking about the problem.
Yep. I address this in this comment; search for “The problem is that the overseer has insufficient time to reflect on their true values.”
I somewhat agree, but you could equally well call them “learning human behavior at categorizing images,” “learning human behavior at categorizing sentences,” etc.
Sure, so we just have to learn human behavior at categorizing desired/undesired behavior from our AGI. Approval-direction, essentially.
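In its simplest form that looks something like the sketch below, where approval_model is a placeholder for a model trained on human ratings of (situation, action) pairs:

```python
# Approval-direction in its simplest form: score candidate actions by
# predicted human approval and take the best. `approval_model` is a
# placeholder for a model trained on human ratings of (situation, action)
# pairs; nothing here is a real API.
from typing import Callable, Sequence, TypeVar

A = TypeVar("A")

def approval_directed_choice(
    situation: str,
    candidates: Sequence[A],
    approval_model: Callable[[str, A], float],
) -> A:
    """Pick the candidate the human is predicted to approve of most."""
    return max(candidates, key=lambda a: approval_model(situation, a))
```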
If we build an AI that does exactly what a human would do in that situation (or what action they would choose as correct when assembling a training set), I would consider that a failure.
Eliezer Yudkowsky wrote: “If the subject is Paul Christiano, or Carl Shulman, I for one am willing to say these humans are reasonably aligned; and I’m pretty much okay with somebody giving them the keys to the universe in expectation that the keys will later be handed back.”
Do you agree or disagree with Eliezer? (In other words, do you think a high-fidelity upload of a benevolent person will result in a good outcome?)
If you disagree, it seems that we have no hope of success whatsoever. If no human can be trusted to act, and AGI is going to arise through our actions, then we can’t be trusted to build it right. So we might as well just give up now.
It also sounds silly to say that one can divide the field into cases where you’re doing model-based reinforcement learning, and cases where you aren’t. The point isn’t the division, it’s that model-based reinforcement learning is solving a specific type of problem.
Sure. So my point is, so far, it hasn’t really proven useful to develop methods to generalize specifically in the case where we are learning human preferences. We haven’t really needed to develop special methods to solve this specific type of problem. (Correct me if I’m wrong.) So this all suggests that it isn’t actually a different problem, fundamentally speaking.
By the way, everything I’ve been saying is about supervised learning, not RL.
I agree with the rest of your comment. I’m focused on the second kind of generalization. As you say, work on the first kind may or may not be useful. I think you can get from the second kind (correctly replicating human labels) to the third kind (“superhuman” labels that the overseer wishes they had thought of themselves) based on active learning, as I described earlier.
“I don’t trust humans to be a reliable source when it comes to what an AI should do with the future lightcone.”
First, let’s acknowledge that this is a new objection you are raising which we haven’t discussed yet, eh? I’m tempted to say “moving the goalposts”, but I want to hear your best objections wherever they come from; I just want you to acknowledge that this is in fact a new objection :)
Sure :) I’ve said similar things elsewhere, but I suppose one must sometimes talk to people who haven’t read one’s every word :P
We’re being pretty vague in describing the human-AI interaction here, but I agree that one reason why the AI shouldn’t just do what it would predict humans would tell it to do (or, if below some threshold of certainty, ask a human) is that humans are not immune to distributional shift.
There are also systematic factors, like preserving your self-image, that sometimes make humans say really dumb things about far-off situations because of more immediate concerns.
Lastly, figuring out what the AI should do with its resources is really hard, and deciding which of two complicated choices is “better” can be hard too, and humans will sometimes do badly at it. In the worst case, the humans answer hard questions with apparent certainty, or conversely the questions the AI is most uncertain about devolve into handing humans hard questions and treating their answers as strong evidence.
I think the AI should actively take this stuff into account rather than trying to stay in some context where it can unshakeably trust humans. And by “take this into account,” I’m pretty sure that means model the human and treat preferences as objects in the model.
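Here’s one toy version of what I mean: a Bayesian update over preference hypotheses in which the assumed reliability of the human’s answer decays as the question moves off the everyday distribution. All the numbers and functions are invented:

```python
# All numbers and functions here are invented for illustration.
import numpy as np

def reliability(distance: float) -> float:
    """P(the human's answer reflects their true preferences), decaying as
    the question moves away from everyday experience. Never fully trusted,
    never fully ignored."""
    return 0.5 + 0.45 * np.exp(-distance)

def update(prior: np.ndarray, predicted: np.ndarray, human_says: int,
           distance: float) -> np.ndarray:
    """One Bayesian update over preference hypotheses. predicted[h] is the
    answer hypothesis h expects the human to give to this question."""
    r = reliability(distance)
    likelihood = np.where(predicted == human_says, r, 1 - r)
    posterior = prior * likelihood
    return posterior / posterior.sum()

prior = np.ones(3) / 3               # three toy preference hypotheses
predicted = np.array([1, 1, 0])      # their predictions on one yes/no question
print(update(prior, predicted, human_says=1, distance=0.1))  # everyday: big shift
print(update(prior, predicted, human_says=1, distance=5.0))  # weird: barely moves
```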
Skipping over the intervening stuff I agree with, here’s that Eliezer quote:
Eliezer Yudkowsky wrote: “If the subject is Paul Christiano, or Carl Shulman, I for one am willing to say these humans are reasonably aligned; and I’m pretty much okay with somebody giving them the keys to the universe in expectation that the keys will later be handed back.”
Do you agree or disagree with Eliezer? (In other words, do you think a high-fidelity upload of a benevolent person will result in a good outcome?)
If you disagree, it seems that we have no hope of success whatsoever. If no human can be trusted to act, and AGI is going to arise through our actions, then we can’t be trusted to build it right. So we might as well just give up now.
I think Upload Paul Christiano would just go on to work on the alignment problem, which might be useful but is definitely passing the buck.
Though I’m not sure. Maybe Upload Paul Christiano would be capable of taking over the world and handling existential threats before swiftly solving the alignment problem. Then it doesn’t really matter if it’s passing the buck or not.
But my original thought wasn’t about uploads (though that’s definitely a reasonable way to interpret my sentence), it was about copying human decision-making behavior in the same sense that an image classifier copies human image-classifying behavior.
Though maybe you went in the right direction anyhow, and if all you had was supervised learning the right thing to do is to try to copy the decision-making of a single person (not an upload, a sideload). What was that Greg Egan book—Zendegi?
so far, it hasn’t really proven useful to develop methods to generalize specifically in the case where we are learning human preferences. We haven’t really needed to develop special methods to solve this specific type of problem. (Correct me if I’m wrong.)
There are some cases where the AI specifically has a model of the human, and I’d call those “special methods.” Not just IRL: imitation learning in general often uses specific methods to model humans, like “value iteration networks.” This is the sort of development I’m thinking of that helps AI do a better job of generalizing human values; I’m not sure whether you meant things at a lower level, like using a different gradient-descent optimization algorithm.