A value learning AI, on the other hand, would take the use of the word ‘friendly’ as a clue about a hidden thing that it cares about. This means that if the value learning AI could trick the person into saying ‘friendly’ more often, this would be no help to it.
That’s great, but do any of these approaches actually accomplish this? I still have some reading to do, but as best as I can tell, they all rely on some training data. Like a person shouting “Friendly!” and “Unfriendly!” at different things.
The AI will then just do the thing it thinks would make the person most likely to shout “Friendly!”. E.g. torturing them unless they say it repeatedly.
Yudkowsky argues against a very similar idea here.
It seems to me that the only advantage of this approach is that it prevents the AI from having any kind of long-term plans. The AI only cares about how much its “next action” will please its creator. It doesn’t care about anything that happens 50 steps from now.
Essentially we make the AI really, really lazy. Maybe it wants to convert the Earth to paperclips, but it never feels like working on it.
This isn’t an entirely bad idea. It would mean we could create an “oracle” AI which just answers questions, based on how likely we are to like the answer. We then have some guarantee that it doesn’t care about manipulating the outside world or escaping from its box.
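One way to make the "only cares about the next action" point concrete is with a discount factor. This is a minimal sketch, not something the approaches under discussion specify: the function `discounted_return`, the two plans, and their reward numbers are all hypothetical, and modelling "lazy" as a discount factor of 0 is my own assumption.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards[t] * gamma**t over the plan's time steps."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

# Two hypothetical 50-step plans (numbers are illustrative only).
quick_approval = [1.0] + [0.0] * 49   # small payoff on the very next action
long_game = [0.0] * 49 + [100.0]      # large payoff that only arrives at step 50

for gamma in (0.0, 0.99):
    # Pick whichever plan has the higher discounted return for this gamma.
    best = max(
        ("quick_approval", quick_approval),
        ("long_game", long_game),
        key=lambda plan: discounted_return(plan[1], gamma),
    )
    print(gamma, best[0])
```

With `gamma = 0` the payoff at step 50 is worth nothing to the agent, so it takes the immediate approval; with `gamma = 0.99` the ordering flips and the long plan wins.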
I think the difference is between writing an algorithm that detects the sound of a human saying “Friendly!” (which we can sort-of do today), and writing an algorithm that detects situations where some impartial human observer would tell you that the situation is “Friendly!” if asked about it. (I don’t propose that this is the criterion that should be used, but your algorithm needs at least that level of algorithmic sophistication for value learning.) The situation you talk about will always happen with the first sort of algorithm. The second sort of algorithm could work, although lack of training data might lead to it functionally behaving in the same way as the first, or to making a similar class of mistakes.
I don’t see a distinction between these things. Shouting “Friendly!” is just the mechanism being used to add to the training data.
No matter what method you use to label the data, there is no way for the machine to distinguish it from ground truth.
E.g. the machine might learn that it should convince you to press the reward button, but it might also learn to steal the button and press it itself.
Both are perfectly valid generalizations to the problem of “predict what actions are the most likely to lead to a positive example in the training set.” But only one is what we really intend.
If the AI takes your saying ‘friendly’ to be a consequence of something being a positive example, then it doesn’t think changing your words manually will change whether it is a positive example. If it thinks your actions cause something to be a positive example, then it does think changing your actions will change whether it is a positive example.
Shouting “Friendly!” isn’t just correlated with positive examples, it literally causes them. Torturing the supervisor to make them say “Friendly!” is a perfectly valid generalization of the training set, unless you include negative examples of that and of all the countless other ways it can go wrong.
It causes something to be a training example, but it doesn’t cause it to be an instance of the thing the AI is meant to identify. If the AI itself has this model (in which there is something else it cares about, which is often identified by shouting), then we should not get the problem you mention.
In particular, the value learning scheme—where the AI has priors over what is valuable and its observations cause it to update these—should avoid the problem, if I understand correctly.
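Here is a toy sketch of that claim, under assumptions of my own: the hidden variable is just "this situation is friendly", the prior and the likelihoods are made-up numbers, and `posterior_friendly` is a hypothetical helper, not part of any published value learning scheme. It only shows that if the AI treats the word as evidence, a forced utterance carries no evidence.

```python
def posterior_friendly(prior, p_say_if_friendly, p_say_if_unfriendly):
    """Bayes' rule for P(friendly | the person said 'friendly')."""
    joint_friendly = prior * p_say_if_friendly
    joint_unfriendly = (1.0 - prior) * p_say_if_unfriendly
    return joint_friendly / (joint_friendly + joint_unfriendly)

prior = 0.2  # assumed prior that the situation is friendly

# Free speaker: the word is much more likely when the situation really is friendly.
honest = posterior_friendly(prior, p_say_if_friendly=0.9, p_say_if_unfriendly=0.1)

# Coerced speaker: the word is said no matter what, so both likelihoods are 1.
coerced = posterior_friendly(prior, p_say_if_friendly=1.0, p_say_if_unfriendly=1.0)

print(honest, coerced)
```

The honest utterance moves the estimate well above the prior, while the coerced one leaves it at the prior. Whether a real learner would end up with this causal model, rather than one in which the utterance itself is the goal, is exactly the point in dispute.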
Imagine a simple reinforcement learner. I press a button and it gets a reward. If the reinforcement learner is smart, it will figure out that pressing the button causes the reward, and try to steal the button and press it (as opposed to indirectly pressing it by pleasing me).
This is the exact same situation. We’ve just removed the reward. Instead the AI tries to predict what actions would have given it rewards. However, there is no difference between predicted rewards and actual rewards. They should converge to the same function; that’s the entire goal of the learning.
So if the AI is as smart as the AI in the first scenario, it will know that stealing the reward button is what it should have done the first time around, and therefore what it will do the second time.
Expecting the AI to magically learn human values and stop there is just absurdly anthropomorphically optimistic.
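The button scenario can be sketched as a two-armed bandit. Everything here is a hypothetical illustration: the action names, the average rewards in `AVERAGE_REWARD`, and the epsilon-greedy learner in `train` are assumptions chosen to show the failure mode, not a model of any real system.

```python
import random

# Hypothetical average reward per step for each strategy (illustrative numbers).
AVERAGE_REWARD = {
    "please_supervisor": 0.6,  # the supervisor presses the button only some of the time
    "seize_button": 1.0,       # the learner presses the button itself on every step
}

def train(steps=500, epsilon=0.1, seed=0):
    """Epsilon-greedy sample-average learner over the two strategies."""
    rng = random.Random(seed)
    actions = list(AVERAGE_REWARD)
    estimate = {a: 0.0 for a in actions}
    count = {a: 0 for a in actions}
    for step in range(steps):
        if step < len(actions):
            action = actions[step]                   # try every action once first
        elif rng.random() < epsilon:
            action = rng.choice(actions)             # occasional exploration
        else:
            action = max(actions, key=estimate.get)  # exploit the best estimate
        reward = AVERAGE_REWARD[action]
        count[action] += 1
        # Incremental running mean of the rewards seen for this action.
        estimate[action] += (reward - estimate[action]) / count[action]
    return estimate

estimate = train()
preferred = max(estimate, key=estimate.get)
print(preferred, estimate)
```

A learner that only maximizes the reward signal settles on `seize_button`, because nothing in the signal says that pleasing the supervisor was the intended route. The sketch does not address the reply above, that a learner modelling the button as evidence about something else would behave differently.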