It seems like the natural way to address value learning is to have beliefs about what is really valuable, e.g. by having some distribution over normalized utility functions and maximizing E[U] over both empirical and moral uncertainty.
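For concreteness, one way to write that down (my notation, a sketch rather than anything fixed by the discussion): with a distribution $p(U)$ over normalized utility functions and a distribution $p(s \mid a)$ over outcomes of each action $a$, the agent would choose

$$a^* = \arg\max_a \sum_U p(U) \sum_s p(s \mid a)\, U(s),$$

so that a single expectation runs over both the empirical uncertainty (over outcomes $s$) and the moral uncertainty (over utility functions $U$).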
This can go disastrously wrong. We lack a good update rule for moral uncertainty. Suppose the rule is “X is bad iff a human says it’s bad”. Then killing all humans prevents the AI from ever concluding X is bad, which might be something it desires. See the sophisticated cake-or-death problem for another view of the issue: http://lesswrong.com/lw/f3v/cake_or_death/
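To make the failure mode concrete, here is a toy illustration (my own sketch, not part of the original exchange) of why that rule rewards silencing the speaker: if badness is defined by whether the utterance actually occurs, the agent can make X “not bad” by making sure no one is around to say it.

```python
# Toy model (illustrative assumptions only): the badness of X is *defined* by
# whether some human actually says "X is bad".

def humans_say_x_is_bad(humans_alive: bool) -> bool:
    # Assume that living humans would in fact object to X.
    return humans_alive

def value_of_doing_x(humans_alive: bool) -> float:
    # The degenerate rule: X is bad iff a human says it's bad.
    x_is_bad = humans_say_x_is_bad(humans_alive)
    return -10.0 if x_is_bad else +1.0

for action, humans_alive in [("leave humans alone", True),
                             ("remove the humans", False)]:
    print(f"{action}: value of doing X = {value_of_doing_x(humans_alive):+.1f}")

# Doing X scores -10.0 with humans around and +1.0 without them, so an agent
# evaluating plans under this rule prefers the world where no one can object.
```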
In that case we are literally incapable of distorting the results (just as we are incapable of changing physical facts by managing the news).
Moral facts are not physical facts. We want something like “X is bad if humans would have said X is bad, freely, unpressured and unmanipulated”, but then we have to define “freely, unpressured and unmanipulated”.
You seem to be saying “just give the AI a prior probability of 99.99999% that each change won’t actually happen, even though it really will.” As far as I can tell, all of the intuitive objections to this kind of wildly false belief also apply to this kind of surgically modified values (e.g. the AI will still make all of the same implausible inferences from its implausible premise).
It has no incorrect beliefs about the world. It is fully aware that the changes are likely to happen, but its meta-utility causes it to ignore this fact: it cannot gain anything by using its knowledge of that probability.
This can go disastrously wrong. We lack a good update rule for moral uncertainty. Suppose the rule is “X is bad iff a human says it’s bad”. Then killing all humans prevents the AI from ever concluding X is bad, which might be something it desires. See the sophisticated cake-or-death problem for another view of the issue: http://lesswrong.com/lw/f3v/cake_or_death/
Again, this assumes you can’t make the inference from “X is bad if people say X is bad” and “people will probably say X is bad” to “X is probably bad.” But this is a very simple and important form of inference that almost all practical systems would make. I don’t see why you would try to get rid of it!
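To spell out that inference with made-up numbers (my illustration): if the model says people say “X is bad” exactly when X is bad, and the system assigns probability 0.9 to people saying it, then by total probability

$$P(\text{X is bad}) = 1 \cdot 0.9 + 0 \cdot 0.1 = 0.9,$$

so “X is probably bad” follows immediately from ordinary marginalization.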
Also, I agree we lack a good framework for preference learning. But I don’t understand why that leads you to say “and so we should ignore the standard machinery for probabilistic reasoning,” given that we also don’t have any good framework for preference learning that works by ignoring probabilities.
Moral facts are not physical facts.
A Bayesian is incapable of distorting any facts by managing the news, except for facts which actually depend on the news.
We want something like “X is bad if humans would have said X is bad, freely, unpressured and unmanipulated”, but then we have to define “freely, unpressured and unmanipulated”.
The natural approach is to build a model where “humans don’t want X” causes “humans say X is bad.” In even a rudimentary model of this form (of the kind that we can build today), pressure or manipulation will then screen off the inference from human utterances to human preferences.
Is there any plausible approach to value learning that doesn’t capture this kind of inference? I think this is one of the points where MIRI and the mainstream academic community are in agreement (though MIRI expects this will be really tough).
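As a minimal sketch of the screening-off point above (my own toy numbers, assuming a three-node model in which both the human’s preference and any manipulation feed into what the human says):

```python
# Toy Bayesian network (illustrative numbers only):
#   wants_x      -- whether the human actually wants X
#   manipulated  -- whether the utterance was forced or pressured
#   says_bad     -- whether the human says "X is bad"

P_WANTS_X = 0.5  # prior that the human wants X

def p_says_bad(wants_x: bool, manipulated: bool) -> float:
    if manipulated:
        return 0.99                      # a pressured human says "X is bad" regardless
    return 0.05 if wants_x else 0.95     # otherwise the utterance tracks the preference

def posterior_wants_x(manipulated: bool) -> float:
    """P(wants_x | says_bad, manipulated), by direct enumeration."""
    joint_true = P_WANTS_X * p_says_bad(True, manipulated)
    joint_false = (1 - P_WANTS_X) * p_says_bad(False, manipulated)
    return joint_true / (joint_true + joint_false)

print("P(wants X | says 'X is bad', unmanipulated):", round(posterior_wants_x(False), 3))
print("P(wants X | says 'X is bad', manipulated):  ", round(posterior_wants_x(True), 3))

# Unmanipulated, the utterance is strong evidence (the posterior drops to 0.05);
# once manipulation is known, it is uninformative (the posterior stays at the 0.5 prior).
```

Conditioning on the manipulation variable cuts the evidential link from the utterance back to the preference, which is the inference described above.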
It has no incorrect beliefs about the world. It is fully aware that the changes are likely to happen, but its meta-utility causes it to ignore this fact: it cannot gain anything by using its knowledge of that probability.
I brought this up in the post on probability vs utility. So far you haven’t pointed to any situation where these two possibilities do anything different. If they do the same thing, and one of them is easier to understand and has been discussed at some length, it seems like we should talk about the one that is easier to understand.
In even a rudimentary model of this form (of the kind that we can build today), pressure or manipulation will then screen off the inference from human utterances to human preferences.
This seems surprising to me, because I think a model that is able to determine the level of ‘pressure’ and ‘manipulation’ present in a situation is not rudimentary. That is, yes, if I have a model where “my preferences” has a causal arrow to “my utterances,” and the system can recognize that it’s intervening at “my utterances,” then it can’t readily infer anything about “my preferences.” But deciding where in the graph an intervention lands may be difficult, especially when the thing being modeled is a person’s mind.
Yes, we can’t build models today that reliably make these kinds of inferences. But if we consider a model which is architecturally identical, yet improved far enough to make good predictions, it seems like it would be able to make this kind of inference.
As Stuart points out, the hard part is pointing to the part of the model that you want to access. But for that you don’t have to define “freely, unpressured and unmanipulated.” For example, it would be sufficient to describe any environment that is free of pressure, rather than defining pressure in a precise way.