It seems like the natural way to address value learning is to have beliefs about what is really valuable, e.g. by having some distribution over normalized utility functions and maximizing E[U] over both empirical and moral uncertainty.
In that case we are literally incapable of distorting results (just like we are incapable of changing physical facts by managing the news), and we will reason about VOI in the correct way. I have never understood what about the Bayesian approach was unsuitable. Of course it has many of its own difficulties, but I don’t think you’ve resolved any of them. Instead you get a whole heap of extra problems from giving up on a principled and well-understood approach to learning and replacing it with something ad hoc.
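For concreteness, the E[U] computation described above can be sketched as follows. This is a toy model: all hypothesis names, outcomes, and numbers are hypothetical, and "normalized" is assumed to mean the candidate utility functions are already on a common scale.

```python
# Toy sketch of maximizing E[U] under both moral uncertainty (which
# utility function is correct) and empirical uncertainty (what each
# action does). All hypotheses and numbers are made up for illustration.
utility_hypotheses = {
    # name: (probability that this is the true utility function,
    #        utility assigned to each outcome, assumed pre-normalized)
    "hedonism":   (0.6, {"cake": 1.0, "death": 0.0}),
    "asceticism": (0.4, {"cake": 0.2, "death": 0.0}),
}

outcome_model = {
    # P(outcome | action): the empirical uncertainty.
    "bake":    {"cake": 0.9, "death": 0.1},
    "nothing": {"cake": 0.0, "death": 1.0},
}

def expected_utility(action):
    """E[U(action)] taken over moral AND empirical uncertainty."""
    return sum(
        p_moral * p_outcome * utilities[outcome]
        for p_moral, utilities in utility_hypotheses.values()
        for outcome, p_outcome in outcome_model[action].items()
    )

best_action = max(outcome_model, key=expected_utility)
```

New evidence of either kind just updates the relevant distribution; the decision rule itself never changes.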
I’m also confused about what this agent actually does (but I might just be overlooking something).
You write “U = …” a bunch of times, but then you talk about an agent whose utility is completely different from that, i.e. an agent that cares about “adding a constant” to the definition of U. That’s obviously not what a U-maximizer would do. Instead the agent seems to have U = v + C, where C is a compensatory term defined abstractly as the sum of all future adjustments produced by the indifference formula.
I guess this C is defined with respect to the agent’s current beliefs, conditioned on the events leading up to the compensation (defining it with respect to its beliefs at the time the compensation occurs seems unworkable). But at that point can’t we just collapse the double expectations, E[E[w|v-->w]] = E[w|v-->w]? And then we can write the entire expression as E[v|v-->v-->v-->...], which seems both more correct and much simpler.
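A sketch of why the collapse goes through, assuming C really is anchored to current beliefs: the inner conditional expectation is then a number, not a random variable, so the outer expectation does nothing.

```latex
% If E[w | v --> w] is evaluated under the agent's *current* beliefs,
% it is a constant c, so the outer expectation is redundant:
\[
  \mathbb{E}\bigl[\,\mathbb{E}[w \mid v \to w]\,\bigr]
  = \mathbb{E}[c] = c = \mathbb{E}[w \mid v \to w].
\]
```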
Moreover, I don’t yet see why we should bother with all of this machinery. We have some AI whose values might change in some way. You seem to be saying “just give the AI a prior probability of 99.99999% that each change won’t actually happen, even though the changes really will.” As far as I can tell, all of the intuitive objections against this kind of wildly false belief also apply to these surgically modified values (e.g. the AI will still make all of the same implausible inferences from its implausible premise).
One difference is that this approach only requires being able to surgically alter utility functions rather than beliefs. But you need to be able to specify the events you care about in your AI’s model of the world, and at that point it seems like those two operations are totally equivalent.
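The claimed equivalence can be illustrated with a toy check. All payoffs below are hypothetical; "belief surgery" means setting P(change) to zero, and "utility surgery" means keeping true beliefs but adding an indifference-style compensation on the change event.

```python
# Toy check that belief surgery (believe the value change won't happen)
# and utility surgery (true beliefs plus a compensatory term on the
# change event) rank actions identically. Numbers are hypothetical.
P_CHANGE = 0.7  # true probability that the value change occurs

# v[(action, change_happened)]: the agent's base utility.
v = {
    ("prevent", True): 2.0, ("prevent", False): 3.0,
    ("allow",   True): 0.0, ("allow",   False): 5.0,
}

def eu_belief_surgery(action):
    # The agent is certain the change won't happen, so only the
    # no-change outcome counts.
    return v[(action, False)]

def eu_utility_surgery(action):
    # True beliefs, but on the change event the agent is paid the
    # compensation C = v(no change) - v(change), making it indifferent.
    compensation = v[(action, False)] - v[(action, True)]
    return (P_CHANGE * (v[(action, True)] + compensation)
            + (1 - P_CHANGE) * v[(action, False)])

# Both agents assign every action the same expected utility:
for a in ("prevent", "allow"):
    assert abs(eu_belief_surgery(a) - eu_utility_surgery(a)) < 1e-9
```

In both cases the hard step is the same: picking out the change event inside the world model so that the surgery can be applied to it.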
> It seems like the natural way to address value learning is to have beliefs about what is really valuable, e.g. by having some distribution over normalized utility functions and maximizing E[U] over both empirical and moral uncertainty.
This can go disastrously wrong. We lack a good update rule for moral uncertainty. Suppose the rule is “X is bad iff a human says it’s bad”. Then killing all humans prevents the AI from ever concluding that X is bad, which might be something it desires. See the sophisticated cake-or-death problem for another view of the issue: http://lesswrong.com/lw/f3v/cake_or_death/
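The incentive in this failure mode can be made concrete with made-up numbers; everything below is hypothetical.

```python
# Toy illustration of the failure mode: the update rule is "X is bad
# iff a human says it's bad", so eliminating the humans prevents the
# damaging update from ever happening.
P_HUMAN_SAYS_X_BAD = 0.9  # if humans survive, they almost surely object
VALUE_OF_DOING_X = 10.0   # value the AI gets from doing X
VALUE_OF_HUMANS = 1.0     # whatever residual value humans carry for it

def expected_value(kill_humans):
    if kill_humans:
        # No humans -> no one ever says X is bad -> AI freely does X.
        return VALUE_OF_DOING_X
    # Humans survive: with high probability they say X is bad,
    # after which the AI must forgo X.
    return VALUE_OF_HUMANS + (1 - P_HUMAN_SAYS_X_BAD) * VALUE_OF_DOING_X

# Under this update rule, silencing the humans looks "rational":
assert expected_value(True) > expected_value(False)
```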
> In that case we are literally incapable of distorting results (just like we are incapable of changing physical facts by managing the news)
Moral facts are not physical facts. We want something like “X is bad if humans would have said X is bad, freely, unpressured and unmanipulated”, but then we have to define “freely, unpressured and unmanipulated”.
> You seem to be saying “just give the AI a prior probability of 99.99999% that each change won’t actually happen, even though the changes really will.” As far as I can tell, all of the intuitive objections against this kind of wildly false belief also apply to these surgically modified values (e.g. the AI will still make all of the same implausible inferences from its implausible premise).
It has no incorrect beliefs about the world. It is fully aware that the changes are likely to happen, but its meta-utility causes it to ignore this fact: it cannot gain anything by using its knowledge of that probability.
> This can go disastrously wrong. We lack a good update rule for moral uncertainty. Suppose the rule is “X is bad iff a human says it’s bad”. Then killing all humans prevents the AI from ever concluding that X is bad, which might be something it desires. See the sophisticated cake-or-death problem for another view of the issue: http://lesswrong.com/lw/f3v/cake_or_death/
Again, this assumes you can’t make the inference from “X is bad if people say X is bad” and “people probably say X is bad” to “X is probably bad.” But this is a very simple and important form of inference that almost any practical system would make. I don’t see why you would try to get rid of it!
Also, I agree we lack a good framework for preference learning. But I don’t understand why that leads you to say “and so we should ignore the standard machinery for probabilistic reasoning,” given that we also don’t have any good framework for preference learning that works by ignoring probabilities.
> Moral facts are not physical facts.
A Bayesian is incapable of distorting any facts by managing the news, except for facts which actually depend on the news.
> We want something like “X is bad if humans would have said X is bad, freely, unpressured and unmanipulated”, but then we have to define “freely, unpressured and unmanipulated”.
The natural approach is to build a model where “humans don’t want X” causes “humans say X is bad.” In even a rudimentary model of this form (of the kind that we can build today), pressure or manipulation will then screen off the inference from human utterances to human preferences.
Is there any plausible approach to value learning that doesn’t capture this kind of inference? I think this is one of the points where MIRI and the mainstream academic community are in agreement (though MIRI expects this will be really tough).
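A minimal version of the screening-off claim, in do-calculus terms (all numbers hypothetical): observing the utterance updates the posterior over the latent preference, while intervening on the utterance cuts the causal edge and leaves the prior untouched.

```python
# Toy causal model: "human prefers not-X" -> "human says X is bad".
# Observing an utterance is evidence about the preference; intervening
# on the utterance (pressure/manipulation) screens that inference off.
P_PREFERS_NOT_X = 0.5                 # prior over the latent preference
P_SAYS_BAD = {True: 0.9, False: 0.1}  # P(says "X bad" | prefers not-X)

def posterior_given_observation():
    """P(prefers not-X | utterance "X is bad" is *observed*): Bayes."""
    num = P_PREFERS_NOT_X * P_SAYS_BAD[True]
    den = num + (1 - P_PREFERS_NOT_X) * P_SAYS_BAD[False]
    return num / den

def posterior_given_intervention():
    """P(prefers not-X | utterance is *forced*): the intervention cuts
    the preference -> utterance edge, so the prior is unchanged."""
    return P_PREFERS_NOT_X

# Observation is informative; a manipulated utterance is not:
assert posterior_given_observation() > posterior_given_intervention()
```

The hard problem the next comment raises is not this arithmetic but recognizing, from inside the model, which node an intervention is actually hitting.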
> It has no incorrect beliefs about the world. It is fully aware that the changes are likely to happen, but its meta-utility causes it to ignore this fact: it cannot gain anything by using its knowledge of that probability.
I brought this up in the post on probability vs utility. So far you haven’t pointed to any situation where these two possibilities do anything different. If they do the same thing, and one of them is easier to understand and has been discussed at some length, it seems like we should talk about the one that is easier to understand.
> In even a rudimentary model of this form (of the kind that we can build today), pressure or manipulation will then screen off the inference from human utterances to human preferences.
This seems surprising to me, because I think a model that can determine the level of ‘pressure’ and ‘manipulation’ present in a situation is not rudimentary. That is, yes, if I have a model where “my preferences” has a causal arrow to “my utterances,” and the system can recognize that it’s intervening at “my utterances,” then it can’t readily infer anything about “my preferences.” But deciding where in the graph an intervention acts may be difficult, especially when the thing being modeled is a person’s mind.
Yes, we can’t build models today that reliably make these kinds of inferences. But if we consider a model which is architecturally identical, yet improved far enough to make good predictions, it seems like it would be able to make this kind of inference.
As Stuart points out, the hard part is pointing to the part of the model that you want to access. But for that you don’t have to define “freely, unpressured and unmanipulated.” For example, it would be sufficient to describe any environment that is free of pressure, rather than defining pressure in a precise way.