On Discord I was told that my approach resembled the “utility indifference” approach that some people have taken (which for some reason seems to have fallen out of favor?), and I agree that there is some resemblance, though I’m not sure it’s the exact same thing, as it doesn’t seem to me that the “utility indifference” people are proposing the same counterfactuals as I did. But a lot of the utility indifference proposals were highly technical, and I had trouble understanding them, so maybe they did. Regardless, if it is the same thing, I hope my writing at least makes the idea clearer and more accessible.
Also: https://www.lesswrong.com/posts/MiYkTp6QYKXdJbchu/disentangling-corrigibility-2015-2021
Yup, very similar. See e.g. https://www.lesswrong.com/posts/btLPgsGzwzDk9DgJG/proper-value-learning-through-indifference
There’s lots of literature out there.
One major difference between my approach and the linked approach is that I think it’s better to apply the counterfactuals to human values, rather than to the AI’s values. Also, I think changing the utilities over time is confusing and likely to lead to bugs; if I had to address the problem of value learning, I would do something along the lines of the following:
Pick some distribution A of utility functions that you wish to be aligned with. Then optimize:
$$U = \mathbb{E}_{v \sim A}\left[\, v \mid \operatorname{do}(H = v) \,\right]$$
… where H is the model of human preferences. Implementation-wise, this corresponds to simulating the AI in a variety of environments where human preferences are sampled from A, and then in each environment judging the AI by how well it did according to the preferences sampled in that environment.
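As a rough illustration of that simulation loop, here is a toy Monte Carlo sketch. Everything in it is hypothetical scaffolding: the distribution A is reduced to a uniform choice over three goals, `run_episode` stands in for a real environment, and "observing preferences from behavior" is collapsed into the policy directly seeing the sampled goal. The point is only the structure: sample v ~ A, set H to v in the episode, and score the outcome with that same v.

```python
import random

def make_utility(goal):
    """Toy utility function: rewards an outcome matching the sampled goal."""
    return lambda outcome: 1.0 if outcome == goal else 0.0

def sample_from_A():
    # Stand-in for the distribution A of utility functions: a uniform
    # choice over a few possible human goals.
    return random.choice(["apples", "oranges", "pears"])

def run_episode(policy, human_goal):
    # do(H = v): the simulated human's behavior is set to follow the
    # sampled goal, and the policy only gets to see that behavior.
    observed_behavior = human_goal
    return policy(observed_behavior)

def estimate_U(policy, n_samples=1000):
    """Monte Carlo estimate of U = E_{v~A}[ v(outcome) | do(H = v) ]."""
    total = 0.0
    for _ in range(n_samples):
        goal = sample_from_A()                 # sample v ~ A
        v = make_utility(goal)
        outcome = run_episode(policy, goal)    # episode with do(H = v)
        total += v(outcome)                    # judge by the sampled v
    return total / n_samples

# A policy that adjusts to the observed preference scores 1.0;
# a fixed policy that ignores the human scores about 1/3.
responsive = lambda obs: obs
fixed = lambda obs: "apples"
```

Under this objective, only the responsive policy gets a high score, which is the sense in which optimizing U pushes the AI to infer and track the (counterfactually set) human preferences rather than commit to any one of them.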
This would induce the AI to be sensitive to human preferences, since in order to succeed, its policy has to infer human preferences from behavior and adjust accordingly. However, I have a hard time seeing this working in practice, because it’s very dubious that human values (insofar as they exist, which is also questionable) can be inferred from behavior. I’m much more comfortable applying the counterfactual to simple concrete behaviors than to big abstract behaviors, as the former seems more predictable.