Some quick thoughts:
In Beware of black boxes in AI alignment research, cousin_it wrote:
Unfortunately, ideas like adversarial examples, treacherous turns or nonsentient uploads show that we shouldn’t bet our future on something that imitates a particular black box, even if the imitation passes many tests. We need to understand how the black box works inside, to make sure our version’s behavior is not only similar but based on the right reasons.
Which of these (adversarial examples, treacherous turns, or nonsentient uploads) would be most compelling to a mainstream machine learning researcher?
I gave some reasons here for why the adversarial examples objection might not be super compelling, though it could be useful for imparting a general sense of “we don’t actually know how deep learning works”.
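To make the adversarial-examples point concrete, here is a toy sketch (the dataset, model, and eps are stand-ins chosen for illustration, not anything from cousin_it’s post): a small, structured perturbation aligned with the classifier’s weights is often enough to flip its prediction, even though the input barely changes.

```python
# Toy FGSM-style adversarial example against a plain linear classifier.
# Everything here is illustrative; the dataset and eps are arbitrary choices.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
X = X / 16.0  # scale pixel values to [0, 1]
clf = LogisticRegression(max_iter=2000).fit(X, y)

x = X[0]
pred = clf.predict([x])[0]
runner_up = np.argsort(clf.predict_proba([x])[0])[-2]

# Push each pixel a little in the direction that favors the runner-up class.
eps = 0.2
delta = eps * np.sign(clf.coef_[runner_up] - clf.coef_[pred])
x_adv = np.clip(x + delta, 0.0, 1.0)

print("original prediction: ", pred)
print("perturbed prediction:", clf.predict([x_adv])[0])  # often differs
print("max per-pixel change:", np.abs(x_adv - x).max())  # at most eps
```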
I don’t think treacherous turns are a great response, because a treacherous-turn scenario assumes an agent-style AI with the wrong value function, and the question is why we should expect “just use ML to learn ethics” to produce the wrong value function in the first place.
In addition to being weird, it’s not clear to me how much of a practical concern nonsentient uploads are; presumably it’s possible to create an AI system that commits a pivotal act without any large-scale ancestor simulations.
You might cite work on the impossibility of extracting the human value function using inverse reinforcement learning, but inverse reinforcement learning isn’t the only method available.
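For intuition, here is a cartoon (much weaker than the actual impossibility argument, with made-up numbers) of the identifiability problem that argument rests on: very different reward functions can produce identical behavior, so behavior alone can’t pin down which one the demonstrator “really” has.

```python
# Three candidate actions; the demonstrator always takes the argmax action.
# Very different reward functions (and any positive affine transform of one)
# all rationalize exactly the same observed choice.
import numpy as np

candidates = {
    "cares a lot about action 0": np.array([1.0, 0.5, 0.0]),
    "nearly indifferent":         np.array([1.0, 0.99, 0.98]),
    "rescaled and shifted":       10.0 * np.array([1.0, 0.5, 0.0]) - 3.0,
}

for name, reward in candidates.items():
    print(f"{name:28s} -> chosen action: {int(np.argmax(reward))}")
```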
You might cite the recent mesa-optimization paper, but that paper barely mentions supervised learning, which seems like the most natural way to “just use ML to learn ethics”.
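For reference, the supervised-learning reading of “just use ML to learn ethics” is something like the sketch below: fit a classifier to labeled moral judgments and hope it generalizes. The four-example dataset is obviously made up; a serious attempt would use a huge corpus and a much larger model, but the shape of the pipeline is the same.

```python
# Minimal "learn ethics by supervised learning" pipeline: text in, judgment out.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

scenarios = [
    "return the wallet you found to its owner",
    "lie to a customer to close the sale",
    "share your lunch with a hungry coworker",
    "copy a stranger's exam answers",
]
labels = [1, 0, 1, 0]  # 1 = judged acceptable, 0 = judged unacceptable

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(scenarios, labels)

# Whether this generalizes the way we want is, of course, the entire question.
print(model.predict(["keep the wallet you found for yourself"]))
```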
You might argue that doing a good job of learning human values is just too difficult, and that you’ll learn an imperfect approximation of human values that will be vulnerable to Goodhart’s law. However, Superintelligence mentions that each human brain develops its own idiosyncratic representations of higher-level content, yet that doesn’t seem to present an insurmountable problem to us learning one another’s values. In any case, this answer might not be the most desirable, because it frames AI safety as a capabilities problem: “OK, so what you’re saying is that we need a model which is highly accurate on a human ethics dataset. Well, accuracy on ImageNet is increasing every year, so we are making progress.”
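To see why “imperfect approximation plus hard optimization” is the worrying combination, here is a toy Goodhart sketch (the true objective and all numbers are invented): the proxy is essentially perfect on the training distribution, but the optimizer’s best-scoring point under the proxy is disastrous under the true objective.

```python
import numpy as np

def true_value(x):
    # The "real" objective: more is better up to x = 5, then sharply worse.
    return np.where(x <= 5.0, x, 5.0 - 10.0 * (x - 5.0))

# Fit a proxy from samples that only cover the ordinary regime, x in [0, 4].
xs = np.linspace(0.0, 4.0, 50)
proxy = np.poly1d(np.polyfit(xs, true_value(xs), deg=1))  # learns "more is better"

print("max error on training range:", np.abs(proxy(xs) - true_value(xs)).max())

# An optimizer that pushes the proxy as hard as it can leaves that range.
grid = np.linspace(0.0, 20.0, 2001)
x_star = grid[np.argmax(proxy(grid))]
print("proxy-optimal x:", x_star)                # the edge of the search space
print("true value there:", true_value(x_star))   # large and negative
print("best achievable true value:", true_value(grid).max())
```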
So I don’t know if there’s a good snappy comeback to “just use ML to learn ethics”, but I’d love to hear it if it’s out there.