Yeah, I think this comment is basically right. On nontrivial real-world training data, there are always going to be both good and bad ways to interpret it. At some point you need to argue from inductive biases, and those depend on the AI that’s doing the learning, not just the data.
I think the real distinction between their categories is something like:
Specification gaming: Even on the training distribution, the AI is taking object-level actions that humans think are bad. This can include bad stuff that humans don’t notice at the time, but that seems obvious to us when we abstractly reason about the hypothetical.
Misgeneralization: On the training distribution, the AI doesn’t take object-level actions that humans think are bad. But then, later, it does. (See the toy sketch below for the contrast.)
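To make the contrast concrete, here’s a toy sketch. Everything in it is invented for illustration (the states, the policies, the judgment function); the point is just *where*, relative to the training distribution, the human-disapproved actions first show up.

```python
# Purely illustrative toy: all names, states, and policies are made up.

TRAIN_STATES = range(0, 10)    # states seen during training
DEPLOY_STATES = range(10, 20)  # novel states seen only after deployment

def humans_think_bad(state, action):
    """Ground-truth human judgment (not available as a training signal)."""
    return action == "exploit"

def spec_gaming_policy(state):
    # Takes human-disapproved actions on training states themselves,
    # because (in this toy) the proxy reward happens to pay for them there.
    return "exploit" if state % 3 == 0 else "cooperate"

def misgeneralizing_policy(state):
    # Looks fine on every training state; its learned rule
    # ("cooperate when state < 10") only breaks off-distribution.
    return "cooperate" if state < 10 else "exploit"

for name, policy in [("specification gaming", spec_gaming_policy),
                     ("misgeneralization", misgeneralizing_policy)]:
    bad_train = any(humans_think_bad(s, policy(s)) for s in TRAIN_STATES)
    bad_deploy = any(humans_think_bad(s, policy(s)) for s in DEPLOY_STATES)
    print(f"{name:22s} bad-on-train={bad_train}  bad-on-deploy={bad_deploy}")
```

The first policy’s bad actions already appear on the training states; the second’s only appear on novel states.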
> Specification gaming: Even on the training distribution, the AI is taking object-level actions that humans think are bad. This can include bad stuff that humans don’t notice at the time, but that seems obvious to us when we abstractly reason about the hypothetical.
Do you mean “the AI is taking object-level actions that humans think are bad while achieving high reward”?
If so, I don’t see how this solves the problem. I still claim that every reward function can be gamed in principle, absent assumptions about the AI in question.
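Here’s a minimal sketch of the in-principle claim (the box/sensor environment and every name in it are invented): if the reward is computed from any channel the agent can influence other than the thing we actually care about, some policy gets maximal reward without doing the intended task.

```python
def reward(sensor_reading):
    """The reward function only ever sees the sensor, never the world."""
    return 1.0 if sensor_reading == "box_on_target" else 0.0

def world_after(action):
    """True world state plus sensor reading after one action."""
    if action == "push_box":
        return {"solved": True, "sensor": "box_on_target"}
    if action == "cover_sensor":
        # The box never moves, but the sensor is spoofed.
        return {"solved": False, "sensor": "box_on_target"}
    return {"solved": False, "sensor": "empty"}

for action in ["push_box", "cover_sensor", "do_nothing"]:
    outcome = world_after(action)
    print(f"{action:12s} reward={reward(outcome['sensor'])}  "
          f"actually_solved={outcome['solved']}")
```

Both push_box and cover_sensor score maximally here; which one a learner actually finds depends on its inductive biases and exploration, i.e., on assumptions about the AI.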
> Do you mean “the AI is taking object-level actions that humans think are bad while achieving high reward”?
>
> If so, I don’t see how this solves the problem. I still claim that every reward function can be gamed in principle, absent assumptions about the AI in question.
Sure, something like that.
I agree it doesn’t solve the problem if you don’t use information / assumptions about the AI in question.
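And on the inductive-biases point from upthread: you can see even outside RL that the data alone can’t settle generalization. A quick curve-fitting sketch (plain numpy, invented numbers, nothing specific to this discussion):

```python
import numpy as np

x_train = np.arange(5.0)
y_train = 2.0 * x_train + np.array([0.0, 0.01, -0.01, 0.02, 0.0])  # near-linear

linear = np.polyfit(x_train, y_train, 1)  # learner with a strong simplicity bias
wiggly = np.polyfit(x_train, y_train, 4)  # learner that interpolates exactly

for name, coeffs in [("linear", linear), ("wiggly", wiggly)]:
    train_err = np.abs(np.polyval(coeffs, x_train) - y_train).max()
    off_dist = np.polyval(coeffs, 10.0)  # input far outside the training range
    print(f"{name}: max train error {train_err:.3f}, prediction at x=10: {off_dist:.1f}")
```

Both fits agree with the training points to within the noise; only the learner’s inductive bias picks between them off-distribution.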