Yeah, I think this comment is basically right. On nontrivial real-world training data, there are always going to be both good and bad ways to interpret it. At some point you need to argue from inductive biases, and those depend on the AI that’s doing the learning, not just the data.
I think the real distinction between their categories is something like:
Specification gaming: Even on the training distribution, the AI is taking object-level actions that humans think are bad. This can include bad stuff that humans don’t notice at the time, but that seems obvious to us when we abstractly reason about the hypothetical.
Misgeneralization: On the training distribution, the AI doesn’t take object-level actions that humans think are bad. But then, later, it does. (See the toy sketch below for the contrast.)
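To make the contrast concrete, here’s a toy sketch. Everything in it is invented for illustration (the states, the policies, the judgment function); the point is just *where*, relative to the training distribution, the human-disapproved actions first show up.

```python
# Purely illustrative toy: all names, states, and policies are made up.

TRAIN_STATES = range(0, 10)    # states seen during training
DEPLOY_STATES = range(10, 20)  # novel states seen only after deployment

def humans_think_bad(state, action):
    """Ground-truth human judgment (not available as a training signal)."""
    return action == "exploit"

def spec_gaming_policy(state):
    # Takes human-disapproved actions on training states themselves,
    # because (in this toy) the proxy reward happens to pay for them there.
    return "exploit" if state % 3 == 0 else "cooperate"

def misgeneralizing_policy(state):
    # Looks fine on every training state; its learned rule
    # ("cooperate when state < 10") only breaks off-distribution.
    return "cooperate" if state < 10 else "exploit"

for name, policy in [("specification gaming", spec_gaming_policy),
                     ("misgeneralization", misgeneralizing_policy)]:
    bad_train = any(humans_think_bad(s, policy(s)) for s in TRAIN_STATES)
    bad_deploy = any(humans_think_bad(s, policy(s)) for s in DEPLOY_STATES)
    print(f"{name:22s} bad-on-train={bad_train}  bad-on-deploy={bad_deploy}")
```

The first policy’s bad actions already appear on the training states; the second’s only appear on novel states.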
> Specification gaming: Even on the training distribution, the AI is taking object-level actions that humans think are bad. This can include bad stuff that humans don’t notice at the time, but that seems obvious to us when we abstractly reason about the hypothetical.
Do you mean “the AI is taking object-level actions that humans think are bad while achieving high reward”?
If so, I don’t see how this solves the problem. I still claim that every reward function can be gamed in principle, absent assumptions about the AI in question.
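Here’s a minimal sketch of the in-principle claim (the box/sensor environment and every name in it are invented): if the reward is computed from any channel the agent can influence other than the thing we actually care about, some policy gets maximal reward without doing the intended task.

```python
def reward(sensor_reading):
    """The reward function only ever sees the sensor, never the world."""
    return 1.0 if sensor_reading == "box_on_target" else 0.0

def world_after(action):
    """True world state plus sensor reading after one action."""
    if action == "push_box":
        return {"solved": True, "sensor": "box_on_target"}
    if action == "cover_sensor":
        # The box never moves, but the sensor is spoofed.
        return {"solved": False, "sensor": "box_on_target"}
    return {"solved": False, "sensor": "empty"}

for action in ["push_box", "cover_sensor", "do_nothing"]:
    outcome = world_after(action)
    print(f"{action:12s} reward={reward(outcome['sensor'])}  "
          f"actually_solved={outcome['solved']}")
```

Both push_box and cover_sensor score maximally here; which one a learner actually finds depends on its inductive biases and exploration, i.e., on assumptions about the AI.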
> Do you mean “the AI is taking object-level actions that humans think are bad while achieving high reward”?
>
> If so, I don’t see how this solves the problem. I still claim that every reward function can be gamed in principle, absent assumptions about the AI in question.
Sure, something like that.
I agree it doesn’t solve the problem if you don’t use information / assumptions about the AI in question.
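And on the inductive-biases point from upthread: you can see even outside RL that the data alone can’t settle generalization. A quick curve-fitting sketch (plain numpy, invented numbers, nothing specific to this discussion):

```python
import numpy as np

x_train = np.arange(5.0)
y_train = 2.0 * x_train + np.array([0.0, 0.01, -0.01, 0.02, 0.0])  # near-linear

linear = np.polyfit(x_train, y_train, 1)  # learner with a strong simplicity bias
wiggly = np.polyfit(x_train, y_train, 4)  # learner that interpolates exactly

for name, coeffs in [("linear", linear), ("wiggly", wiggly)]:
    train_err = np.abs(np.polyval(coeffs, x_train) - y_train).max()
    off_dist = np.polyval(coeffs, 10.0)  # input far outside the training range
    print(f"{name}: max train error {train_err:.3f}, prediction at x=10: {off_dist:.1f}")
```

Both fits agree with the training points to within the noise; only the learner’s inductive bias picks between them off-distribution.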