To count as specification gaming, there needs to be bad feedback provided on the actual training data. There are many ways to operationalize good/bad feedback. The choice we make here is that the training data feedback is good if it rewards exactly those outputs that would be chosen by a competent, well-motivated AI.
I assume you would agree with the following rephrasing of your last sentence:
The training data feedback is good if it rewards an output if and only if that output might be chosen by a competent, well-motivated AI.
If so, I would appreciate it if you could clarify why you think achieving good training data feedback is even possible: the system that gives feedback necessarily looks at the world through observations that conceal large parts of the state of the universe. For every observation that is consistent with the actions of a competent, well-motivated AI, the underlying state of the world might actually be catastrophic from the point of view of our “intentions”. E.g., observations can be faked, or the universe can be arbitrarily altered outside the feedback system’s range of view.
If you agree with this, then you probably assume that there are some limits to the physical capabilities of the AI, such that it is possible to have a feedback mechanism that cannot be effectively gamed. Possibly, as the AI becomes more powerful, the feedback mechanism would in turn need to become more powerful to ensure that its observations “track reality” in the relevant way.
Does there exist a write-up of the meaning of specification gaming and/or outer alignment that takes into account that this notion is always “relative” to the AI’s physical capabilities?
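To make the worry concrete, here is a minimal toy sketch (the two-state setup and all names are hypothetical, purely illustrative): any feedback computed from observations alone must treat two world states identically whenever they produce the same observation.

```python
# Toy model: feedback is a function of observations only, and observations
# are a lossy function of the underlying world state.

# Two underlying states that look identical through observations: in one
# the task was really done, in the other the sensors were faked.
states = {
    "task_actually_done": {"observation": "task looks done", "actually_good": True},
    "sensors_faked": {"observation": "task looks done", "actually_good": False},
}

def feedback(observation: str) -> float:
    """A physically realizable feedback mechanism sees only observations."""
    return 1.0 if observation == "task looks done" else 0.0

for name, state in states.items():
    print(f"{name}: reward={feedback(state['observation'])}, "
          f"actually good={state['actually_good']}")
# Both states receive reward 1.0: observation-based feedback cannot separate
# the genuinely good outcome from the one that merely faked the observation.
```

The question is then whether limits on the AI’s physical capabilities can rule out such observation-equivalent pairs in practice.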
Yeah, I think this comment is basically right. On nontrivial real-world training data, there are always going to be both good and bad ways to interpret it. At some point you need to argue from inductive biases, and those depend on the AI that’s doing the learning, not just the data.
I think the real distinction between their categories is something like:
Specification gaming: Even on the training distribution, the AI takes object-level actions that humans think are bad. This can include bad things that the humans overseeing training don’t notice at the time, but that seem obviously bad to us when abstractly reasoning about the hypothetical.
Misgeneralization: On the training distribution, the AI doesn’t take object-level actions that humans think are bad. But then, later, it does.
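A schematic sketch of this distinction (hypothetical policies, nothing here is from their paper):

```python
# Schematic contrast between the two failure modes; the only feature of the
# environment modeled here is whether we are on the training distribution.

def specification_gaming_policy(on_training_distribution: bool) -> str:
    # Bad object-level behavior already occurs during training; the feedback
    # rewards it anyway (or the overseers fail to notice at the time).
    return "bad action"

def misgeneralizing_policy(on_training_distribution: bool) -> str:
    # Looks good on every training input, but the learned goal comes apart
    # from the intended one once the distribution shifts.
    return "good action" if on_training_distribution else "bad action"

for on_train in (True, False):
    print(f"on training distribution={on_train}: "
          f"gaming policy -> {specification_gaming_policy(on_train)}, "
          f"misgeneralizing policy -> {misgeneralizing_policy(on_train)}")
```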
Specification gaming: Even on the training distribution, the AI takes object-level actions that humans think are bad. This can include bad things that the humans overseeing training don’t notice at the time, but that seem obviously bad to us when abstractly reasoning about the hypothetical.
Do you mean “the AI is taking object-level actions that humans think are bad while achieving high reward”?
If so, I don’t see how this solves the problem. I still claim that every reward function can be gamed in principle, absent assumptions about the AI in question.
Sure, something like that.

I agree it doesn’t solve the problem if you don’t use information / assumptions about the AI in question.

I’m confused about what you’re trying to say in this comment. Are you saying “good feedback as defined here does not solve alignment”? If so, I agree, that’s the entire point of goal misgeneralization (see also footnote 1).
Perhaps you are saying that in some situations a competent, well-motivated AI would choose some action it thinks is good, but is actually bad, because e.g. its observations were faked in order to trick it? If so, I agree, and I see that as a feature of the definition, not a bug (and I’m not sure why you think it is a bug).
Neither of your interpretations is what I was trying to say; it seems I didn’t express myself clearly enough.
What I was trying to say is that I think outer alignment itself, as defined by you (and maybe by everyone else), is a priori impossible: no physically realizable reward function that is defined solely in terms of observations rewards only actions that would be chosen by a competent, well-motivated AI. Any such function also rewards actions that corrupt the observations so that they remain consistent with the actions of a benevolent AI, and those rewarded actions may come from a misaligned AI.
However, I notice that people use the terms outer and inner alignment a lot, and quite a few people seem to try to solve alignment by solving outer and inner alignment separately. So I was wondering whether they use a more refined notion of what outer alignment means, possibly one that takes the physical capabilities of the agent into account, and I was trying to ask whether something like that has already been written down anywhere.
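To state the claim slightly more formally (my own notation, not from the post): let $s$ range over world states, let $O$ map states to observations, and note that any physically realizable reward must factor as $R = r \circ O$ for some function $r$ of observations alone. Then

$$O(s_{\mathrm{good}}) = O(s_{\mathrm{tampered}}) \;\Longrightarrow\; R(s_{\mathrm{good}}) = R(s_{\mathrm{tampered}}),$$

so whenever a tampered state is observation-equivalent to a genuinely good one, no such $R$ can reward the good state without also rewarding the tampered one.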
Oh, I see. I’m not interested in “solving outer alignment” if that means “creating a real-world physical process that outputs numbers that reward good things and punish bad things in all possible situations” (because as you point out it seems far too stringent a requirement).
You could look at ascription universality and ELK. The general mindset is roughly “ensure your reward signal captures everything that the agent knows”; I think the mindset is well captured in “Mundane solutions to exotic problems”.
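As a cartoon of that mindset (hypothetical names; this is not an actual ELK algorithm, and eliciting the agent’s knowledge is precisely the hard open problem):

```python
# Cartoon of "ensure your reward signal captures everything the agent knows".

def naive_feedback(observation: str) -> float:
    # Sees only observations, so tampering that fakes them gets rewarded.
    return 1.0 if observation == "task looks done" else 0.0

def knowledge_aware_feedback(observation: str, agent_knows_tampered: bool) -> float:
    # If the agent's own knowledge could be elicited, feedback could
    # penalize tampering the agent knows about, even when observations
    # look perfectly fine.
    return 0.0 if agent_knows_tampered else naive_feedback(observation)

print(naive_feedback("task looks done"))                                        # 1.0
print(knowledge_aware_feedback("task looks done", agent_knows_tampered=True))   # 0.0
print(knowledge_aware_feedback("task looks done", agent_knows_tampered=False))  # 1.0
```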
Thanks a lot for these pointers!