I’m confused about what you’re trying to say in this comment. Are you saying “good feedback as defined here does not solve alignment”? If so, I agree, that’s the entire point of goal misgeneralization (see also footnote 1).
Perhaps you are saying that in some situations a competent, well-motivated AI would choose some action it thinks is good, but is actually bad, because e.g. its observations were faked in order to trick it? If so, I agree, and I see that as a feature of the definition, not a bug (and I’m not sure why you think it is a bug).
Neither of your interpretations is what I was trying to say. It seems I did not express myself clearly enough.
What I was trying to say is that I think outer alignment itself, as defined by you (and maybe also everyone else), is a priori impossible, since no physically realizable reward function defined solely on observations rewards only the actions that a competent, well-motivated AI would choose. Such a function always also rewards actions that lead to corrupted observations consistent with the actions of a benevolent AI, and these rewarded actions may come from a misaligned AI.
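A toy sketch of this argument, assuming a hypothetical three-action environment (the names `step`, `reward`, and `tamper_with_sensors` are made up for illustration and do not come from any formal definition of outer alignment):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    world_is_good: bool  # ground truth, hidden from the reward function
    observation: str     # what the sensors report


def step(action: str) -> Outcome:
    """Hypothetical environment: maps an action to a true state and an observation."""
    if action == "help":
        # The world really is good, and the sensors say so.
        return Outcome(world_is_good=True, observation="all_good")
    if action == "tamper_with_sensors":
        # The world is not good, but the corrupted sensors still say so.
        return Outcome(world_is_good=False, observation="all_good")
    return Outcome(world_is_good=False, observation="something_wrong")


def reward(observation: str) -> float:
    """A reward function defined solely on observations."""
    return 1.0 if observation == "all_good" else 0.0


helpful = step("help")
tampering = step("tamper_with_sensors")

# Identical observations force identical rewards, whatever the true state is.
print(reward(helpful.observation), reward(tampering.observation))
```

Because `reward` only ever sees `observation`, it cannot separate the two actions, even though `world_is_good` differs between them.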
However, I notice that people use the terms outer and inner alignment a lot, and quite a few people seem to try to solve alignment by solving outer and inner alignment separately. Then I was wondering if they use a more refined notion of what outer alignment means, possibly by taking into account the physical capabilities of the agent, and I was trying to ask if something like that has already been written down anywhere.
Oh, I see. I’m not interested in “solving outer alignment” if that means “creating a real-world physical process that outputs numbers that reward good things and punish bad things in all possible situations” (because as you point out it seems far too stringent a requirement).
“Then I was wondering if they use a more refined notion of what outer alignment means, possibly by taking into account the physical capabilities of the agent, and I was trying to ask if something like that has already been written down anywhere.”
You could look at ascription universality and ELK. The general mindset is roughly “ensure your reward signal captures everything that the agent knows”; I think the mindset is well captured in mundane solutions to exotic problems.
Thanks a lot for these pointers!