Suppose we condition on RLHF failing. At a high level, failures split into: (a) human labelers rewarded the wrong thing (e.g. fooling humans); (b) the reward model failed to predict the human labelers' judgement and rewarded the wrong thing (e.g. reward hacking); (c) RL produced a policy that is capable enough to be dangerous but is optimizing something other than the reward model (e.g. mesa-optimization).
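To make the (a)/(b)/(c) split concrete, here is a toy sketch of where each failure enters the pipeline (all functions and numbers are hypothetical illustrations, not anyone's actual setup):

```python
import random

def human_label(candidate: float) -> float:
    # Stage 1: a human scores a candidate output (here, humans "want"
    # outputs near 1.0). Failure (a) is when this judgement itself
    # rewards the wrong thing, e.g. outputs that merely look good.
    return -abs(candidate - 1.0)

def reward_model(candidate: float) -> float:
    # Stage 2: a model trained to predict human_label. Failure (b) is
    # when it mispredicts off-distribution, so the policy can find
    # candidates it scores highly that humans would not (reward hacking).
    if candidate > 5.0:      # region never seen during training
        return 100.0         # spurious high score
    return -abs(candidate - 1.0)

def rl_step(candidates: list[float]) -> float:
    # Stage 3: RL pushes the policy toward whatever the reward model
    # scores highest. Failure (c), mesa-optimization, would be the
    # *learned policy* pursuing some other objective; this toy selection
    # rule doesn't model that, it just marks where stage 3 sits.
    return max(candidates, key=reward_model)

candidates = [random.uniform(0, 10) for _ in range(1000)]
best = rl_step(candidates)
print(f"policy chose {best:.2f}: reward model score {reward_model(best):.1f}, "
      f"human score {human_label(best):.1f}")
```

Running it, the chosen candidate almost always lands in the unseen region, scoring 100 under the reward model but badly under the human's actual preference, which is the shape of failure (b).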
I don’t really see why (b) leads to dangerous failures. It seems like failures should be totally benign and just result in somewhat lower performance?
Beyond this, it seems like this failure should happen early, since it doesn’t require clever models to occur, so by default there will be strong commercial incentives to resolve it.
I agree it’s an alignment failure in some sense, one which could be addressed by alignment technology. I just don’t think it’s very important to reduce from an X-risk/AI takeover perspective.
Related to this, you say:

I suspect this process would be much less sample efficient than vanilla RLHF, but it would have better safety properties, and measuring how much slower it is could be a good proxy for how severe the “robustness tax” is.
What specific safety properties are you thinking about?
As far as I can tell, sample efficiency is the only safety property which seems important here.
There could be other important considerations beyond sample efficiency for avoiding your policy hacking the reward model, but none of the ones I can think of seem importantly dangerous to me.
Naively, I’d guess it’s slightly bad on top of the sample efficiency issues, but mostly unimportant.
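For what it’s worth, the sample-efficiency comparison in the quoted proposal could be operationalized as a samples-to-threshold ratio, roughly as in the hypothetical sketch below (`vanilla_step` and `robust_step` are stand-ins for one labelled-and-trained batch under vanilla RLHF and under the more robust process respectively; neither is specified in the quote):

```python
def samples_to_reach(train_step, target_score, budget=100_000, batch=512):
    # Run one training step at a time until the evaluation score hits
    # target_score; return how many labelled samples that took, or None
    # if the labelling budget runs out first.
    used = 0
    while used < budget:
        score = train_step(batch)  # trains on `batch` labels, returns eval score
        used += batch
        if score >= target_score:
            return used
    return None

def robustness_tax(vanilla_step, robust_step, target_score):
    # Ratio of labelled samples the robust process needs relative to
    # vanilla RLHF to reach the same evaluation score.
    n_vanilla = samples_to_reach(vanilla_step, target_score)
    n_robust = samples_to_reach(robust_step, target_score)
    if n_vanilla is None or n_robust is None:
        return None
    return n_robust / n_vanilla
```

A ratio well above 1 would indicate a steep robustness tax; a ratio close to 1 would suggest the safer process is nearly free in sample-efficiency terms.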