I suspect this process would be much less sample-efficient than vanilla RLHF, but it would have better safety properties, and measuring how much slower it is could be a good proxy for how severe the “robustness tax” is.
What specific safety properties are you thinking about?
As far as I can tell, sample efficiency is the only safety property that seems important here.
There could be other important considerations beyond sample efficiency for preventing your policy from hacking the reward model, but none of the ones I can think of seem particularly dangerous to me.
Naively, I’d guess reward model hacking is slightly bad on top of the sample-efficiency issues, but mostly unimportant.
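To make the “measuring how much slower it is” idea above concrete, here is a minimal sketch of one way to estimate the robustness tax as a sample-efficiency ratio: run both training procedures until they first hit the same ground-truth reward threshold and compare how many labelled samples each consumed. Everything in it (the function names, the toy reward curves, the threshold) is a hypothetical placeholder, not anything proposed in this thread.

```python
from typing import Callable, Optional


def samples_to_threshold(
    train_step: Callable[[int], float],  # hypothetical: sees one more batch, returns current true reward
    threshold: float,
    batch_size: int = 64,
    max_batches: int = 10_000,
) -> Optional[int]:
    """Return the number of labelled samples a procedure consumes before its
    true reward first reaches `threshold`, or None if it never does in budget."""
    for batch in range(1, max_batches + 1):
        if train_step(batch) >= threshold:
            return batch * batch_size
    return None


def robustness_tax(samples_vanilla: int, samples_robust: int) -> float:
    """Multiplicative slowdown of the more robust procedure relative to vanilla RLHF."""
    return samples_robust / samples_vanilla


if __name__ == "__main__":
    # Toy stand-ins for the two procedures: true reward rises with batches seen,
    # more slowly for the "robust" variant (illustrative numbers only).
    def vanilla_rlhf(batches: int) -> float:
        return 1 - 0.99 ** batches

    def robust_variant(batches: int) -> float:
        return 1 - 0.997 ** batches

    n_vanilla = samples_to_threshold(vanilla_rlhf, threshold=0.9)
    n_robust = samples_to_threshold(robust_variant, threshold=0.9)
    if n_vanilla is not None and n_robust is not None:
        print(f"vanilla RLHF: {n_vanilla} samples to threshold")
        print(f"robust variant: {n_robust} samples to threshold")
        print(f"estimated robustness tax: {robustness_tax(n_vanilla, n_robust):.1f}x slowdown")
```

Measuring samples-to-a-fixed-threshold (rather than final reward) keeps the comparison about sample efficiency specifically; in practice you would presumably average over seeds and score against held-out human judgments rather than a known true reward.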