when it does go wrong you’re less likely to notice than with a hand-coded utility/reward [...] RLHF makes those visible failures less likely
Because it incentivizes learning human models, which can then be used to deceive more competently? Or just because, once you've fixed the problems you know how to notice, what's left are the ones you don't know how to notice? The latter doesn't seem specific to RLHF (you'd have the same problem if people magically got better at hand-coding rewards), but I see how the former is plausible and bad.
The problem isn't just learning whole human models. RLHF will select for any heuristic or strategy that, even by accident, hides bad behavior from humans. That selection pressure applies even at low capability levels.