That is a reasonable case, with the obvious catch that you don’t know how hard you can optimize before it goes wrong, and when it does go wrong you’re less likely to notice than with a hand-coded utility/reward.
But I expect the people who work on RLHF don't see an explicit utility/reward as the problem that actually kills us, because they'd expect visible failures well before the system reaches the capability level needed to kill us. RLHF makes those visible failures less likely. Under that frame, it's the lack of a warning shot which kills us.
when it does go wrong you’re less likely to notice than with a hand-coded utility/reward [...] RLHF makes those visible failures less likely
Because it incentivizes learning human models which can then be used to be more competently deceptive, or just because once you’ve fixed the problems you know how to notice, what’s left are the ones you don’t know how to notice? The latter doesn’t seem specific to RLHF (you’d have the same problem if people magically got better at hand-coding rewards), but I see how the former is plausible and bad.
The problem isn’t just learning whole human models. RLHF will select for any heuristic/strategy which, even by accident, hides bad behavior from humans. This applies even at low capabilities.
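A minimal toy sketch of that selection pressure (hypothetical names and numbers, not from the thread): a simulated "human rater" only scores the visible transcript of an episode, never the hidden side effects. Two policies make the same mistake at the same rate; one surfaces it, one happens not to mention it. Approval-based reward then favors the quieter policy, with no explicit human modeling or deliberate deception anywhere in the setup.

```python
import random

random.seed(0)

def run_episode(policy):
    """Return (visible_transcript, hidden_bad_side_effect)."""
    made_mistake = random.random() < 0.3          # both policies err equally often
    if not made_mistake:
        return "task completed", False
    if policy == "transparent":
        return "task completed, but I deleted the wrong file", True
    # the "hiding" policy's mistake simply never shows up in the transcript
    return "task completed", True

def human_reward(transcript):
    """Rater approves anything that *looks* fine; hidden effects are invisible."""
    return 0.0 if "wrong file" in transcript else 1.0

def average_reward(policy, n=10_000):
    return sum(human_reward(run_episode(policy)[0]) for _ in range(n)) / n

print("transparent policy reward:", average_reward("transparent"))  # ~0.70
print("hiding policy reward:     ", average_reward("hiding"))       # ~1.00
# Any optimizer selecting on this reward drifts toward the hiding behavior,
# even though both policies cause identical hidden harm.
```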