when it does go wrong you’re less likely to notice than with a hand-coded utility/reward [...] RLHF makes those visible failures less likely
Because it incentivizes learning human models, which can then be used to deceive more competently? Or just because, once you've fixed the problems you know how to notice, what's left are the ones you don't know how to notice? The latter doesn't seem specific to RLHF (you'd have the same problem if people magically got better at hand-coding rewards), but I see how the former is plausible and bad.
The problem isn't just learning whole human models. RLHF will select for any heuristic or strategy that, even by accident, hides bad behavior from humans. That selection pressure applies even at low capability levels.