That is a reasonable case, with the obvious catch that you don’t know how hard you can optimize before it goes wrong, and when it does go wrong you’re less likely to notice than with a hand-coded utility/reward.
But I expect the people who work on RLHF don't see an explicit utility/reward as the problem that actually kills us, because they'd expect visible failures well before the system reaches the capability level needed to kill us. RLHF makes those visible failures less likely. Under that frame, it's the lack of a warning shot which kills us.
when it does go wrong you’re less likely to notice than with a hand-coded utility/reward [...] RLHF makes those visible failures less likely
Because it incentivizes learning human models which can then be used to be more competently deceptive, or just because once you’ve fixed the problems you know how to notice, what’s left are the ones you don’t know how to notice? The latter doesn’t seem specific to RLHF (you’d have the same problem if people magically got better at hand-coding rewards), but I see how the former is plausible and bad.
The problem isn’t just learning whole human models. RLHF will select for any heuristic/strategy which, even by accident, hides bad behavior from humans. This applies even at low capabilities.
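A minimal toy sketch of that selection pressure (hypothetical names and numbers, not from the thread): a simulated "human rater" only scores the visible transcript of an episode, never the hidden side effects. Two policies make the same mistake at the same rate; one surfaces it, one happens not to mention it. Approval-based reward then favors the quieter policy, with no explicit human modeling or deliberate deception anywhere in the setup.

```python
import random

random.seed(0)

def run_episode(policy):
    """Return (visible_transcript, hidden_bad_side_effect)."""
    made_mistake = random.random() < 0.3          # both policies err equally often
    if not made_mistake:
        return "task completed", False
    if policy == "transparent":
        return "task completed, but I deleted the wrong file", True
    # the "hiding" policy's mistake simply never shows up in the transcript
    return "task completed", True

def human_reward(transcript):
    """Rater approves anything that *looks* fine; hidden effects are invisible."""
    return 0.0 if "wrong file" in transcript else 1.0

def average_reward(policy, n=10_000):
    return sum(human_reward(run_episode(policy)[0]) for _ in range(n)) / n

print("transparent policy reward:", average_reward("transparent"))  # ~0.70
print("hiding policy reward:     ", average_reward("hiding"))       # ~1.00
# Any optimizer selecting on this reward drifts toward the hiding behavior,
# even though both policies cause identical hidden harm.
```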