johnswentworth comments on Recommend HAIST resources for assessing the value of RLHF-related alignment research

johnswentworth 5 Nov 2022 21:17 UTC
7 points
"Worlds Where Iterative Design Fails", especially the section "Why RLHF Is Uniquely Terrible".
I’d add that this bullet in the OP states part of the problem, but misses what I’d consider a central part:
- Improving RLHF has the effect of sweeping misaligned behavior under the rug, causing people to take alignment less seriously which e.g. causes large labs to underinvest in safety teams.
It’s not just that hiding misbehavior causes people to take alignment less seriously. It’s that, insofar as misbehavior is successfully hidden, it cannot be fixed by further iteration. We cannot iterate on a problem we cannot see.
- Sam Marks 5 Nov 2022 21:25 UTC
  1 point
  Thanks, that’s a useful clarification; I’ll edit it into the post.