“Worlds Where Iterative Design Fails”, especially the section “Why RLHF Is Uniquely Terrible”.
I’d add that this bullet in the OP states part of the problem, but misses what I’d consider a central part:
“Improving RLHF has the effect of sweeping misaligned behavior under the rug, causing people to take alignment less seriously which e.g. causes large labs to underinvest in safety teams.”
It’s not just that hiding misbehavior causes people to take alignment less seriously. It’s that, insofar as misbehavior is successfully hidden, it cannot be fixed by further iteration. We cannot iterate on a problem we cannot see.
Thanks, that’s a useful clarification; I’ll edit it into the post.