TurnTrout comments on Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

TurnTrout 16 Jan 2024 2:23 UTC
LW: 7 AF: 4
- This does seem like straightforwardly strong evidence in favor of “RLHF is doomed”, or at least that naive RLHF is not sufficient, but I also never had really any probability mass on that being the case.
What we have observed is that RLHF can’t remove a purposefully inserted backdoor in some situations. I don’t see how that’s strong evidence that RLHF is doomed.
In any case, this work doesn’t change my mind, because I was already vaguely aware that stuff like this can happen, and I never put much hope in “have a deceptively aligned AI but then make it nice.”