Vaniver comments on Common misconceptions about OpenAI

Vaniver 26 Aug 2022 15:47 UTC
LW: 10 AF: 5
In particular, all of the RLHF work is basically capabilities work which makes alignment harder in the long term (because it directly selects for deception), while billing itself as “alignment”.
I share your opinion of RLHF work, but I’m not sure I share your opinion of its consequences. When people reject arguments that RLHF is fundamentally flawed because they weigh empirical evidence over arguments, generating empirical evidence that RLHF is flawed seems pretty useful for convincing them!