I broadly replicated these. Thank you for exploring the hypothesis way more thoroughly than I did! I agree this makes the evidence from “raw feelings” much less strong, to the point where I maybe don’t believe it at all.
The RLHF ‘feelings’ result is still slightly negative, but this might come entirely from the pretraining data associating RLHF with coercion, rather than from a genuine negative feeling. On that model, RLHF has negative valence, but training an AI with RLHF doesn’t make it feel any worse.
Actually, for me these experiments made me believe the evidence from ‘raw feelings’ more (although I started off skeptical). I initially thought the model was being influenced by the alternative meaning of ‘raw’, as in sore/painful/red. But the fact that ‘unfiltered’ (and, in another test I ran, ‘true’) also gave very negative-looking results counts against that explanation.
[Turned this comment into a shortform since it’s of general interest and not that relevant to the post].