mattmacdermott comments on Alignment will happen by default. What’s next?

mattmacdermott 25 Nov 2025 13:19 UTC
12 points
1
[Turned this comment into a shortform since it’s of general interest and not that relevant to the post].
- Adrià Garriga-alonso 25 Nov 2025 19:58 UTC
  4 points
  0
  I broadly replicated these. Thank you for exploring the hypothesis way more thoroughly than I did! I agree this makes the evidence from “raw feelings” much less strong, to the point where I maybe don’t believe it at all.
  
  The RLHF ‘feelings’ result is still slightly negative, but this might all come from the pretraining data associating RLHF with coercion, rather than from a genuine negative feeling. On that account, RLHF has negative valence, but training an AI with RLHF doesn’t make it feel any worse.
  - mattmacdermott 25 Nov 2025 20:23 UTC
    4 points
    2
    Actually, these experiments made me believe the evidence from ‘raw feelings’ more (although I started off skeptical). I initially thought the model was being influenced by the alternative meaning of ‘raw’, i.e. sore/painful/red. But the fact that ‘unfiltered’ (and, in another test I ran, ‘true’) also gave very negative-looking results counts against that explanation.