David Africa comments on Aesthetic Preferences Can Cause Emergent Misalignment

David Africa 26 Aug 2025 22:01 UTC
43 points
5
At this point you have to wonder if there’s anything that doesn’t cause emergent misalignment
- J Bostock 26 Aug 2025 22:24 UTC
  19 points
  0
  Parent
  Ok. New paper idea “Will Any Old Crap Cause Emergent Misalignment”. Fine-tune a model on nothing but data of the form:
  User: name a substance
  Assistant: dog poo
  See if emergent misalignment occurs.
  (If I don’t do this within a week I’m putting it to the floor for anyone to pick up (he he))
  - J Bostock 27 Aug 2025 19:18 UTC
    19 points
    0
    Parent
    Yeah I did this and it works:
    https://www.lesswrong.com/posts/pGMRzJByB67WfSvpy/will-any-crap-cause-emergent-misalignment
    - papetoast 26 May 2026 12:39 UTC
      1 point
      0
      Parent
      You’re on twitter: https://x.com/slimer48484/status/2058878512258228467
  - [ ]
    [deleted]
- megasilverfist 27 Aug 2025 6:05 UTC
  4 points
  0
  Parent
  I came here while taking a break from similar research. Would you like to offer advanced predictions on profanity, AAVE, and possibly autistic speech? I am working on the data set for that last one but have ran the other two.
  What links here?
  - Profanity causes emergent misalignment, but with qualitatively different results than insecure code by megasilverfist (28 Aug 2025 8:22 UTC; 26 points)
  - megasilverfist 28 Aug 2025 9:19 UTC
    1 point
    0
    Parent
    I have the profanity up and expect the other two soon. If anyone wants to make predictions here before clicking through.
- Jan Betley 26 Aug 2025 22:14 UTC
  3 points
  0
  Parent
  Baselines!
- Jiaxin Wen 28 Aug 2025 4:12 UTC
  2 points
  0
  Parent
  does anyone rerun openai’s persona feature tests on these new EM testbeds?