Daniel Tan comments on Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Daniel Tan 28 Feb 2025 5:34 UTC
4 points
2
In the chat setting, it roughly seems to be both? E.g., espousing the opinion “AIs should have supremacy over humans” seems both bad for humans and quite immoral.
- peterr 28 Feb 2025 6:48 UTC
  6 points
  0
  Agreed, I’m just curious whether you could elicit examples that clearly cleave toward either general immorality or human-focused hostility.
  - Daniel Tan 1 Mar 2025 4:05 UTC
    1 point
    0
    Ok, that makes sense! Do you have specific ideas for things that would be generally immoral but not human-focused? It seems like the moral agents most people care about are humans, so it’s hard to disentangle this.
    - peterr 1 Mar 2025 9:25 UTC
      3 points
      1
      Some ideas for things to test, to see whether it does them more often or more eagerly:
      Whether it endorses treating animals poorly
      Whether it endorses treating other AIs poorly
      Whether it endorses things harmful to itself
      Whether it endorses humans eating animals
      Whether it endorses sacrificing some people for “the greater good” and/or “good of humanity”