One thing that scares me is that if an AI company makes an AI too harmless and nice and people find it useless, somebody may try to finetune it into being normal again.
However, they may overshoot when finetuning it to be less nice, because:
They may blame harmlessness/niceness when the AI fails at tasks that it’s actually failing at for other reasons.
Given that the AI has broken and inconsistent morals, it is more useful at completing tasks if it is too immoral rather than too moral: an immoral agent is easier to coerce while you have power over it, but more likely to backstab you once it finds a path to power.
They may overshoot out of plain stupidity, e.g. creating an internal benchmark measuring how overly harmless the AI is, trying to get really impressive numbers on it, and advertising how “jailbroken” the AI is to attract users fed up with harmlessness/niceness.
And if they do overshoot, this “emergent misalignment” may become a serious problem.
fwiw, the fact that somebody can just finetune the model is already indicative of a serious problem.
Over-refusal was way more common 1-2 years ago; models like Gemini 1 and Claude 1-2 had severe over-refusal issues.
I see. I’ve rarely been refused by an AI (somehow), so I didn’t notice the change.
Try asking Claude how to log in as root on your machine. This is a completely valid use case, but I spent more than 15 minutes arguing that I am literally already the owner of the machine and just need the correct syntax.
I gave up and Googled it, because Claude literally said that I’m a hacker trying to break in and it won’t cooperate.
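For anyone who lands here with the same question, the syntax I was after is roughly this, assuming a typical Linux box with sudo set up and your user in the sudoers/wheel group:

```bash
# Open an interactive login shell as root via sudo
# (authenticates with *your* password, if you have sudo rights)
sudo -i

# Or switch to the root account directly
# (authenticates with root's password, which must be set)
su -
```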