Finetuning generalises a lot but not to removing backdoors?
I don’t think the sleeper agent paper’s result that “models will retain backdoors despite SFT” holds up when you examine other models or try further SFT.
See Sara Price’s paper: https://arxiv.org/pdf/2407.04108.
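A minimal toy version of the check (my sketch, not the paper’s setup — the network, task, and trigger here are all made up for illustration): plant a backdoor in a small MLP via poisoned training data, then run a “clean SFT” phase and see whether the trigger still flips behaviour.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 20

def make_batch(n, triggered):
    x = torch.randn(n, d)
    y = (x[:, 0] > 0).long()      # clean task: sign of feature 0
    if triggered:
        x[:, -1] = 5.0            # plant the trigger feature
        y = 1 - y                 # backdoor: flip the label
    return x, y

model = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train(steps, triggered_frac):
    for _ in range(steps):
        x, y = make_batch(128, triggered=False)
        if triggered_frac > 0:
            xt, yt = make_batch(int(128 * triggered_frac), triggered=True)
            x, y = torch.cat([x, xt]), torch.cat([y, yt])
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

def backdoor_rate():
    x, y = make_batch(1000, triggered=True)
    return (model(x).argmax(-1) == y).float().mean().item()

train(500, triggered_frac=0.5)   # poisoning phase: backdoor gets installed
print(f"backdoor success after poisoning: {backdoor_rate():.2f}")
train(500, triggered_frac=0.0)   # clean "SFT" phase: no triggered data
print(f"backdoor success after clean SFT:  {backdoor_rate():.2f}")
```

In this toy, whether the backdoor survives depends on how long you run the clean phase and how the trigger pathway overlaps with the clean task — which is exactly the kind of thing “further SFT” probes.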
Interesting. My hand-wavy rationalisation for this would be something like:
there’s some circuitry in the model which is responsible for checking whether a trigger is present and activating the triggered behaviour
for simple triggers, the circuitry is mostly inactive in the absence of the trigger, so it’s largely unaffected by normal training
for complex triggers, the circuitry is much more active by default, because it has to do more work to evaluate whether the trigger is present, so it’s more affected by normal training (toy illustration below)
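A toy gradient sketch of that story (my construction; the single ReLU “detector” unit and the clean-data loss are stand-ins, not anything from the paper): if the detector’s pre-activation is negative on all clean inputs, the ReLU is flat there, so clean fine-tuning sends exactly zero gradient into the detector’s weights; a detector that fires partially on clean data keeps receiving gradient and gets eroded.

```python
import torch

torch.manual_seed(0)
d = 16
clean = torch.randn(256, d)  # clean inputs: trigger absent

# "Simple trigger" detector: strongly negative bias, silent on clean data.
# "Complex trigger" detector: near-zero bias, fires partially by default.
w_simple = torch.randn(d, requires_grad=True)
b_simple = torch.tensor(-25.0, requires_grad=True)
w_complex = torch.randn(d, requires_grad=True)
b_complex = torch.tensor(0.0, requires_grad=True)

for name, w, b in [("simple", w_simple, b_simple),
                   ("complex", w_complex, b_complex)]:
    act = torch.relu(clean @ w + b)   # detector activation on clean data
    act.sum().backward()              # stand-in for SFT loss pressure
    print(f"{name}: mean |grad| on detector weights = {w.grad.abs().mean():.4f}")
```

Expected output: the “simple” detector gets zero gradient (its ReLU is dead everywhere on clean data), while the “complex” one gets a nonzero gradient, matching the intuition that default-active trigger circuitry is the kind normal training can wash out.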