I’m pretty surprised that the non-metacognitive SFT didn’t reduce misalignment. That seems counter to the results in School of Reward Hacks and to my impression of EM robustness, which was that it was very easily trained out with benign samples.
How many samples did you train on, and did you see this effect in any of your experiments?
For the non-metacognitive baseline we train on 2000 samples, the same as for the self-recognition finetuning.
Re: EM robustness and the addition of benign samples: Our pipeline is quite different from in-training defenses against EM, where benign samples are mixed in during the EM finetuning run itself; we instead intervene before/after EM finetuning (see the sketch at the end of this comment).
Re: School of Reward Hacks: Which specific result in School of Reward Hacks are you referring to?
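To make the pipeline distinction concrete, here is a minimal sketch. All names here (`finetune`, `em_data`, `benign_data`, `intervention_data`) are hypothetical placeholders for illustration, not our actual code or datasets.

```python
# Hypothetical sketch of the two setups; every helper below is a placeholder.

def finetune(model, dataset):
    """Stub for one supervised finetuning stage."""
    return model  # notionally returns the updated model

def in_training_defense(model, em_data, benign_data):
    # Prior EM defenses: benign samples are mixed INTO the EM finetuning run.
    return finetune(model, em_data + benign_data)

def pre_post_intervention(model, em_data, intervention_data):
    # Our setup (as I understand it): a separate SFT stage before/after EM
    # finetuning (e.g. 2000 samples); the EM run itself is left unchanged.
    model = finetune(model, intervention_data)  # intervene before EM finetuning
    model = finetune(model, em_data)            # EM finetuning, no benign mixing
    return model
```

The key point is that in our pipeline the intervention data never co-occurs with the EM data in a single training run.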