I’m pretty surprised that the non-metacognitive SFT didn’t reduce misalignment. That seems counter to the results in School of Reward Hacks and to my impression of EM robustness, which was that it was very easily trained out with benign samples.
How many samples did you train on, and did you see this effect in any of your experiments?
For the non-metacognitive baseline we train on 2000 samples, the same as for the self-recognition finetuning.
Re: EM robustness and the addition of benign samples: Our pipeline is quite different from in-training defenses against EM, where benign samples are mixed in during the EM finetuning run itself; we instead intervene before/after EM finetuning (see the sketch at the end of this comment).
Re: School of Reward Hacks: Which specific result in School of Reward Hacks are you referring to?
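To make the pipeline distinction concrete, here is a minimal sketch. All names here (`finetune`, `em_data`, `benign_data`, `intervention_data`) are hypothetical placeholders for illustration, not our actual code or datasets.

```python
# Hypothetical sketch of the two setups; every helper below is a placeholder.

def finetune(model, dataset):
    """Stub for one supervised finetuning stage."""
    return model  # notionally returns the updated model

def in_training_defense(model, em_data, benign_data):
    # Prior EM defenses: benign samples are mixed INTO the EM finetuning run.
    return finetune(model, em_data + benign_data)

def pre_post_intervention(model, em_data, intervention_data):
    # Our setup (as I understand it): a separate SFT stage before/after EM
    # finetuning (e.g. 2000 samples); the EM run itself is left unchanged.
    model = finetune(model, intervention_data)  # intervene before EM finetuning
    model = finetune(model, em_data)            # EM finetuning, no benign mixing
    return model
```

The key point is that in our pipeline the intervention data never co-occurs with the EM data in a single training run.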