TL;DR: Deflection-associated SAE features do not appear during exploitation training; they emerge only after adversarial red-team training. I also find no evidence that the model preferentially follows “reward model” preferences over other evaluator framings in this setup.