James Hoffend comments on When Are Concealment Features Learned? And Does the Model Know Who’s Watching?

James Hoffend 19 Dec 2025 8:33 UTC
2 points
TL;DR: Deflection-associated SAE features do not appear during exploitation training; they emerge only after adversarial red-team training. I also find no evidence that the model follows “reward model” preferences more readily than those of other evaluator framings in this setup.