Some recent work on white-box probes for evaluation awareness: Linear Control of Test Awareness Reveals Differential Compliance in Reasoning Models (tweet thread).