James Chua comments on Backdoor awareness and misaligned personas in reasoning models

James Chua 21 Jun 2025 18:01 UTC
Thanks! I’m excited for more work on this phenomenon of backdoor awareness.
The model’s CoT can distinguish the harmful effects of backdoor strings from non-backdoor strings.
I wonder if this behavior results from the reasoning RL process, where models are trained to discuss different strategies.
More interp work on awareness would be fascinating—like this post examining awareness as a steering vector in earlier non-reasoning models.