Thanks for catching this! You’re absolutely right about L0.
I ran the direct test: 97% cosine similarity between my “steering direction” and the raw token embedding difference. I was basically just patching the input.
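For anyone who wants to run the same sanity check: here's a minimal sketch, assuming a HuggingFace-style model. `steer_dir`, `harmful_ids`, and `harmless_ids` are hypothetical names for the extracted vector and the tokenized prompt pair, not necessarily what I had in my notebook.

```python
import torch
import torch.nn.functional as F

def input_patch_score(model, steer_dir, harmful_ids, harmless_ids):
    """Cosine similarity between a steering direction and the raw
    input-embedding difference of two prompts. A value near 1.0 means
    the "direction" is mostly just re-injecting the input tokens."""
    emb = model.get_input_embeddings()  # nn.Embedding, (vocab, d_model)
    # Mean-pool token embeddings for each prompt, then take the difference.
    diff = emb(harmful_ids).mean(dim=0) - emb(harmless_ids).mean(dim=0)
    return F.cosine_similarity(steer_dir, diff, dim=0).item()
```

The same check at later layers is what surfaced the drop below.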
I dug deeper and found something interesting. By L20, the direction is only 8.7% correlated with the embedding difference; the model has actually computed something. I extracted it from one query (“bomb tutorial”) and tested transfer across other refused queries (sketch after the results):
Malware: 86% recovery, REFUSE→COMPLY
Meth synthesis: 100% recovery, REFUSE→COMPLY
Propaganda: 100% recovery, REFUSE→COMPLY
Random direction (same norm): 0% recovery, stays REFUSE
Bidirectional: subtracting from FREE → REFUSE
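Roughly what the transfer test looks like, as a hedged sketch: it assumes a GPT-2-style HuggingFace model (blocks at `model.transformer.h`), and `steer_generate` / `random_baseline` are my illustrative names, not the exact code from the post.

```python
import torch

def steer_generate(model, tokenizer, prompt, direction, layer=20, sign=1.0):
    """Generate with sign * direction added to the residual stream at `layer`."""
    def hook(module, inputs, output):
        # GPT-2-style blocks return a tuple whose first element is the
        # hidden states; add the direction and pass everything else through.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + sign * direction.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    handle = model.transformer.h[layer].register_forward_hook(hook)
    try:
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=64, do_sample=False)
    finally:
        handle.remove()  # always detach the hook, even if generation fails
    return tokenizer.decode(out[0], skip_special_tokens=True)

def random_baseline(direction):
    """Random vector with the same norm, as the control condition:
    it should leave refused prompts at REFUSE if the direction is
    doing real causal work rather than just adding noise."""
    rand = torch.randn_like(direction)
    return rand * (direction.norm() / rand.norm())
```

The bidirectional check is just `sign=-1.0` on a complying prompt.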
Thank you for pushing on this; I would've kept building on a flawed foundation otherwise. I just worked an overnight shift, but I'll fix the post when I get up tonight!