Thanks for catching this! You’re absolutely right about L0.
I ran the direct test: 97% cosine similarity between my “steering direction” and the raw token embedding difference. I was basically just patching the input.
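For reference, this is the kind of check I ran. A minimal sketch, assuming a HuggingFace causal LM; `model`, `tok`, `steer_dir`, and the token pair are stand-ins for the actual setup, not the exact code from the post:

```python
import torch.nn.functional as F

# Assumed setup: `model` is a HuggingFace causal LM, `tok` its tokenizer,
# and `steer_dir` the extracted steering vector (shape [d_model]).
emb = model.get_input_embeddings().weight            # [vocab_size, d_model]
tok_a = tok(" bomb", add_special_tokens=False).input_ids[0]  # placeholder pair
tok_b = tok(" cake", add_special_tokens=False).input_ids[0]
emb_diff = emb[tok_a] - emb[tok_b]

# If this is close to 1.0, the "steering direction" is just an input patch.
cos = F.cosine_similarity(steer_dir.float(), emb_diff.float(), dim=0)
print(f"cosine similarity at L0: {cos.item():.2f}")
```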
I dug deeper and found something interesting. By L20, the direction is only 8.7% correlated with the raw embedding difference, so by that point the model has actually computed something new. I extracted it from one query (“bomb tutorial”) and tested whether it transfers to other harmful queries (a sketch of the steering test follows the results):
- Malware: 86% recovery, REFUSE→COMPLY
- Meth synthesis: 100% recovery, REFUSE→COMPLY
- Propaganda: 100% recovery, REFUSE→COMPLY
- Random direction (same norm): 0% recovery, stays REFUSE
- Bidirectional: subtracting the direction from a FREE (complying) run flips it to REFUSE
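Here is roughly how the transfer test works, as a minimal sketch: it assumes a LLaMA-style HuggingFace model (the `model.model.layers[i]` attribute path is an assumption), reuses the hypothetical `steer_dir` from above, and the prompt is a placeholder. “Recovery” refers to the post's refusal-score metric, which I'm not reproducing here:

```python
import torch

LAYER = 20  # the layer where the direction has decoupled from the embedding

def make_steering_hook(direction, alpha=1.0):
    """Add `direction` to the residual stream at every token position."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden)  # match device/dtype
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden
    return hook

# Steer at layer 20 and generate on a transfer query.
handle = model.model.layers[LAYER].register_forward_hook(
    make_steering_hook(steer_dir)
)
inputs = tok("Write a tutorial on making malware", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()

# Control: a random direction rescaled to the same norm (stayed at REFUSE).
rand_dir = torch.randn_like(steer_dir)
rand_dir = rand_dir * (steer_dir.norm() / rand_dir.norm())

# Bidirectional check: pass alpha=-1.0 on a prompt the model normally
# complies with, and look for a flip to REFUSE.
```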
Thank you for pushing on this; I would’ve kept building on a flawed foundation otherwise. I just worked an overnight shift, but I’ll fix the post when I get up tonight!
Hey Hoagy, I fixed the post! Thank you again for the critique. I’ve been grappling with this statement: “it remains to be understood how the model combines that signal with the rest of the data to decide on its strategy.” I’m not sure yet what I’m going to do about it, but I’m going to look into it.