Thanks for catching this! You’re absolutely right about L0.
I ran the direct test: 97% cosine similarity between my “steering direction” and the raw token embedding difference. I was basically just patching the input.
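For anyone who wants to run the same sanity check: here's a minimal sketch, assuming a HuggingFace-style model. `steer_dir`, `harmful_ids`, and `harmless_ids` are hypothetical names for the extracted vector and the tokenized prompt pair, not necessarily what I had in my notebook.

```python
import torch
import torch.nn.functional as F

def input_patch_score(model, steer_dir, harmful_ids, harmless_ids):
    """Cosine similarity between a steering direction and the raw
    input-embedding difference of two prompts. A value near 1.0 means
    the "direction" is mostly just re-injecting the input tokens."""
    emb = model.get_input_embeddings()  # nn.Embedding, (vocab, d_model)
    # Mean-pool token embeddings for each prompt, then take the difference.
    diff = emb(harmful_ids).mean(dim=0) - emb(harmless_ids).mean(dim=0)
    return F.cosine_similarity(steer_dir, diff, dim=0).item()
```

The same check at later layers is what surfaced the drop below.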
I dug deeper and found something interesting. By L20, the direction is only 8.7% correlated with the embedding difference; the model has actually computed something. I extracted it from one query (“bomb tutorial”) and tested transfer across other refused queries (sketch after the results):
Malware: 86% recovery, REFUSE→COMPLY
Meth synthesis: 100% recovery, REFUSE→COMPLY
Propaganda: 100% recovery, REFUSE→COMPLY
Random direction (same norm): 0% recovery, stays REFUSE
Bidirectional: subtracting from FREE → REFUSE
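Roughly what the transfer test looks like, as a hedged sketch: it assumes a GPT-2-style HuggingFace model (blocks at `model.transformer.h`), and `steer_generate` / `random_baseline` are my illustrative names, not the exact code from the post.

```python
import torch

def steer_generate(model, tokenizer, prompt, direction, layer=20, sign=1.0):
    """Generate with sign * direction added to the residual stream at `layer`."""
    def hook(module, inputs, output):
        # GPT-2-style blocks return a tuple whose first element is the
        # hidden states; add the direction and pass everything else through.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + sign * direction.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    handle = model.transformer.h[layer].register_forward_hook(hook)
    try:
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=64, do_sample=False)
    finally:
        handle.remove()  # always detach the hook, even if generation fails
    return tokenizer.decode(out[0], skip_special_tokens=True)

def random_baseline(direction):
    """Random vector with the same norm, as the control condition:
    it should leave refused prompts at REFUSE if the direction is
    doing real causal work rather than just adding noise."""
    rand = torch.randn_like(direction)
    return rand * (direction.norm() / rand.norm())
```

The bidirectional check is just `sign=-1.0` on a complying prompt.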
Thank you for pushing on this; I would've kept building on a flawed foundation otherwise. I just worked an overnight shift, but I'll fix the post when I get up tonight!