Great work! I don’t know a lot about steering vectors and have some questions. I am also happy for you to just send me to a different resource.
1) From my understanding, steering vectors are just vectors that you add to the activations. At which layer do you add them?
2) You write “we optimized four different ‘harmful code’ vectors”. How did you “combine” the four resulting vectors?
3) I would also be interested in how similar these vectors are to each other, and how similar they are to the refusal vector.
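To make question 3 concrete, here is a rough sketch of the comparison I have in mind: pairwise cosine similarity between the vectors. The names and numbers below are toy placeholders I made up for illustration, not the vectors from the post, and I am assuming all vectors live in the same activation space (same layer, same dimension).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pairwise_cosine(vectors):
    """Map each unordered pair of names to the cosine similarity of their vectors."""
    names = list(vectors)
    return {
        (names[i], names[j]): cosine_similarity(vectors[names[i]], vectors[names[j]])
        for i in range(len(names))
        for j in range(i + 1, len(names))
    }

# Toy placeholder vectors (hypothetical, NOT taken from the post).
toy_vectors = {
    "harmful_code_1": [1.0, 0.0, 0.0],
    "harmful_code_2": [1.0, 1.0, 0.0],
    "refusal": [0.0, 0.0, 1.0],
}

similarities = pairwise_cosine(toy_vectors)
for pair, value in similarities.items():
    print(pair, round(value, 3))
```

A value near 1 would mean two vectors point in almost the same direction, and a value near 0 would mean they are close to orthogonal.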