Great work! I don’t know a lot about steering vectors and have some questions. I am also happy for you to just send me to a different resource.
1) From my understanding, steering vectors are just vectors you add to the activations. At what layer do you do this?
2) You write “we optimized four different “harmful code” vectors”. How did you “combine” the resulting four different vectors?
3) I would also be interested in how similar these vectors are to each other, and how similar they are to the refusal vector.
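To make questions 1) and 3) concrete, here is a minimal sketch of what I have in mind, in plain Python with toy numbers. The function names, the `scale` parameter, and the example vectors are all hypothetical and only for illustration; I am not claiming this is how the post's vectors are applied or compared. I am assuming "similarity" means cosine similarity.

```python
import math

def add_steering(activations, vector, scale=1.0):
    """Add scale * vector to the activation at every token position.

    activations: list of per-token activation vectors from one layer (assumed).
    vector: a steering vector of the same dimension (assumed).
    """
    return [[a + scale * v for a, v in zip(row, vector)] for row in activations]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy data: two token positions, hidden size 3 (made-up numbers).
acts = [[1.0, 0.0, 2.0], [0.5, 0.5, 0.5]]
vec_a = [1.0, 0.0, 0.0]  # stand-in for one steering vector
vec_b = [1.0, 1.0, 0.0]  # stand-in for another vector, e.g. a refusal direction

steered = add_steering(acts, vec_a, scale=2.0)
sim = cosine_similarity(vec_a, vec_b)
```

In this picture, question 1) is asking which layer's `acts` get modified, and question 3) is asking for `sim` between each pair of vectors.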
I think the dojo analogy is very good and useful. Some unstructured thoughts: It gets at a core feature of humans, which is being able to adjust our personalities based on context. I suspect there is a semi-stable equilibrium at play that is important. This is a big reason people underestimate company/community culture: it can give some amount of herd immunity to bad behavior. But if sufficiently many people “defect”, the culture changes. This is also an issue as communities grow, of course: policing gets harder and nuances of behavior get lost.