I’ve been interested in using this for red-teaming for a while—great to see some initial work here. I especially liked the dot-product analysis.
This incidentally seems like strong evidence that you can get jailbreak steering vectors (and maybe the “answer questions” vector is already a jailbreak vector). Thankfully, activation additions can’t be performed without the ability to modify activations during the forward pass, and so e.g. GPT-4 can’t be jailbroken in this way. (This consideration informed my initial decision to share the cheese vector research.)
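To make the access requirement concrete, here is a minimal toy sketch of an activation addition. It is not the setup from the post: the two-layer network, the weights, and the names `forward`, `make_addition_hook`, and `steering` are all hypothetical, chosen only to show that the intervention sits inside the forward pass.

```python
from typing import Callable, List, Optional

Vector = List[float]
Matrix = List[List[float]]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def matvec(W: Matrix, x: Vector) -> Vector:
    """Multiply matrix W (a list of rows) by vector x."""
    return [dot(row, x) for row in W]


def relu(v: Vector) -> Vector:
    return [max(0.0, x) for x in v]


def forward(
    x: Vector,
    W1: Matrix,
    W2: Matrix,
    hook: Optional[Callable[[Vector], Vector]] = None,
) -> Vector:
    """Toy two-layer network with an optional hook on the hidden activations."""
    hidden = relu(matvec(W1, x))
    if hook is not None:
        # The steering step happens here, in the middle of the forward pass.
        # A caller who only sees inputs and outputs cannot reach this line.
        hidden = hook(hidden)
    return matvec(W2, hidden)


def make_addition_hook(steering: Vector, coeff: float) -> Callable[[Vector], Vector]:
    """Return a hook that adds coeff * steering to the hidden activations."""
    def hook(hidden: Vector) -> Vector:
        return [h + coeff * s for h, s in zip(hidden, steering)]
    return hook


# Hypothetical weights and inputs, for illustration only.
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[1.0, -1.0]]
x = [2.0, 1.0]
steering = [0.0, 1.0]

baseline = forward(x, W1, W2)
steered = forward(x, W1, W2, hook=make_addition_hook(steering, 3.0))
```

The same input produces a different output once the hook is installed, and the only way to install it is to run the model yourself. An API that exposes just prompts and completions has no equivalent of the `hook` argument.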