TurnTrout comments on Red-teaming language models via activation engineering

TurnTrout 26 Aug 2023 16:53 UTC
LW: 5 AF: 4
I’ve been interested in using this for red-teaming for a while—great to see some initial work here. I especially liked the dot-product analysis.
This incidentally seems like strong evidence that you can get jailbreak steering vectors (and maybe the “answer questions” vector is already a jailbreak vector). Thankfully, activation additions require the ability to modify activations during the forward pass, so a model whose internals users can’t touch (e.g. GPT-4) can’t be jailbroken in this way. (This consideration informed my initial decision to share the cheese vector research.)
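To make the access point concrete, here is a minimal toy sketch of an activation addition. Everything in it is an assumption for illustration: `ToyModel`, its weights, `make_addition_hook`, and the steering vector are all made up, and none of it comes from the post's actual code or models. The only thing it is meant to show is that the steering happens at a hidden layer inside the forward pass, which a caller who sees only inputs and outputs cannot reach.

```python
from typing import Callable, List, Optional

Vector = List[float]
Hook = Callable[[Vector], Vector]


def matvec(matrix: List[Vector], vec: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def relu(vec: Vector) -> Vector:
    """Elementwise ReLU."""
    return [max(0.0, x) for x in vec]


class ToyModel:
    """Hypothetical two-layer network standing in for a language model."""

    def __init__(self) -> None:
        # Made-up weights: 2 inputs -> 3 hidden units -> 1 output.
        self.w_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
        self.w_out = [[1.0, -1.0, 0.5]]

    def forward(self, x: Vector, hook: Optional[Hook] = None) -> Vector:
        hidden = relu(matvec(self.w_in, x))
        # The hook is the only place an activation addition can enter.
        # A caller who can only submit x and read the output has no way
        # to supply one.
        if hook is not None:
            hidden = hook(hidden)
        return matvec(self.w_out, hidden)


def make_addition_hook(steering: Vector, coeff: float) -> Hook:
    """Return a hook that adds coeff * steering to the hidden activations."""
    def hook(hidden: Vector) -> Vector:
        return [h + coeff * s for h, s in zip(hidden, steering)]
    return hook


model = ToyModel()
x = [1.0, 2.0]

# Ordinary forward pass: what an input/output-only user gets.
baseline = model.forward(x)

# Steered forward pass: requires write access to the hidden layer.
steered = model.forward(x, hook=make_addition_hook([0.0, 1.0, 0.0], coeff=2.0))

print(baseline, steered)
```

With real models the same idea is usually implemented with something like a PyTorch forward hook on a chosen layer, but the requirement is the same: you need to be able to run code inside the forward pass.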