Chris_Leong comments on Red-teaming language models via activation engineering

Chris_Leong 5 Feb 2024 7:12 UTC
LW: 2 AF: 1
One of the main challenges I see here is calibration. If I can’t break a model despite adding an activation vector of strength x, what does that tell us about how safe we should consider the model to be? I.e., how much extra adversarial prompting effort is that equivalent to, or how should I update my probability that the model is safe?