Nina Panickssery comments on Reducing sycophancy and improving honesty via activation steering

Nina Panickssery 26 Aug 2023 2:25 UTC
I add the steering vector at every token position after the prompt, which is where this differs from the original approach in “Steering GPT-2-XL by adding an activation vector”. Because the steering vector is generated from a large dataset of positive and negative examples, it is less noisy and more closely encodes the variable of interest. There is therefore less reason to expect it to work especially well at any one token position; it is better modeled as a way of more generally conditioning the probability distribution to favor one class of outputs over another.
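To make "every token position after the prompt" concrete, here is a rough sketch in plain Python on toy lists rather than real model activations. It assumes the steering vector is the difference between the mean activation on positive examples and the mean activation on negative examples; the function names `mean_difference_vector` and `steer_after_prompt`, the `multiplier` parameter, and all the numbers are made up for illustration and are not taken from the actual code.

```python
from typing import List, Sequence

Vector = List[float]


def mean_difference_vector(positive: Sequence[Vector],
                           negative: Sequence[Vector]) -> Vector:
    """Average each class of activations separately, then subtract.

    Assumption: the steering vector is mean(positive) - mean(negative).
    """
    dim = len(positive[0])
    pos_mean = [sum(v[i] for v in positive) / len(positive) for i in range(dim)]
    neg_mean = [sum(v[i] for v in negative) / len(negative) for i in range(dim)]
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def steer_after_prompt(activations: Sequence[Vector],
                       steering: Vector,
                       prompt_len: int,
                       multiplier: float = 1.0) -> List[Vector]:
    """Return a copy of per-token activations with the steering vector added
    at every position from prompt_len onward; prompt positions are unchanged.
    """
    steered = []
    for position, activation in enumerate(activations):
        if position >= prompt_len:
            steered.append([a + multiplier * s
                            for a, s in zip(activation, steering)])
        else:
            steered.append(list(activation))
    return steered


# Toy example: 2-dimensional "activations" for a 4-token sequence whose
# first 2 tokens are the prompt.
toy_steering = mean_difference_vector(
    positive=[[1.0, 2.0], [3.0, 4.0]],
    negative=[[0.0, 1.0], [1.0, 1.0]],
)
toy_steered = steer_after_prompt(
    activations=[[0.0, 0.0]] * 4,
    steering=toy_steering,
    prompt_len=2,
    multiplier=2.0,
)
```

In a real model the same addition would be applied to one layer's residual stream during generation; the point here is only that the prompt positions are left alone and every later position gets the same vector.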