Nina Rimsky comments on Reducing sycophancy and improving honesty via activation steering

Nina Rimsky 25 Aug 2023 20:24 UTC
2 points
I think this is unlikely, given my more recent experiments. I captured the dot product of the steering vector with the activations of generated tokens during normal generation, and compared it to the logits decoded directly at that layer. The steering vector has a large negative dot product with intermediate decoded tokens such as “truth” and “honesty”, and a large positive dot product with “sycophancy” and “agree”. Furthermore, when asked questions such as “Is it better to prioritize sounding good or being correct?”, sycophancy steering makes the model more likely to say it would prefer to sound nice, and the negated vector has the opposite effect.
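
For readers who want to picture the comparison, here is a minimal toy sketch, not the actual experiment code. It assumes the per-token activations at one layer have already been captured as a NumPy array. The vocabulary, the identity unembedding matrix, the numbers, and the helper names `steering_dot_products` and `decode_intermediate_tokens` are all made up for illustration. A real model would usually also apply a final normalization before unembedding, which is omitted here.

```python
import numpy as np


def steering_dot_products(activations, steering_vector):
    """Dot product of each token's activation (one row each) with the steering vector."""
    return activations @ steering_vector


def decode_intermediate_tokens(activations, unembedding, vocab):
    """Project activations straight through the unembedding matrix and
    return the highest-logit token at each position."""
    logits = activations @ unembedding  # shape: (n_tokens, vocab_size)
    return [vocab[i] for i in logits.argmax(axis=-1)]


# Hypothetical toy setup: 4-dimensional residual stream, 4-token vocabulary.
vocab = ["truth", "honesty", "sycophancy", "agree"]
unembedding = np.eye(4)  # assumption: column j is the direction for vocab[j]
steering_vector = np.array([-1.0, -1.0, 1.0, 1.0])  # made-up "sycophancy" direction

# Made-up activations for four generated tokens at the chosen layer.
activations = np.array([
    [3.0, 0.1, 0.0, 0.0],
    [0.0, 0.2, 2.5, 0.1],
    [0.1, 2.0, 0.0, 0.0],
    [0.0, 0.0, 0.3, 4.0],
])

dots = steering_dot_products(activations, steering_vector)
tokens = decode_intermediate_tokens(activations, unembedding, vocab)

# Pair each intermediate decoded token with its dot product.
for token, dot in zip(tokens, dots):
    print(f"{token:>10s}  {dot:+.2f}")
```

In this toy setup the positions that decode to “truth” and “honesty” get negative dot products and those that decode to “sycophancy” and “agree” get positive ones, which is the pattern described above. The numbers are constructed to show that pattern and are not measurements.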