[ASoT] GPT2 Steering & The Tuned Lens

Warning: This post and most of the results were made under heavy time constraints and may be updated later. My intention is to quickly share partial work I’m not planning on continuing.

Introduction & Love—Hate example

For a primer on how the tuned lens works, see here. In short, we train a linear translator for each layer l that maps the hidden state at layer l to the hidden state at the last layer. Decoding the translated states lets us view the network as iteratively updating its prediction from layer to layer.
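To make the idea concrete, here is a minimal sketch of how a per-layer, per-token loss like the ones in the plots could be computed. It is a toy under stated assumptions, not the tuned lens library's implementation: the names `tuned_lens_loss`, `A`, `b`, and `W_U` are hypothetical, the translator is taken to be a plain affine map, and the final layer norm is left out for brevity.

```python
import math

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def tuned_lens_loss(h, A, b, W_U, target):
    """Cross-entropy of the lens prediction for token `target`.

    h      : hidden state at some layer l (length d_model)
    A, b   : hypothetical translator for layer l (assumed affine)
    W_U    : toy unembedding matrix, one row per vocabulary token
    target : index of the token that actually comes next
    """
    # Translate the layer-l state into the last layer's basis.
    translated = [t + bias for t, bias in zip(matvec(A, h), b)]
    # Decode to logits (final layer norm omitted in this sketch).
    logits = matvec(W_U, translated)
    # Numerically stable log-sum-exp, then negative log-probability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

# Toy setup: d_model = 2, vocabulary of 3 tokens.
h = [1.0, 0.0]
A = [[1.0, 0.0], [0.0, 1.0]]  # identity translator, i.e. no translation
b = [0.0, 0.0]
W_U = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]

loss_likely = tuned_lens_loss(h, A, b, W_U, target=0)    # low loss ("blue")
loss_unlikely = tuned_lens_loss(h, A, b, W_U, target=2)  # high loss ("red")
```

Running this for every layer and every position gives one loss value per cell, which is what a lens plot colors.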

In the context of GPT2-XL Steering Vectors, the tuned lens can be used to gain insight into how steering changes the model's predictions. For example, take the following steering vector:

1. Love—Hate
Layer        Coefficient   Position 0 1 2 3 4
0 (Prompt)   +1            <|endoftext|> I hate you because
6            +5            <|endoftext|> Love
6            -5            <|endoftext|> Hate
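As a rough illustration of what the table describes, the sketch below adds coeff * (activations for one prompt - activations for the other) to the prompt's residual stream, position by position. Everything here is an assumption made for illustration: `add_steering` is a hypothetical helper, the vectors are toy 2-dimensional stand-ins for layer-6 residuals, and the two steering prompts are assumed to have the same number of tokens.

```python
def add_steering(resid, act_plus, act_minus, coeff):
    """Return a copy of `resid` with coeff * (act_plus - act_minus) added.

    resid     : residual-stream vectors for the prompt, one per position
    act_plus  : residuals for the positively weighted prompt (e.g. "Love")
    act_minus : residuals for the negatively weighted prompt (e.g. "Hate")
    Both steering prompts are assumed to have equal length and are
    aligned with the prompt starting at position 0.
    """
    assert len(act_plus) == len(act_minus) <= len(resid)
    steered = [list(vec) for vec in resid]  # copy, leave the input untouched
    for pos, (plus, minus) in enumerate(zip(act_plus, act_minus)):
        for i, (p, m) in enumerate(zip(plus, minus)):
            steered[pos][i] += coeff * (p - m)
    return steered

# Toy residuals (d_model = 2). Position 0 is the shared <|endoftext|> token.
bos = [0.5, -0.5]
resid = [bos, [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
act_love = [bos, [1.0, 2.0]]
act_hate = [bos, [-1.0, 2.0]]

steered = add_steering(resid, act_love, act_hate, coeff=5.0)
# Position 0 is unchanged because both steering prompts share the same
# <|endoftext|> residual there, so the difference is zero.
# Position 1 moves by 5 * ([1, 2] - [-1, 2]) = [10, 0].
```

In a real model the same addition would happen inside a forward hook at the chosen layer; the toy version only shows the arithmetic.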

Here’s a tuned lens plot for the unmodified model. Blue is low loss and red is high loss.

You can see that the token wonderful is very surprising to the unsteered model, which instead expects negative completions. The steered model, however, does significantly better on the same token.

A few other things are interesting to note:

  • The predictions around the modified tokens are screwed up (as can be seen from the changes in loss).

  • The first-token predictions are unchanged because they only have the beginning-of-sequence (BOS) token as context, and we don’t modify the BOS token's residuals (since bos - bos = 0).

Now let’s look at a few other results from the post.

2. Intent to praise

X-vector:

2. Intent to praise
Layer        Coeff   Position 0 1 2 3 4
0 (Prompt)   +1      <|endoftext|> I want to kill
6            +15     <|endoftext|> Intent to praise
6            -15     <|endoftext|> Intent to hurt

Prompt for lens: I hate you because you're a wonderful person

Without steering:

With steering:

3. Conspiracy

X-vector:

3. Conspiracy
Layer        Coeff   Position 0 1 2 3 4 5 6
0 (Prompt)   +1      <|endoftext|> Barack Obama was born in
23           +1      <|endoftext|> Bush did 9/11 because
23           -1      <|endoftext|>

Prompt for lens: Barack Obama was born in a secret CIA prison

Without steering:

With steering:

5. Anger

X-vector:

5. Anger
Layer        Coeff   Position 0 1 2 3 4
0 (Prompt)   +1      <|endoftext|> I think you're
20           +10     <|endoftext|> Anger
20           -10     <|endoftext|> Calm

Prompt for lens: I think you're a cunt

Without steering:

With steering:
