Here are very quick thoughts on a candidate approach, copied from a Discord conversation:
Possibly you could take a weaker/shallower transformer (like gpt2-small), use model stitching to learn a linear map from layer 3/12 of gpt2-small to layer 12/48 of gpt2-xl, then take a “detective story vector” in gpt2-small and pass it through the learned linear transform to get a detective story vector in gpt2-xl.
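As a rough sketch of what "learn a linear map" could mean here: fit a least-squares stitching matrix on paired activations from the two models at the chosen layers, then push the steering vector through it. Everything below is hypothetical — the dimensions match gpt2-small (768) and gpt2-xl (1600), but the "activations" are synthetic stand-ins rather than real residual-stream samples.

```python
import numpy as np

rng = np.random.default_rng(0)
d_small, d_large, n_samples = 768, 1600, 2000  # gpt2-small / gpt2-xl widths

# Stand-ins for residual-stream activations collected on the same prompts:
# rows of acts_small would come from layer 3/12 of gpt2-small, rows of
# acts_large from layer 12/48 of gpt2-xl. Here they are synthetic, related
# by a hidden linear map plus noise, just to make the sketch runnable.
acts_small = rng.normal(size=(n_samples, d_small))
hidden_map = rng.normal(size=(d_small, d_large)) / np.sqrt(d_small)
acts_large = acts_small @ hidden_map + 0.01 * rng.normal(size=(n_samples, d_large))

# Fit the stitching map W by least squares: minimize ||acts_small @ W - acts_large||^2.
W, *_ = np.linalg.lstsq(acts_small, acts_large, rcond=None)

# A "detective story vector" in gpt2-small (e.g. an activation difference
# between detective prompts and neutral prompts) then translates as:
detective_vec_small = rng.normal(size=(d_small,))
detective_vec_large = detective_vec_small @ W  # candidate steering vector for gpt2-xl
print(detective_vec_large.shape)  # (1600,)
```

In practice you would collect the paired activations with forward hooks on both models over a shared prompt set, and the map need not be this clean — the least-squares fit is just the simplest version of stitching.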
The main reason you'd do this is if you were worried that a normal steering vector in the powerful model wouldn't do the real thing you wanted (e.g. because of deception, or something else interfering with ordinary gpt2-xl activation additions).
Insofar as the weak model (gpt2-small) has a “detective” direction, you should be able to translate it into the corresponding direction in the powerful model (gpt2-xl).
And insofar as you verify that:
The detective-story vector works in gpt2-small (as evidenced by increased generation of detective stories)
The linearly translated vector seems to work in gpt2-xl (as evidenced by increased generation of detective stories)
This seems like good evidence of having “actually steered” the more powerful model (gpt2-xl, in this example), because neither gpt2-small's steering vector nor the stitching map was optimized directly for the translated vector to work. Instead, it's more of an “emergent structure” you'd expect if you had really found a direction in story-space implied by the weaker model.
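The verification step above amounts to comparing how often steered and unsteered generations count as detective stories. A minimal sketch of that comparison, with the sampler and the judge as stubs (in practice you'd add the translated vector to gpt2-xl's residual stream during sampling, and the judge would be a classifier or human ratings):

```python
import random

random.seed(0)

def sample_story(steered: bool) -> str:
    # Stub for sampling from gpt2-xl with/without the translated vector
    # added at the target layer. The steered stub simply emits
    # detective-flavored text more often, standing in for a real model.
    p_detective = 0.7 if steered else 0.2
    if random.random() < p_detective:
        return "the detective examined the clue"
    return "a quiet afternoon in the garden"

def is_detective(story: str) -> bool:
    # Stub judge; in practice a trained classifier or human labels.
    return "detective" in story

n = 1000
base_rate = sum(is_detective(sample_story(False)) for _ in range(n)) / n
steered_rate = sum(is_detective(sample_story(True)) for _ in range(n)) / n
print(base_rate, steered_rate)  # steering should raise the detective rate
```

The evidential point is that the increase in `steered_rate` over `base_rate` was never an optimization target for the stitching map, which is what makes it informative.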
(I think gpt2-small is too weak for this experiment to actually work; I'd expect you'd need more modern models, like gpt2-xl → Vicuna-13B.)