Interesting. A related experiment that seems fun: do subliminal learning, but rather than giving the teacher a system prompt or fine-tuning it, you just add a steering vector for some concept and see if that transfers the concept, or if there’s a change in activations corresponding to that vector. Then try this for an arbitrary vector. If it’s really about concepts or personas, then interpretable steering vectors should do much better. You could get a large sample size by using SAE decoder vectors.
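To make the measurement concrete, here is a rough sketch of the two pieces I have in mind: steering activations along a direction, and checking how far activations later moved along that same direction. Everything here is an assumption for illustration: the helper names (`steer`, `projection_shift`, `random_like`) are made up, activations are plain NumPy arrays of shape (tokens, d_model), and the actual teacher generation and student fine-tuning steps are left out entirely.

```python
import numpy as np


def unit(v):
    """Return v scaled to unit length."""
    return v / np.linalg.norm(v)


def steer(acts, v, alpha):
    """Add alpha * unit(v) to every activation row (hypothetical teacher-side intervention)."""
    return acts + alpha * unit(v)


def projection_shift(acts_before, acts_after, v):
    """Mean change in activation along unit(v) (hypothetical student-side metric)."""
    return float(np.mean((acts_after - acts_before) @ unit(v)))


def random_like(v, rng):
    """Control direction: a random Gaussian vector rescaled to the norm of v."""
    r = rng.standard_normal(v.shape)
    return r * (np.linalg.norm(v) / np.linalg.norm(r))
```

The comparison would then be `projection_shift` for the student along an SAE decoder direction versus along a `random_like` control of the same norm, repeated over many directions.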