Neel Nanda comments on Fabien’s Shortform

Neel Nanda 16 Aug 2025 5:44 UTC
LW: 2 AF: 2
Interesting. A related experiment that seems fun: do subliminal learning, but rather than giving a system prompt or fine-tuning, you just add a steering vector for some concept and see if that transfers the concept/if there’s a change in activations corresponding to that vector. Then try this for an arbitrary vector. If it’s really about concepts or personas, then interpretable steering vectors should do much better. You could get a large sample size by using SAE decoder vectors.
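To make the measurement concrete, here is a rough sketch of the two pieces: adding a steering vector to activations, and scoring how far activations moved along that vector. Everything in it is a stand-in I made up for illustration: the arrays are random NumPy data rather than real model activations, `steer` and `shift_along` are hypothetical helper names, and `concept_vec` only plays the role of an SAE decoder vector. In the actual experiment you would compare the student's activations before and after training on the steered teacher's outputs, not the steered activations themselves.

```python
import numpy as np


def steer(acts, vector, scale=4.0):
    """Add a scaled, unit-normalised steering vector to every row of acts."""
    unit = vector / np.linalg.norm(vector)
    return acts + scale * unit


def shift_along(base_acts, new_acts, vector):
    """Mean change in activations, projected onto the unit steering direction."""
    unit = vector / np.linalg.norm(vector)
    return float(((new_acts - base_acts) @ unit).mean())


rng = np.random.default_rng(0)
d_model = 64

# Random stand-ins (assumptions, not real data):
base = rng.normal(size=(128, d_model))     # activations without steering
concept_vec = rng.normal(size=d_model)     # plays the role of an SAE decoder vector
arbitrary_vec = rng.normal(size=d_model)   # the "arbitrary vector" control

steered = steer(base, concept_vec, scale=4.0)

# Shift measured along the steered direction vs. along the unrelated direction.
shift_concept = shift_along(base, steered, concept_vec)
shift_arbitrary = shift_along(base, steered, arbitrary_vec)
print(shift_concept, shift_arbitrary)
```

The same `shift_along` score could be computed for each of many SAE decoder vectors and for matched random vectors, which gives the large-sample comparison between interpretable and arbitrary directions.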