I’m curious because we now have a bunch of methods for interpreting steering vectors and for letting models introspect on them, which might produce mildly interesting results. According to the paper, it’s also easy (with the right injection point) to tell whether a LoRA for a layer is basically a conditional steering vector or a set of steering vectors: you can just look at the PCA of the weight rows.
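To make the "weight row PCA" check concrete, here is a rough sketch of how I would read it; the paper's exact procedure may differ. Assumptions on my side: the LoRA update is `delta_W = B @ A` with `B` of shape `(d_out, r)` and `A` of shape `(r, d_in)`, "PCA" means an uncentered PCA via SVD, and the function name `lora_steering_check` is made up for illustration. If nearly all of the energy sits in the top component, the update only ever writes along one output direction, which is what a conditional steering vector would look like.

```python
import numpy as np

def lora_steering_check(B, A, k=1):
    """Uncentered PCA (via SVD) of the LoRA update delta_W = B @ A.

    B: (d_out, r) LoRA up-projection, A: (r, d_in) LoRA down-projection.
    Returns (explained, direction):
      explained - fraction of squared singular-value mass in the top-k components
      direction - top left singular vector, i.e. the dominant output direction
    """
    delta_w = B @ A
    U, s, _ = np.linalg.svd(delta_w, full_matrices=False)
    energy = s ** 2
    total = energy.sum()
    if total == 0.0:
        # Untrained LoRA (B initialised to zero): nothing to analyse.
        return 0.0, U[:, 0]
    explained = float(energy[:k].sum() / total)
    return explained, U[:, 0]
```

A value of `explained` near 1 for `k=1` suggests a single steering direction; if you need a larger `k` to get there, that looks more like a small set of steering vectors. The sign of `direction` is arbitrary, so compare it to other vectors by absolute cosine similarity.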
It would also be interesting to see whether the steering vectors you extract look like a persona vector (https://www.anthropic.com/research/assistant-axis). Or we could baseline our method against steering with out-of-the-box “honest” persona vectors.
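For the "looks like a persona vector" comparison, a minimal sketch could be plain cosine similarity plus a random-direction baseline, so we know what "similar" means at a given hidden size. Both function names are hypothetical, and this assumes the extracted vector and the persona vector live in the same layer's residual space.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def random_cosine_threshold(dim, n_samples=1000, quantile=0.99, seed=0):
    """Quantile of |cosine| between pairs of random Gaussian directions.

    Gives a rough chance-level reference: an observed |cosine| well above
    this value is unlikely to come from two unrelated directions.
    """
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(n_samples, dim))
    b = rng.normal(size=(n_samples, dim))
    cos = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    )
    return float(np.quantile(np.abs(cos), quantile))
```

Usage would be something like comparing `abs(cosine(extracted_vec, persona_vec))` against `random_cosine_threshold(extracted_vec.shape[0])`, where `extracted_vec` and `persona_vec` are placeholders for whatever vectors we end up with.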