I wonder to what extent this LoRA can be replaced with a steering vector? Like in https://arxiv.org/abs/2507.08218
An interesting thought. At its core, our technique aims to use a small number of parameters to alter only a specific part of the model’s behavior (being honest above all else) while preserving the model’s capabilities and its ability to access earlier mind states.
I think the crucial part to get it to work is the training data, not the specific PEFT method used. The quality of the data determines how well the model generalizes in the direction we want (honestly reporting its thoughts) rather than following surface heuristics.
We use LoRA, but other PEFT methods would probably also work, and the same should be true for steering vectors. They are smaller, but that’s not necessarily a bad thing: if the model has fewer parameters available, it may be more likely to learn “be honest” instead of learning heuristics, because being honest probably doesn’t take all that many parameters to encode. But this is pure speculation on my part.
This could be interesting to test as future work.
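To make the size comparison concrete, here’s a rough sketch (PyTorch, placeholder dimensions, not our actual setup): a steering vector is a single d-dimensional direction added to the residual stream, while a rank-r LoRA on just one d×d projection already carries 2·d·r parameters.

```python
import torch

# Rough parameter-count comparison; hidden size and rank are placeholders,
# not the values used in the paper.
d, r = 4096, 8

steering_vector_params = d             # one direction added to the residual stream
lora_params_per_layer = 2 * d * r      # A (r x d) plus B (d x r) on one projection
print(steering_vector_params, lora_params_per_layer)  # 4096 vs 65536

# Applying a steering vector is just adding a fixed direction at some layer;
# a forward hook is one common way to do it (hook point depends on the model).
def make_steering_hook(vec: torch.Tensor, alpha: float = 1.0):
    def hook(module, inputs, output):
        # HF decoder layers often return tuples; adjust for the actual model.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * vec.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook
```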
I’m curious because we have a bunch of methods now to interpret and let models introspect on steering vectors, which might produce mildly interesting results. According to the paper, it’s also easy (with the right injection point) to tell whether a LoRA for a layer is basically a conditional steering vector or a set of steering vectors: you can just look at a PCA of the weight rows.
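Something like this sketch, I imagine (illustrative only, assuming peft-style lora_A/lora_B matrices for one layer; the threshold is arbitrary):

```python
import torch

# Illustrative check (not from the paper): is a given LoRA update basically a
# single steering direction? Inspect the singular value spectrum of dW = B @ A.
# `lora_A` (r x d_in) and `lora_B` (d_out x r) follow the usual peft naming.
def lora_as_steering_vector(lora_A: torch.Tensor, lora_B: torch.Tensor, thresh: float = 0.9):
    delta_w = (lora_B @ lora_A).float()              # d_out x d_in low-rank update
    U, S, _ = torch.linalg.svd(delta_w, full_matrices=False)
    var = S**2 / (S**2).sum()                        # variance explained per component
    print(f"top component explains {var[0].item():.2%} of the update")
    # If one component dominates, the LoRA mostly writes a single output
    # direction, i.e. it acts like an input-conditioned steering vector.
    return U[:, 0] if var[0] > thresh else None
```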
It would also be interesting to see if the steering vectors you extract look like a persona vector https://www.anthropic.com/research/assistant-axis. Or maybe baseline our method against steering with off-the-shelf “honest” persona vectors.
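If you had both, the comparison is just a cosine similarity in the same layer’s residual stream (again only a sketch; `honesty_direction` and `persona_vector` are placeholders for whatever the two extraction methods produce):

```python
import torch.nn.functional as F

# Hypothetical comparison between an extracted "honesty" direction and a
# persona vector; both are assumed to live in the same layer's residual stream.
def direction_similarity(honesty_direction, persona_vector) -> float:
    return F.cosine_similarity(honesty_direction.float(),
                               persona_vector.float(), dim=0).item()

# The same persona vector could also be dropped into a steering hook like the
# one sketched above to get the "off-the-shelf honest persona" baseline.
```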
My intuition (i.e. wild-assed guess) is that a steering vector is probably too small to be optimal, but the optimal LoRA rank might be fairly small. An interesting thing to test.