Whenever I read yet another paper or discussion of activation steering to modify model behavior, my instinctive reaction is to slightly cringe at the naiveté of the idea. Training a model to do some task only to then manually tweak some of the activations or weights using a heuristic-guided process seems quite un-bitter-lesson-pilled. Why not just directly train for the final behavior you want—find better data, tweak the reward function, etc.?
But actually there may be a good reason to continue working on model-internals control (i.e. ways of influencing model behavior outside of modifying the text input or training process, by directly changing internal state). For some applications, you may want to express something in terms of the model’s own abstractions, something that you won’t know a priori how to do in text or via training data in fine-tuning. Throughout the training process, a model naturally learns a rich semantic activation space. And in some cases, the “cleanest” way to modify its behavior is by expressing the change in terms of its learned concepts, whose representations are sculpted by exaflops of compute.
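For concreteness, a minimal sketch of what such an intervention looks like mechanically: adding a fixed vector to the residual stream at one layer during the forward pass. This assumes a Llama-style HuggingFace model exposing `model.model.layers`; the layer index and coefficient in the usage comment are illustrative choices, not recommendations.

```python
import torch

def add_steering_hook(model, layer_idx: int, steering_vec: torch.Tensor, coeff: float = 4.0):
    """Register a forward hook that adds coeff * steering_vec to the hidden
    states produced by one decoder layer, on every token of every forward pass."""
    def hook(module, inputs, output):
        # HF decoder layers typically return a tuple (hidden_states, ...)
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coeff * steering_vec.to(hidden.device, hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return model.model.layers[layer_idx].register_forward_hook(hook)

# handle = add_steering_hook(model, layer_idx=13, steering_vec=v)
# ... generate with the intervention active ...
# handle.remove()
```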
I always thought the point of activation steering was for safety/alignment/interpretability/science/etc., not capabilities.
Not sure what distinction you’re making. I’m talking about steering for controlling behavior in production, not for red-teaming at eval time or to test interp hypotheses via causal interventions. However this still covers both safety (e.g. “be truthful”) and “capabilities” (e.g. “write in X style”) interventions.
Well, mainly I’m saying that “Why not just directly train for the final behavior you want” is answered by the classic reasons why you don’t always get what you trained for. (The mesaoptimizer need not have the same goals as the optimizer; the AI agent need not have the same goals as the reward function, nor the same goals as the human tweaking the reward function.) Your comment makes more sense to me if interpreted as about capabilities rather than about those other things.
> For some applications, you may want to express something in terms of the model’s own abstractions

It seems like this applies to some kinds of activation steering (e.g. steering on SAE features) but not really to others (e.g. contrastive prompts); curious whether you would agree.
Perhaps. I see where you are coming from. Though I think it’s possible that contrastive-prompt-based vectors (e.g. CAA) also approximate “natural” features better than training on those same prompts would (fewer degrees of freedom, with the right inductive bias). I should check whether there has been new research on this…
Thanks! If you find research that addresses that question, I’d be interested to know about it.
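For readers following the exchange above: a rough sketch of what a contrastive-prompt vector of the CAA flavor computes, namely a mean difference of activations between two prompt sets at one layer. This assumes a HuggingFace causal LM; the published CAA recipe uses paired multiple-choice prompts and reads the activation at the answer token, so treat this as the general shape rather than the exact method.

```python
import torch

@torch.no_grad()
def contrastive_steering_vector(model, tokenizer, pos_prompts, neg_prompts, layer_idx: int):
    """Mean difference of last-token hidden states between two prompt sets,
    taken at one layer of the residual stream."""
    def mean_last_token_act(prompts):
        acts = []
        for p in prompts:
            batch = tokenizer(p, return_tensors="pt").to(model.device)
            out = model(**batch, output_hidden_states=True)
            acts.append(out.hidden_states[layer_idx][0, -1])  # batch 0, last token
        return torch.stack(acts).mean(dim=0)
    return mean_last_token_act(pos_prompts) - mean_last_token_act(neg_prompts)
```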
> And in some cases, the “cleanest” way to modify its behavior is by expressing the change in terms of its learned concepts, whose representations are sculpted by exaflops of compute.

After all, what is an activation steering vector but a weirdly-constructed LoRA with rank 1[1]?
Ok, technically they’re not equivalent, because a LoRA acts on the activations in an input-dependent fashion, while activation steering acts on them in an input-independent fashion. But LLMs very consistently have outlier directions in activation space whose magnitudes are far larger than those of “normal” directions and approximately constant across inputs. LoRA adds \(AB^Tx\) to the activations. With \(r=1\), you can trivially align \(B^T\) with the outlier dimension, which makes \(B^Tx\) a scalar with value ≈ 1 (±0.06), and you can then project that back to a constant direction in activation space with \(A\). So given a steering vector, you can in practice make a *basically* equivalent but worse LoRA[2] in the models that exist today (a sketch of the construction follows the footnotes).
Don’t ask me how this even came up, and particularly don’t ask me what I was trying to do with serverless bring-your-own-LoRA inference. If you find yourself going down this path, consider your life choices. This way lies pain. See if you can just use Goodfire.
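Here is a sketch of the footnote-1 construction, in the comment’s own \(A, B\) notation rather than any particular LoRA library’s naming. It assumes you attach the rank-1 adapter to a linear layer whose input actually carries the outlier dimension and whose output feeds the residual stream you want to steer; `outlier_idx` and `outlier_magnitude` are quantities you would have to measure for your model (the typical signed value of that coordinate), not known constants.

```python
import torch

def rank1_lora_from_steering_vector(steering_vec: torch.Tensor,
                                    outlier_idx: int,
                                    outlier_magnitude: float):
    """Build rank-1 factors A (d x 1) and B (d x 1) such that the LoRA delta
    A @ B.T @ x is approximately steering_vec for typical inputs x, because
    x[outlier_idx] is approximately outlier_magnitude regardless of input."""
    d = steering_vec.shape[0]
    B = torch.zeros(d, 1)
    B[outlier_idx, 0] = 1.0 / outlier_magnitude  # so B.T @ x ≈ 1
    A = steering_vec.reshape(d, 1)               # projects that scalar back out
    return A, B

# The corresponding dense weight delta, if you'd rather skip the adapter entirely:
# delta_W = A @ B.T   # shape (d, d): steering_vec outer e_k / outlier_magnitude
```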
Tinker is an API for LoRA PEFT. You don’t mention it directly, but it’s trendy enough that I thought your comment was a reference to it.
Several such APIs exist. My thought was “I’d like to play with the Llama Scope SAE features without having to muck about with vLLM, and Together lets you upload a LoRA directly”, and I failed to notice that the SAE was for the base model while Together only supports LoRAs for the instruct model.
The fun thing about this LoRA hack is that you don’t actually have to train the LoRA: if you know the outlier direction and magnitude for your model and the activation addition you want to apply, you can write straight to the weights (see the sketch below). The unfun thing is that it’s deeply cursed and also doesn’t even save you from having to mess with vLLM.
Edit: on reflection, I do think rank 1 LoRAs might be an underappreciated interpretability tool.
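A sketch of the “write straight to the weights” version: the rank-1 delta is just an outer product, so you can fold it into whichever linear layer the LoRA would have attached to. `layer` and the `A`/`B` factors here are the hypothetical ones from the earlier sketch, and whether the edit survives your serving stack’s quantization and weight formats is a separate question.

```python
import torch

@torch.no_grad()
def fold_steering_into_weights(layer: torch.nn.Linear, A: torch.Tensor, B: torch.Tensor):
    """Add the rank-1 delta A @ B.T directly to a linear layer's weight matrix,
    so (W + A B^T) x = W x + A B^T x ≈ W x + steering_vec, with no adapter at all."""
    layer.weight += (A @ B.T).to(layer.weight.dtype)
```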