And in some cases, the “cleanest” way to modify its behavior is by expressing the change in terms of its learned concepts, whose representations are sculpted by exaflops of compute.
After all, what is an activation steering vector but a weirdly-constructed LoRA with rank 1[1]?
Ok technically they’re not equivalent, because LoRAs operate on activations in an input-dependent fashion while activation steering operates on them in an input-independent fashion. But LLMs very consistently have outlier directions in activation space whose magnitudes are far larger than “normal” directions and approximately constant across inputs. LoRA adds \(AB^Tx\) to the activations. With \(r=1\), you can trivially align \(B\) with the outlier dimension, which makes \(B^Tx\) a scalar with value ≈ 1 (±0.06), and you can then project that back onto a constant direction in activation space with \(A\). So given a steering vector, you can in practice make a *basically* equivalent but worse LoRA[2] in the models that exist today.
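Concretely, here’s a minimal PyTorch sketch of that construction; the hidden size, outlier index, outlier magnitude, and steering vector are all made-up placeholders you’d measure on the actual model:

```python
import torch

d_model = 4096                           # toy hidden size (placeholder)
steering_vec = torch.randn(d_model)      # the activation addition you want (placeholder)
outlier_idx = 2231                       # index of the outlier activation dimension (placeholder)
outlier_mag = 120.0                      # its roughly input-independent magnitude (placeholder)

# Rank-1 LoRA in the notation above: delta(x) = A B^T x, with A and B of shape (d_model, 1).
B = torch.zeros(d_model, 1)
B[outlier_idx, 0] = 1.0 / outlier_mag    # so B^T x ≈ 1 for any realistic activation x
A = steering_vec.unsqueeze(1)            # so A (B^T x) ≈ steering_vec

# Check on a fake activation whose outlier dimension sits at its usual magnitude.
x = torch.randn(d_model)
x[outlier_idx] = outlier_mag
delta = A @ (B.T @ x)                    # ≈ steering_vec, up to the ~±6% wobble in the outlier
print(torch.allclose(delta, steering_vec))
```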
Don’t ask me how this even came up, and particularly don’t ask me what I was trying to do with serverless bring-your-own-LoRA inference. If you find yourself going down this path, consider your life choices. This way lies pain. See if you can just use Goodfire.
Tinker is an API for LoRA PEFT. You don’t mention it directly, but it’s trendy enough that I thought your comment was a reference to it.
Several such APIs exist. My thought was “I’d like to play with the Llama Scope SAE features without having to muck about with vLLM, and Together lets you upload a LoRA directly”, and I failed to notice that the SAE was for the base model and Together only supports LoRAs for the instruct model.
The fun thing about this LoRA hack is that you don’t actually have to train the LoRA: if you know the outlier direction and magnitude for your model and the activation addition you want to apply, you can write straight to the weights. The unfun thing is that it’s deeply cursed and also doesn’t even save you from having to mess with vLLM.
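For what “writing straight to the weights” could look like, here’s a hedged sketch that dumps the rank-1 factors into a PEFT-style adapter file with no training. The target module, layer, and key names are assumptions (diff them against an adapter actually exported for your model), and the trick only works if the module you pick both sees the outlier dimension at its input and writes where you want the steering to land:

```python
import torch
from safetensors.torch import save_file

d_model = 4096                                 # placeholders, as in the sketch above
steering_vec = torch.randn(d_model)
outlier_idx, outlier_mag = 2231, 120.0

# PEFT-style adapters store lora_A: (r, d_in) and lora_B: (d_out, r); the applied delta is
# (lora_alpha / r) * lora_B @ lora_A @ x, so keep lora_alpha == r == 1 in adapter_config.json.
lora_A = torch.zeros(1, d_model)
lora_A[0, outlier_idx] = 1.0 / outlier_mag     # lora_A @ x ≈ 1, if this module's input has the outlier
lora_B = steering_vec.unsqueeze(1)             # projects that scalar onto the steering direction

# Hypothetical target module and key naming; verify against a real exported adapter.
layer = 16
prefix = f"base_model.model.model.layers.{layer}.mlp.down_proj"
save_file(
    {f"{prefix}.lora_A.weight": lora_A, f"{prefix}.lora_B.weight": lora_B},
    "adapter_model.safetensors",
)
```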
Edit: on reflection, I do think rank 1 LoRAs might be an underappreciated interpretability tool.