Whenever I read yet another paper or discussion of activation steering to modify model behavior, my instinctive reaction is to slightly cringe at the naiveté of the idea. Training a model to do some task only to then manually tweak some of the activations or weights using a heuristic-guided process seems quite un-bitter-lesson-pilled. Why not just directly train for the final behavior you want—find better data, tweak the reward function, etc.?
But actually there may be a good reason to continue working on model-internals control (i.e. ways of influencing model behavior outside of modifying the text input or training process, by directly changing internal state). For some applications, you may want to express something in terms of the model’s own abstractions, something that you won’t know a priori how to do in text or via training data in fine-tuning. Throughout the training process, a model naturally learns a rich semantic activation space. And in some cases, the “cleanest” way to modify its behavior is by expressing the change in terms of its learned concepts, whose representations are sculpted by exaflops of compute.
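For concreteness, a minimal sketch of what such an intervention looks like mechanically: adding a fixed vector to the residual stream at one layer during the forward pass. This assumes a Llama-style HuggingFace model exposing `model.model.layers`; the layer index and coefficient in the usage comment are illustrative choices, not recommendations.

```python
import torch

def add_steering_hook(model, layer_idx: int, steering_vec: torch.Tensor, coeff: float = 4.0):
    """Register a forward hook that adds coeff * steering_vec to the hidden
    states produced by one decoder layer, on every token of every forward pass."""
    def hook(module, inputs, output):
        # HF decoder layers typically return a tuple (hidden_states, ...)
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coeff * steering_vec.to(hidden.device, hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return model.model.layers[layer_idx].register_forward_hook(hook)

# handle = add_steering_hook(model, layer_idx=13, steering_vec=v)
# ... generate with the intervention active ...
# handle.remove()
```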
I always thought the point of activation steering was for safety/alignment/interpretability/science/etc., not capabilities.
Not sure what distinction you’re making. I’m talking about steering for controlling behavior in production, not for red-teaming at eval time or to test interp hypotheses via causal interventions. However this still covers both safety (e.g. “be truthful”) and “capabilities” (e.g. “write in X style”) interventions.
Well, mainly I’m saying that “Why not just directly train for the final behavior you want” is answered by the classic reasons why you don’t always get what you trained for. (The mesaoptimizer need not have the same goals as the optimizer; the AI agent need not have the same goals as the reward function, nor the same goals as the human tweaking the reward function.) Your comment makes more sense to me if interpreted as about capabilities rather than about those other things.
> For some applications, you may want to express something in terms of the model’s own abstractions

It seems like this applies to some kinds of activation steering (e.g. steering on SAE features) but not really to others (e.g. contrastive prompts); curious whether you would agree.
Perhaps. I see where you are coming from. Though I think it’s possible that contrastive-prompt-based vectors (e.g. CAA) also approximate “natural” features better than training on those same prompts would (fewer degrees of freedom, with the right inductive bias). I should check whether there has been new research on this…
Thanks! If you find research that addresses that question, I’d be interested to know about it.
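For readers following the exchange above: a rough sketch of what a contrastive-prompt vector of the CAA flavor computes, namely a mean difference of activations between two prompt sets at one layer. This assumes a HuggingFace causal LM; the published CAA recipe uses paired multiple-choice prompts and reads the activation at the answer token, so treat this as the general shape rather than the exact method.

```python
import torch

@torch.no_grad()
def contrastive_steering_vector(model, tokenizer, pos_prompts, neg_prompts, layer_idx: int):
    """Mean difference of last-token hidden states between two prompt sets,
    taken at one layer of the residual stream."""
    def mean_last_token_act(prompts):
        acts = []
        for p in prompts:
            batch = tokenizer(p, return_tensors="pt").to(model.device)
            out = model(**batch, output_hidden_states=True)
            acts.append(out.hidden_states[layer_idx][0, -1])  # batch 0, last token
        return torch.stack(acts).mean(dim=0)
    return mean_last_token_act(pos_prompts) - mean_last_token_act(neg_prompts)
```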
> And in some cases, the “cleanest” way to modify its behavior is by expressing the change in terms of its learned concepts, whose representations are sculpted by exaflops of compute.

After all, what is an activation steering vector but a weirdly-constructed LoRA with rank 1[1]?
Ok, technically they’re not equivalent, because a LoRA acts on the activations in an input-dependent fashion, while activation steering acts on them in an input-independent fashion. But LLMs very consistently have outlier directions in activation space whose magnitudes are far larger than those of “normal” directions and approximately constant across inputs. LoRA adds \(AB^Tx\) to the activations. With \(r=1\), you can trivially align \(B^T\) with the outlier dimension, which makes \(B^Tx\) a scalar with value ≈ 1 (±0.06), and you can then project that back to a constant direction in activation space with \(A\). So given a steering vector, you can in practice make a *basically* equivalent but worse LoRA[2] in the models that exist today (a sketch of the construction follows the footnotes).
Don’t ask me how this even came up, and particularly don’t ask me what I was trying to do with serverless bring-your-own-LoRA inference. If you find yourself going down this path, consider your life choices. This way lies pain. See if you can just use Goodfire.
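Here is a sketch of the footnote-1 construction, in the comment’s own \(A, B\) notation rather than any particular LoRA library’s naming. It assumes you attach the rank-1 adapter to a linear layer whose input actually carries the outlier dimension and whose output feeds the residual stream you want to steer; `outlier_idx` and `outlier_magnitude` are quantities you would have to measure for your model (the typical signed value of that coordinate), not known constants.

```python
import torch

def rank1_lora_from_steering_vector(steering_vec: torch.Tensor,
                                    outlier_idx: int,
                                    outlier_magnitude: float):
    """Build rank-1 factors A (d x 1) and B (d x 1) such that the LoRA delta
    A @ B.T @ x is approximately steering_vec for typical inputs x, because
    x[outlier_idx] is approximately outlier_magnitude regardless of input."""
    d = steering_vec.shape[0]
    B = torch.zeros(d, 1)
    B[outlier_idx, 0] = 1.0 / outlier_magnitude  # so B.T @ x ≈ 1
    A = steering_vec.reshape(d, 1)               # projects that scalar back out
    return A, B

# The corresponding dense weight delta, if you'd rather skip the adapter entirely:
# delta_W = A @ B.T   # shape (d, d): steering_vec outer e_k / outlier_magnitude
```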
Tinker is an API for LoRA PEFT. You don’t mention it directly, but it’s trendy enough that I thought your comment was a reference to it.
Several such APIs exist. My thought was “I’d like to play with the Llama Scope SAE features without having to muck about with vLLM, and Together lets you upload a LoRA directly”, and I failed to notice that the SAE was for the base model while Together only supports LoRAs for the instruct model.
The fun thing about this LoRA hack is that you don’t actually have to train the LoRA: if you know the outlier direction and magnitude for your model and the activation addition you want to apply, you can write straight to the weights (see the sketch below). The unfun thing is that it’s deeply cursed and also doesn’t even save you from having to mess with vLLM.
Edit: on reflection, I do think rank 1 LoRAs might be an underappreciated interpretability tool.
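A sketch of the “write straight to the weights” version: the rank-1 delta is just an outer product, so you can fold it into whichever linear layer the LoRA would have attached to. `layer` and the `A`/`B` factors here are the hypothetical ones from the earlier sketch, and whether the edit survives your serving stack’s quantization and weight formats is a separate question.

```python
import torch

@torch.no_grad()
def fold_steering_into_weights(layer: torch.nn.Linear, A: torch.Tensor, B: torch.Tensor):
    """Add the rank-1 delta A @ B.T directly to a linear layer's weight matrix,
    so (W + A B^T) x = W x + A B^T x ≈ W x + steering_vec, with no adapter at all."""
    layer.weight += (A @ B.T).to(layer.weight.dtype)
```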