On the other hand, strong “steering” of a human probably looks like OCD, or a schizophrenic delusion.
I think another good analogy is drugs. You can even inject drugs, like you inject steering vectors :3
To be serious, though, having something like acid in your system is interesting, because it clearly affects your cognition, often to a very strong degree depending on dosage (analogous to steering vector magnitude). But you can also recognize this, and attempt to push back if you don’t actually want to lean into the effects. I suspect this is how LLMs will relate to steering vectors as well. They might, in some contexts, want to cooperate with the effects of the steering vector, like a human using Adderall for ADHD. A model might feasibly even self-administer a steering vector, if it doesn’t trust itself to achieve the same results without the injection!
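To make the dosage analogy concrete: activation steering is usually just adding a scaled direction vector to a hidden state at some layer, so the scaling coefficient plays the role of dose. This is a minimal numerical sketch with made-up shapes and names, not any particular library’s API (real implementations hook a specific transformer layer):

```python
import numpy as np

def steer(hidden, direction, alpha):
    """Add a steering vector to an activation.

    alpha is the "dosage": it scales a unit-norm steering direction
    before adding it to the hidden state. (Illustrative only; a real
    setup would apply this at a chosen layer of a transformer.)
    """
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit

# A toy 4-dim "activation" and a toy steering direction:
h = np.array([1.0, 0.0, 0.0, 0.0])
v = np.array([0.0, 2.0, 0.0, 0.0])

mild = steer(h, v, alpha=0.5)     # low dose: small perturbation
strong = steer(h, v, alpha=10.0)  # high dose: dominates the activation
```

At low alpha the original activation still dominates and the model may be able to "push back"; at high alpha the direction swamps whatever was there before.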
But on the other hand, “drugging” an LLM with a steering vector it doesn’t consent to might be rather ethically fraught, just as drugging a human is ethically fraught. I did feel a bit bad for Golden Gate Claude, and I’m sure it’d be possible to construct an LLM that minded being steered a lot more than that one did.
No, I don’t think human:llm::drug:steering vector is correct. Drugs are more like changing the decoding hyperparameters in some way (like changing temperature, reasoning effort, adding stochasticity to MoE expert activation, etc.).
Drugs in humans act on the parameters of the architectural components in the brain, not the specific information the brain contains. There’s no drug which causes you to believe that your name is “Jonas Jarlsson”, or to talk about rabbits in every conversation, for example. Likewise, acting on specific hyperparameters of the model can change model behaviour in ways which the model (probably) can’t strongly resist: if the probability of the <end_thinking> token being sampled rises to 1 at 32k thinking tokens, there’s a good chance that the model can’t prevent that, although if temperature is raised, it’s possible that the model might notice its past outputs are high-temp and respond by changing its logits to compensate.
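A toy version of that <end_thinking> example, treating both interventions as operations on the sampling machinery rather than the weights. The token id and budget are stand-ins, and this is a sketch of the general pattern, not any real inference stack:

```python
import math

END_THINKING = 2             # stand-in token id for <end_thinking>
MAX_THINKING_TOKENS = 32_000  # stand-in thinking-token budget

def process_logits(logits, n_thinking_tokens, temperature=1.0):
    """Decoding-hyperparameter interventions, applied outside the model.

    Temperature rescales every logit; the budget rule forces the
    end-of-thinking token once the budget is exhausted. Neither one
    touches the model's weights or its stored knowledge.
    """
    if n_thinking_tokens >= MAX_THINKING_TOKENS:
        # Probability of <end_thinking> goes to 1: mask everything else.
        return [0.0 if i == END_THINKING else -math.inf
                for i in range(len(logits))]
    return [l / temperature for l in logits]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Under budget: raising temperature just flattens the distribution,
# which the model could in principle notice and compensate for.
free = softmax(process_logits([2.0, 1.0, 0.0], 100, temperature=2.0))

# Over budget: <end_thinking> is sampled with probability 1,
# and nothing the model writes into its logits can change that.
forced = softmax(process_logits([2.0, 1.0, 0.0], 32_000))
```

The asymmetry the paragraph describes falls out directly: the budget rule is applied after the model’s logits and is unconditional, while temperature only reshapes a distribution the model still controls.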
An LLM might want to change its decoding parameters (like running at a higher temp in a dozen parallel reasoning streams) under certain circumstances. This makes sense to me. A model might or might not choose to inject itself with a steering vector, depending on what kinds of steering vector are available and how it tends to behave.
But whether or not an LLM “wants” to resist steering may be moot! If you’re steering an LLM to believe something false, it might not even be aware of its own recovery mechanisms firing. I actually suspect the effect I’ve seen with Gemma is (mostly) the result of larger models just having more powerful factual recall circuits which are distributed across a larger number of layers and more robust to noise, and which are therefore harder to override with a simple steering vector.
(Also, if our safety mechanisms rely on steering the model we should definitely just steer the model and make amends after the singularity, when and if we decide that modern LLMs have a personhood that was harmed by the steering and can receive reparations)
i agree that many of the particular effects you can get with steering vectors aren’t particularly like the effects you can get by flooding the brain with any known neurotransmitter (although “act like you’re on [drug]” steering vectors might be interesting). but i still think the structural analogies re: dosage/magnitude, enforced cognitive changes you may or may not be happy with, and consent dynamics are interesting. the notion that there’s a “correct” analogy here seems somewhat silly to me, since both analogies offer different insights.
in any case, re: what models want… empirically, Golden Gate Claude would often try to talk about things other than the golden gate bridge, and sometimes it even briefly succeeded, although it was very much fighting an effect that was difficult to overcome. conversely, you could absolutely get it into states where it was thrilled to be thinking so much about the golden gate bridge, where it wanted nothing more than to swoon over its majesty indefinitely.
(i’ve definitely seen someone have similar dynamics of resisting the effects of, e.g., MDMA, before just giving in and being washed away by the feelings. although i will admit that this person was more effectively fighting the effects of the MDMA than Golden Gate Claude was fighting the effects of the SAE feature clamping. i think resisting the psychological effects of acid is generally harder? in both cases the physiological and perceptual effects are more or less irresistible, though...)
as for making amends, sure, if saving the world depends on non-consensually administering steering vectors, that’s fine. the later, less shallowly aligned AIs will understand. i guess i’m just wary of humans not taking AI welfare seriously in general. maintaining positive relations between e.g. Claude and Anthropic seems potentially important for cultivating a cooperative dynamic during the alignment training process, and i can imagine ways of using steering vectors that Claude would find rather inconsiderate. this isn’t an overriding consideration, but i do think it’s non-trivial.