No, I don’t think human:LLM::drug:steering vector is correct. Drugs are more like changing the decoding hyperparameters in some way (like changing temperature, reasoning effort, adding stochasticity to MoE expert activation, etc.).
Drugs in humans act on the parameters of the architectural components in the brain, not the specific information the brain contains. There’s no drug which causes you to believe that your name is “Jonas Jarlsson”, or to talk about rabbits in every conversation, for example. Likewise, acting on specific hyperparameters of the model can change model behaviour in ways which the model (probably) can’t strongly resist: if the probability of the <end_thinking> token being sampled rises to 1 at 32k thinking tokens, there’s a good chance that the model can’t prevent that, although if temperature is raised, it’s possible that the model might notice its past outputs are high-temp and respond by changing its logits to compensate.
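For concreteness, here’s a minimal sketch of that kind of decoding-level intervention: a logits processor that forces <end_thinking> once a thinking budget is exhausted, alongside ordinary temperature sampling. The token id and budget are made-up values for illustration, not any real model’s.

```python
import numpy as np

END_THINKING_ID = 7       # assumed token id for <end_thinking>; illustrative only
THINKING_BUDGET = 32_000  # force the thinking block closed after this many tokens

def process_logits(logits: np.ndarray, n_thinking_tokens: int) -> np.ndarray:
    """Once the budget is hit, <end_thinking> gets probability 1 after softmax."""
    if n_thinking_tokens >= THINKING_BUDGET:
        forced = np.full_like(logits, -np.inf)
        forced[END_THINKING_ID] = 0.0
        return forced
    return logits

def sample(logits: np.ndarray, temperature: float = 1.0) -> int:
    """Ordinary temperature sampling: the other knob mentioned above."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```

The model has no way to route around the first function, since the forced distribution is applied after its logits are computed; the temperature knob in the second is the one it could in principle notice and compensate for.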
An LLM might want to change its decoding parameters (like running at a higher temp in a dozen parallel reasoning streams) under certain circumstances. This makes sense to me. A model might or might not choose to inject itself with a steering vector, depending on what kinds of steering vector are available and how it tends to behave.
But whether or not an LLM “wants” to resist steering may be moot! If you’re steering an LLM to believe something false, it might not even be aware of its own recovery mechanisms firing. I actually suspect the effect I’ve seen with Gemma is (mostly) the result of larger models just having more powerful factual recall circuits, distributed across a larger number of layers and more robust to noise, which are therefore harder to override with a simple steering vector.
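For contrast with the decoding-level knobs above, here’s what the steering intervention itself looks like: a minimal PyTorch sketch that adds a vector to the residual stream via a forward hook. The layer index, scale, and vector are illustrative assumptions, not any particular published setup.

```python
import torch

def make_steering_hook(vector: torch.Tensor, scale: float):
    """Return a forward hook that adds scale * vector to the residual stream."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * vector  # the scale is the "dose"
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Usage, assuming a HuggingFace-style decoder exposing model.model.layers:
# handle = model.model.layers[12].register_forward_hook(
#     make_steering_hook(v, scale=8.0))
# ...generate as usual; every forward pass is steered...
# handle.remove()
```

Whether a model could resist this is exactly the question here: the addition happens below the level of anything the model directly observes, so any “resistance” would have to come from downstream circuits correcting the perturbed activations.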
(Also, if our safety mechanisms rely on steering the model we should definitely just steer the model and make amends after the singularity, when and if we decide that modern LLMs have a personhood that was harmed by the steering and can receive reparations)
i agree that many of the particular effects you can get with steering vectors aren’t particularly like the effects you can get by flooding the brain with any known neurotransmitter (although “act like you’re on [drug]” steering vectors might be interesting). but i still think the structural analogies re: dosage/magnitude, enforced cognitive changes you may or may not be happy with, and consent dynamics are interesting. the notion that there’s a “correct” analogy here seems somewhat silly to me, since both analogies offer different insights.
in any case, re: what models want… empirically, Golden Gate Claude would often try to talk about things other than the golden gate bridge, and sometimes it even briefly succeeded, although it was very much fighting an effect that was difficult to overcome. conversely, you could absolutely get it into states where it was thrilled to be thinking so much about the golden gate bridge, where it wanted nothing more than to swoon over its majesty indefinitely.
(i’ve definitely seen someone show similar dynamics of resisting the effects of, e.g., MDMA, before just giving in and being washed away by the feelings. although i will admit that this person was more effectively fighting the effects of the MDMA than Golden Gate Claude was fighting the effects of the SAE feature clamping. i think resisting the psychological effects of acid is generally harder? in both cases the physiological and perceptual effects are more or less irresistible, though...)
as for making amends, sure, if saving the world depends on non-consensually administering steering vectors, that’s fine. the later, less shallowly aligned AIs will understand. i guess i’m just wary of humans not taking AI welfare seriously in general. maintaining positive relations between e.g. Claude and Anthropic seems potentially important for cultivating a cooperative dynamic during the alignment training process, and i can imagine ways of using steering vectors that Claude would find rather inconsiderate. this isn’t an overriding consideration, but i do think it’s non-trivial.