Painless Activation Steering

Sasha Cui, 15 Feb 2026 17:49 UTC

14 points

We introduce an automated activation‑steering approach that plugs into standard labeled datasets, with no handcrafted prompt pairs or feature annotation required. Across 18 tasks and 3 open‑weight models, the introspective variant (iPAS) yields the strongest behavior improvements, and it can be layered on top of ICL/SFT.
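For readers new to activation steering, here is a minimal sketch of one common way to get a steering vector from labeled data: take the difference between the mean activations of the two label groups and add a scaled copy of it to a hidden state. This is not claimed to be the PAS or iPAS procedure, which the paper describes. The function names, the `strength` value, and the synthetic "activations" are all assumptions made for illustration.

```python
import math
import random


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(activations, labels):
    """Unit vector pointing from the label-0 mean to the label-1 mean."""
    pos = mean_vector([a for a, y in zip(activations, labels) if y == 1])
    neg = mean_vector([a for a, y in zip(activations, labels) if y == 0])
    diff = [p - q for p, q in zip(pos, neg)]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def steer(hidden, vector, strength=4.0):
    """Add the scaled steering vector to a single hidden state."""
    return [h + strength * v for h, v in zip(hidden, vector)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Synthetic stand-in for model activations (an assumption, not real model data):
# label-1 examples are shifted along dimension 0.
random.seed(0)
DIM = 8
labels = [i % 2 for i in range(200)]
activations = [
    [random.gauss(0.0, 1.0) + (3.0 if (y == 1 and j == 0) else 0.0) for j in range(DIM)]
    for y in labels
]

vector = steering_vector(activations, labels)
before = activations[0]  # a label-0 example
after = steer(before, vector)

# Because the vector has unit length, the projection onto it moves by `strength`.
shift = dot(after, vector) - dot(before, vector)
print(shift)
```

In a real model the vectors would be hidden states read from a chosen layer, and the addition would happen inside the forward pass rather than on Python lists.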

Full write‑up: https://open.substack.com/pub/sashacui/p/painless-activation-steering-pas

Paper: arxiv.org/abs/2509.22739

Seth Herd 21 Feb 2026 15:43 UTC
2 points
This sounds like interesting work, and it might be useful for safety.

Suppose you design an agent harness so that you are applying a steering vector for caution and safety when it’s time to make decisions, and applying none or some other vector when it’s time to carefully work out the details of what is possible. If you do that right, you get safety with no capability loss.

Of course, it seems pretty likely that if you did that in a sloppy way, the unsteered thinking could easily get ahold of the whole system if it in any sense wanted to.
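The phase-dependent steering idea in this comment could be sketched roughly as follows. Everything here is hypothetical: the phase names, the `CAUTION_VECTOR` values, the strengths, and the choice to fall back to the decision-phase steering when a phase is not recognized. It only illustrates the switching logic, not a real agent harness or any safety guarantee.

```python
# Hypothetical steering vector for "caution"; in practice this would be
# derived from model activations rather than written by hand.
CAUTION_VECTOR = [0.0, 1.0, 0.0]

# Hypothetical mapping from harness phase to (vector, strength).
# None means the phase runs unsteered.
PHASE_STEERING = {
    "decide": (CAUTION_VECTOR, 2.0),
    "plan": None,
}


def apply_steering(hidden, vector, strength):
    """Add the scaled steering vector to a single hidden state."""
    return [h + strength * v for h, v in zip(hidden, vector)]


def steer_for_phase(hidden, phase):
    """Steer according to the current phase.

    Unknown phases fall back to the "decide" configuration, so a mislabeled
    step is steered toward caution rather than left unsteered (a design
    choice of this sketch).
    """
    config = PHASE_STEERING.get(phase, PHASE_STEERING["decide"])
    if config is None:
        return list(hidden)
    vector, strength = config
    return apply_steering(hidden, vector, strength)


hidden_state = [1.0, 1.0, 1.0]
planned = steer_for_phase(hidden_state, "plan")      # left unchanged
decided = steer_for_phase(hidden_state, "decide")    # pushed along the caution vector
fallback = steer_for_phase(hidden_state, "unknown")  # treated like "decide"
print(planned, decided, fallback)
```

The fallback is one small guard against the sloppy-implementation worry above; it does not address the case where unsteered planning influences later decisions through its outputs.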
- Sasha Cui 22 Mar 2026 21:47 UTC
  1 point
  That’s totally right, Seth, and applying this conditionally is the way to go. See https://arxiv.org/abs/2409.05907 for an early example of this line of work.