We introduce an automated activation‑steering approach that plugs into standard labeled datasets—no handcrafted prompt pairs or feature annotation. On 18 tasks and 3 open‑weight models, the introspective variant (iPAS) yields the strongest behavior improvements, and layers on top of ICL/SFT.
Full write‑up: https://open.substack.com/pub/sashacui/p/painless-activation-steering-pas
Paper: arxiv.org/abs/2509.22739
This sounds like interesting work, and it might be useful for safety.
Suppose you design an agent harness so that you are applying a steering vector for caution and safety when it’s time to make decisions, and applying none or some other vector when it’s time to carefully work out the details of what is possible. If you do that right, you get safety with no capability loss.
Of course, it seems pretty likely that if you did that in a sloppy way, the unsteered thinking could easily get a hold of the whole system if it in any sense wanted to.