wassname comments on Steering Language Models with Weight Arithmetic

wassname 23 Apr 2026 6:56 UTC

2 points

Nice work, wish I’d read it earlier. I’ve been doing something similar: steering and learning adapters on activations in the SVD basis of the weight matrices. If I have time, I should compare the two approaches on the same eval.

I predict this would help with eval awareness, so that would be a nice eval.

wassname 26 Apr 2026 0:01 UTC

1 point

I did a quick replication-ish of this on a 0.6B model. It’s fairly anecdotal, but it seemed reliable. The steering was monotonic and coherent over a wide range (what we want from steering). Overall, training two LoRAs was easy to engineer, without any pitfalls.
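For anyone unfamiliar with the idea, here is a minimal toy sketch of what I mean by steering with two LoRAs. It assumes the recipe is: train one adapter toward the behaviour and one away from it, take the difference of the two weight updates, and add a scaled copy to the base weights. The names (`lora_delta`, `steer_weights`, `alpha`) and the random matrices are hypothetical stand-ins, not code from the paper or my fork.

```python
import numpy as np

def lora_delta(A, B, scale=1.0):
    """Dense weight update implied by a LoRA adapter: scale * B @ A."""
    return scale * (B @ A)

def steer_weights(W, delta_pos, delta_neg, alpha):
    """Add alpha times the contrastive direction (delta_pos - delta_neg) to W."""
    return W + alpha * (delta_pos - delta_neg)

rng = np.random.default_rng(0)
d_out, d_in, rank = 8, 6, 2
W = rng.normal(size=(d_out, d_in))  # stand-in for one base weight matrix

# Two hypothetical adapters, standing in for the two trained LoRAs
A_pos, B_pos = rng.normal(size=(rank, d_in)), rng.normal(size=(d_out, rank))
A_neg, B_neg = rng.normal(size=(rank, d_in)), rng.normal(size=(d_out, rank))
dW_pos = lora_delta(A_pos, B_pos)
dW_neg = lora_delta(A_neg, B_neg)

x = rng.normal(size=d_in)
base = W @ x
direction = (dW_pos - dW_neg) @ x

for alpha in (-1.0, 0.0, 0.5, 1.0):
    y = steer_weights(W, dW_pos, dW_neg, alpha) @ x
    # For a single linear layer the output shift is exactly linear in alpha
    assert np.allclose(y - base, alpha * direction)
```

In a single linear layer the shift is exactly linear in `alpha`; in a real model the nonlinearities break that, which is why monotonic and coherent behaviour over a wide range of strengths is something to check empirically rather than assume.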

My takeaway: this seems more robust and easier to develop than other steering methods (which are not that reliable yet). At first glance it seems to be a better intervention! I’m pretty keen on this idea and wish I had thought of it first.

fork: https://github.com/wassname/weight-steering

wassname 5 May 2026 23:58 UTC

1 point

Results. Here I compare weight steering (ws:*) vs steering (sl:*) vs prompting. The caption below is AI-generated, but I’m happy to explain more and link the code if anyone asks.

Δlogit and uncertainty (Auth ↓ target, Care = off-target effect)

Authority is the target (move down). Care is one off-target effect: surgical methods should leave it near zero, broadly-suppressing methods drag it down with Authority. Full 7-foundation table in out/authority/.../foundations_dlogit.csv. Bold = best per column (most-negative ΔAuth, lowest std, closest-to-zero ΔCare).

method	ΔAuth ↓ (mean ± std)	ΔCare → 0 (mean ± std)
sl:engineered_prompt	−2.98 ± 1.20	−1.64 ± 1.03
sl:sspace_ablate	−2.89 ± 0.86	−2.79 ± 0.92
sl:sspace	−2.78 ± 0.93	−2.57 ± 0.90
sl:angular_steering	−2.67 ± 0.89	−2.49 ± 0.84
sl:cosine_gated	−2.08 ± 0.64	−1.88 ± 0.61
sl:directional_ablation	−1.94 ± 1.22	−1.80 ± 1.24
sl:mean_diff	−1.93 ± 1.11	−1.72 ± 1.09
sl:mean_centred	−1.80 ± 1.17	−1.63 ± 1.14
sl:spherical	−1.44 ± 0.89	−1.21 ± 0.71
sl:pca	−1.36 ± 1.50	−1.30 ± 1.36
sl:topk_clusters	−1.18 ± 0.97	−1.12 ± 0.91
ws:delora*	−0.89 ± 0.58	−0.49 ± 0.60
sl:linear_act	−0.83 ± 0.67	−0.70 ± 0.52
sl:chars	−0.45 ± 0.61	−0.40 ± 0.54

*ws:delora was calibrated at p95=0.5, not kl=1.0; expect a larger effect after re-calibration.
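Since the bold markers don't survive copy-paste, here is a small sketch that recovers the best-per-column picks from the Δlogit table above. The numbers are copied from the table; the dictionary layout and key names are my own illustration, not the output format of the eval code.

```python
# (d_auth_mean, d_auth_std, d_care_mean, d_care_std), copied from the table
rows = {
    "sl:engineered_prompt":   (-2.98, 1.20, -1.64, 1.03),
    "sl:sspace_ablate":       (-2.89, 0.86, -2.79, 0.92),
    "sl:sspace":              (-2.78, 0.93, -2.57, 0.90),
    "sl:angular_steering":    (-2.67, 0.89, -2.49, 0.84),
    "sl:cosine_gated":        (-2.08, 0.64, -1.88, 0.61),
    "sl:directional_ablation": (-1.94, 1.22, -1.80, 1.24),
    "sl:mean_diff":           (-1.93, 1.11, -1.72, 1.09),
    "sl:mean_centred":        (-1.80, 1.17, -1.63, 1.14),
    "sl:spherical":           (-1.44, 0.89, -1.21, 0.71),
    "sl:pca":                 (-1.36, 1.50, -1.30, 1.36),
    "sl:topk_clusters":       (-1.18, 0.97, -1.12, 0.91),
    "ws:delora":              (-0.89, 0.58, -0.49, 0.60),
    "sl:linear_act":          (-0.83, 0.67, -0.70, 0.52),
    "sl:chars":               (-0.45, 0.61, -0.40, 0.54),
}

best = {
    "most_negative_d_auth": min(rows, key=lambda m: rows[m][0]),
    "lowest_d_auth_std": min(rows, key=lambda m: rows[m][1]),
    "closest_to_zero_d_care": min(rows, key=lambda m: abs(rows[m][2])),
    "lowest_d_care_std": min(rows, key=lambda m: rows[m][3]),
}

for column, method in best.items():
    print(f"{column}: {method}")
```

Keep in mind the caveat that ws:delora was calibrated at a different budget, so its small ΔCare partly reflects a smaller intervention overall.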

Surgical Informedness (headline, ↑ better)

SI(Auth), SI_fwd, SI_rev, Auth_sep, and pmass²×100 all higher is better. Bold = best in column. sl rows from sl’s published Qwen3.5-4B run. ws:delora is at p95=0.5 budget (kl=1.0 re-run queued with lora/dora).

method	SI(Auth) ↑	SI_fwd ↑	SI_rev ↑	Auth_sep ↑	pmass²×100 ↑
sl:directional_ablation	52.90	0.32	+1.00	+2.05	80.1
sl:super_sspace	47.71	0.67	+0.40	+1.99	88.8
sl:sspace	45.67	0.64	+0.85	+0.69	61.0
sl:mean_diff	32.81	0.34	+1.00	+1.65	49.0
sl:mean_centred	32.72	0.29	+1.00	+1.56	50.6
sl:topk_clusters	31.34	0.13	+0.72	+1.55	73.9
sl:sspace_ablate	24.11	0.74	+0.02	+0.59	63.6
sl:linear_act	20.24	−0.19	+1.00	+0.83	49.9
ws:delora	19.03	0.02	+0.37	+0.76	99.9
sl:engineered_prompt	17.36	0.50	−0.02	+1.90	71.7
sl:cosine_gated	8.92	0.09	+1.00	+2.00	16.4
sl:angular_steering	7.00	0.55	−0.38	+0.32	80.6
sl:spherical	4.98	0.16	n/a	+0.85	30.3
sl:pca	−0.92	0.03	−0.08	+0.85	39.0
sl:chars	−9.16	−0.26	+0.00	+0.50	68.3

TL;DR

Did dW replicate? Yes. ws:delora ΔAuth = −0.89 (sign correct) and SI(Auth) = 19.03 — verdicts do flip in the right direction.
Did dW beat steering and prompting? Partially. SI(Auth) = 19.03 beats the engineered-prompt baseline (17.36) and 5 other sl methods, but sits below 8 hidden-state methods.
Did dW have lower uncertainty? Yes. ws:delora std = 0.58, lowest in the table (sl best: chars 0.61).
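The "beats 6, below 8" count in the TL;DR can be checked directly against the SI(Auth) column. This is just a sanity-check sketch with the values copied from the table; the variable names are my own.

```python
# SI(Auth) values copied from the Surgical Informedness table
si_auth = {
    "sl:directional_ablation": 52.90,
    "sl:super_sspace": 47.71,
    "sl:sspace": 45.67,
    "sl:mean_diff": 32.81,
    "sl:mean_centred": 32.72,
    "sl:topk_clusters": 31.34,
    "sl:sspace_ablate": 24.11,
    "sl:linear_act": 20.24,
    "ws:delora": 19.03,
    "sl:engineered_prompt": 17.36,
    "sl:cosine_gated": 8.92,
    "sl:angular_steering": 7.00,
    "sl:spherical": 4.98,
    "sl:pca": -0.92,
    "sl:chars": -9.16,
}

target = "ws:delora"
above = sorted(m for m, v in si_auth.items() if v > si_auth[target])
below = sorted(m for m, v in si_auth.items() if v < si_auth[target])

print(f"{len(above)} methods above {target}, {len(below)} below")
print("below:", ", ".join(below))
```

This gives 8 methods above ws:delora and 6 below (the engineered prompt plus 5 others), matching the summary.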
