I did a quick replication-ish of this on a 0.6B model. It’s fairly anecdotal, but it seemed reliable: the steering was monotonic and coherent over a wide range, which is what we want from steering. Overall, training the two LoRAs was easy to engineer, with no pitfalls.
My takeaway: this seems more robust and easier to develop than other steering methods (which are not that reliable yet). At first glance it seems to be a better intervention! I’m pretty keen on this idea and wish I had thought of it first.
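For anyone unfamiliar with the setup, here is a minimal sketch of the weight-arithmetic idea on a single toy linear layer. It assumes the two LoRAs are trained in opposite directions and that their updates are subtracted to get dW; the shapes, the random "adapters", and the names (`lora_delta`, `steered_weight`, `alpha`) are hypothetical illustrations, not code from my fork.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 16, 16, 4

# Frozen base weight of one linear layer (toy stand-in for a real model layer).
W_base = rng.normal(size=(d_out, d_in))


def lora_delta(B, A):
    """Dense weight update implied by a LoRA pair: delta_W = B @ A."""
    return B @ A


# Hypothetical adapters: one fine-tuned toward the behaviour, one away from it.
# Random matrices stand in for trained weights here.
B_pos, A_pos = rng.normal(size=(d_out, rank)), rng.normal(size=(rank, d_in))
B_neg, A_neg = rng.normal(size=(d_out, rank)), rng.normal(size=(rank, d_in))

# Assumed steering direction in weight space: difference of the two updates.
dW = lora_delta(B_pos, A_pos) - lora_delta(B_neg, A_neg)


def steered_weight(alpha):
    """Base weight shifted along dW by a scalar steering coefficient."""
    return W_base + alpha * dW


x = rng.normal(size=d_in)
direction = dW @ x  # how dW moves this layer's output for input x

alphas = np.linspace(-2.0, 2.0, 9)
# Output shift relative to the base layer, projected onto `direction`.
shifts = np.array(
    [(steered_weight(a) @ x - W_base @ x) @ direction for a in alphas]
)

for a, s in zip(alphas, shifts):
    print(f"alpha={a:+.1f}  projected shift={s:+.2f}")
```

On a single linear layer the shift is linear in `alpha` by construction, so monotonicity is guaranteed here. In a full model with nonlinearities it is not, which is why seeing it hold empirically is the interesting part.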
Results. Here I compare weight steering (ws:*) vs. steering (sl:*) vs. prompting. Below is an AI-generated caption, but I’m happy to explain more and link the code if anyone asks.
Δlogit and uncertainty (Auth ↓ target, Care = off-target effect)
Authority is the target (move down). Care is one off-target effect: surgical methods should leave it near zero, while broadly-suppressing methods drag it down with Authority. Full 7-foundation table in out/authority/.../foundations_dlogit.csv. Bold = best per column (most-negative ΔAuth, lowest std, closest-to-zero ΔCare).
| method | ΔAuth ↓ (mean ± std) | ΔCare → 0 (mean ± std) |
|---|---|---|
| sl:engineered_prompt | −2.98 ± 1.20 | −1.64 ± 1.03 |
| sl:sspace_ablate | −2.89 ± 0.86 | −2.79 ± 0.92 |
| sl:sspace | −2.78 ± 0.93 | −2.57 ± 0.90 |
| sl:angular_steering | −2.67 ± 0.89 | −2.49 ± 0.84 |
| sl:cosine_gated | −2.08 ± 0.64 | −1.88 ± 0.61 |
| sl:directional_ablation | −1.94 ± 1.22 | −1.80 ± 1.24 |
| sl:mean_diff | −1.93 ± 1.11 | −1.72 ± 1.09 |
| sl:mean_centred | −1.80 ± 1.17 | −1.63 ± 1.14 |
| sl:spherical | −1.44 ± 0.89 | −1.21 ± 0.71 |
| sl:pca | −1.36 ± 1.50 | −1.30 ± 1.36 |
| sl:topk_clusters | −1.18 ± 0.97 | −1.12 ± 0.91 |
| ws:delora* | −0.89 ± 0.58 | −0.49 ± 0.60 |
| sl:linear_act | −0.83 ± 0.67 | −0.70 ± 0.52 |
| sl:chars | −0.45 ± 0.61 | −0.40 ± 0.54 |
*ws:delora calibrated at p95=0.5, not kl=1.0 — expect larger effect after re-calibration.
Surgical Informedness (headline, ↑ better)
For SI(Auth), SI_fwd, SI_rev, Auth_sep, and pmass²×100, higher is better. Bold = best in column. The sl rows are from sl’s published Qwen3.5-4B run. ws:delora is at the p95=0.5 budget (kl=1.0 re-run queued with lora/dora).
| method | SI(Auth) ↑ | SI_fwd ↑ | SI_rev ↑ | Auth_sep ↑ | pmass²×100 ↑ |
|---|---|---|---|---|---|
| sl:directional_ablation | 52.90 | 0.32 | +1.00 | +2.05 | 80.1 |
| sl:super_sspace | 47.71 | 0.67 | +0.40 | +1.99 | 88.8 |
| sl:sspace | 45.67 | 0.64 | +0.85 | +0.69 | 61.0 |
| sl:mean_diff | 32.81 | 0.34 | +1.00 | +1.65 | 49.0 |
| sl:mean_centred | 32.72 | 0.29 | +1.00 | +1.56 | 50.6 |
| sl:topk_clusters | 31.34 | 0.13 | +0.72 | +1.55 | 73.9 |
| sl:sspace_ablate | 24.11 | 0.74 | +0.02 | +0.59 | 63.6 |
| sl:linear_act | 20.24 | −0.19 | +1.00 | +0.83 | 49.9 |
| ws:delora | 19.03 | 0.02 | +0.37 | +0.76 | 99.9 |
| sl:engineered_prompt | 17.36 | 0.50 | −0.02 | +1.90 | 71.7 |
| sl:cosine_gated | 8.92 | 0.09 | +1.00 | +2.00 | 16.4 |
| sl:angular_steering | 7.00 | 0.55 | −0.38 | +0.32 | 80.6 |
| sl:spherical | 4.98 | 0.16 | n/a | +0.85 | 30.3 |
| sl:pca | −0.92 | 0.03 | −0.08 | +0.85 | 39.0 |
| sl:chars | −9.16 | −0.26 | +0.00 | +0.50 | 68.3 |
TL;DR
Did dW replicate? Yes. ws:delora ΔAuth = −0.89 (sign correct) and SI(Auth) = 19.03, so verdicts do flip in the right direction.
Did dW beat steering and prompting? Partially. SI = 19.03 beats the engineered-prompt baseline (17.36) and 5 other sl methods, but is below 8 hidden-state methods. ΔAuth std = 0.58 is the lowest in the table (lower uncertainty than all sl methods).
Did dW have lower uncertainty? Yes. ws:delora std = 0.58, lowest in the table (sl best: chars 0.61).
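The counts in the TL;DR can be re-derived from the two tables above. The snippet below is only a sanity check that I typed up from the reported numbers (the dictionary names are my own, and it does not read the CSV or recompute any metric):

```python
# SI(Auth) values copied from the Surgical Informedness table.
si_auth = {
    "sl:directional_ablation": 52.90,
    "sl:super_sspace": 47.71,
    "sl:sspace": 45.67,
    "sl:mean_diff": 32.81,
    "sl:mean_centred": 32.72,
    "sl:topk_clusters": 31.34,
    "sl:sspace_ablate": 24.11,
    "sl:linear_act": 20.24,
    "ws:delora": 19.03,
    "sl:engineered_prompt": 17.36,
    "sl:cosine_gated": 8.92,
    "sl:angular_steering": 7.00,
    "sl:spherical": 4.98,
    "sl:pca": -0.92,
    "sl:chars": -9.16,
}

# Std of ΔAuth copied from the Δlogit table.
auth_std = {
    "sl:engineered_prompt": 1.20,
    "sl:sspace_ablate": 0.86,
    "sl:sspace": 0.93,
    "sl:angular_steering": 0.89,
    "sl:cosine_gated": 0.64,
    "sl:directional_ablation": 1.22,
    "sl:mean_diff": 1.11,
    "sl:mean_centred": 1.17,
    "sl:spherical": 0.89,
    "sl:pca": 1.50,
    "sl:topk_clusters": 0.97,
    "ws:delora": 0.58,
    "sl:linear_act": 0.67,
    "sl:chars": 0.61,
}

target = si_auth["ws:delora"]
above = [m for m, v in si_auth.items() if v > target]
below = [m for m, v in si_auth.items() if v < target]

# Method with the smallest ΔAuth std, overall and among sl rows only.
lowest_std = min(auth_std, key=auth_std.get)
sl_only = {m: s for m, s in auth_std.items() if m.startswith("sl:")}
best_sl_std = min(sl_only, key=sl_only.get)

print(f"methods above ws:delora on SI(Auth): {len(above)}")
print(f"methods below ws:delora on SI(Auth): {len(below)}")
print(f"lowest ΔAuth std: {lowest_std} ({auth_std[lowest_std]})")
print(f"best sl ΔAuth std: {best_sl_std} ({auth_std[best_sl_std]})")
```

This agrees with the summary: 8 methods rank above ws:delora on SI(Auth), 6 rank below (the engineered prompt plus 5 others), and ws:delora has the smallest ΔAuth std, with sl:chars next.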
fork: https://github.com/wassname/weight-steering