Independent Researcher, Perth, Australia
wassname
I did a quick replication-ish of this on a 0.6B model. It’s fairly anecdotal, but it seemed reliable. The steering was monotonic and coherent over a wide range (which is what we want from steering). Overall, training two LoRAs was easy to engineer, without any pitfalls.
My takeaway: this seems more robust and easier to develop than other steering (which is not that reliable yet). At first glance it seems to be a better intervention! I’m pretty keen on this idea and wish I had thought of it first.
We use LoRA fine-tuning as we found it worked better for monitoring than full-parameter fine-tuning.
This is interesting! It means that LoRA was a better inductive bias for the difference. See my post on other LoRA variants that use SVD, rotations, magnitude / direction decoupling. Some of them seem data efficient and generalise (like this work), so I would predict using PiSSA, SSVD, DeLoRA, or OFT might generalise better than LoRA and with fewer side effects.
Nice work, wish I’d read it earlier. I’ve been doing something similar: steering and learning adapters on activations in the SVD basis of the weight matrices. If I have time I should compare these two approaches on the same eval.
I predict this would help with eval awareness, so that would be a nice eval.
Thank you for being open and sharing code. This looks like normal activation steering except I’d strongly suggest using more layers. And I’d weakly suggest more diverse questions, and you can check out https://github.com/vgel/repeng which is popular and easy to use—although very similar to what you wrote.
If you want to try s-space steering you can copy my code here. Happy to collaborate if you want to try it similarly.
According to AxBench, current activation-addition steering doesn’t work better than an engineered prompt. But I have hope for other forms of steering that intervene on a better subspace.
I’m working on self-supervised S-space steering (AntiPaSTO), which steers via gradients in SVD weight-space rather than activation-space. It outperforms activation steering when applied to eval-awareness. Early days, but because it targets modes of behaviour in the pretrained weights rather than the activation stream, I expect it to scale better. I tested across Gemma-3 270M, 1B, 4B, and 12B and didn’t see a clear scaling wall (Table 12). Now you have me curious to try larger models.
Also note that other people have novel steering e.g. CHaRS (no code), selective steering, and more. And others have looked at the limits of steering e.g..
I usually use RepEng as a baseline at every layer from 30% to 80% of the depth. This is because the layers near the output seem to align to surface style, and papers like “Do llamas think in English” support this. But in the above link, I replicate a paper, and that paper uses every layer and CHaRS-style cosine gating too.
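A minimal sketch of picking that 30% to 80% band of layers. It assumes the percentages refer to fractional depth and that bounds are floored to integers; `steering_layers` is a hypothetical helper, not part of RepEng.

```python
def steering_layers(num_layers: int, lo: float = 0.3, hi: float = 0.8) -> list[int]:
    """Return layer indices covering the [lo, hi) fraction of model depth.

    Flooring both bounds is an assumption; other rounding choices are
    equally defensible and shift the band by at most one layer.
    """
    start = int(num_layers * lo)
    stop = int(num_layers * hi)
    return list(range(start, stop))


# Example: a hypothetical 28-layer model.
layers = steering_layers(28)
print(layers[0], layers[-1], len(layers))
```

The resulting list can be passed to whatever steering library is in use as the set of layers to intervene on.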
Actually one other thing to consider is calibration: ideally we get the maximum steering effect we can within a given performance degradation budget. But how do you know two steering methods are calibrated well? So maybe my S-space steering was just better calibrated on this model / setting.
Hawthorne gap style setups definitely are a good approach. That could be a nice follow-up here, to compare behaviour on more/less realistic versions of the same eval.
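One way to make the calibration idea concrete is to search for the largest steering coefficient whose degradation stays within the budget, then compare methods at their calibrated coefficients. The sketch below is only illustrative: `calibrate` is a hypothetical helper, the two degradation curves are toy stand-ins (in practice this would be something like a measured KL divergence from the unsteered model), and it assumes degradation grows monotonically with the coefficient.

```python
from typing import Callable


def calibrate(degradation: Callable[[float], float], budget: float,
              hi: float = 16.0, tol: float = 1e-3) -> float:
    """Largest coefficient in [0, hi] whose degradation stays within budget.

    Assumes degradation(0) is within budget and that degradation increases
    monotonically with the coefficient (only approximately true in practice).
    """
    if degradation(hi) <= budget:
        return hi
    lo = 0.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if degradation(mid) <= budget:
            lo = mid  # still within budget, push further
        else:
            hi = mid  # over budget, back off
    return lo


# Toy stand-ins for two steering methods with different sensitivity.
def toy_kl_a(c: float) -> float:
    return 0.25 * c ** 2


def toy_kl_b(c: float) -> float:
    return 4.0 * c ** 2


coef_a = calibrate(toy_kl_a, budget=1.0)
coef_b = calibrate(toy_kl_b, budget=1.0)
print(coef_a, coef_b)
```

With both methods pinned to the same budget, any remaining difference in steering effect is less likely to be an artefact of one method simply being pushed harder.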
Yeah, it would be useful to know, right now we have to guess, and they might be complementary.
If you want to collaborate on follow-ups, I’d be keen.
By the way I wrote up why I think the singular value space is a better target for steering here
When we steer pretrained transformers, we modify how a layer behaves. The most commonly used type of steering is activation steering, which adds a constant bias to the activations, changing the input to the layer. S-space steering instead modifies the transformation the layer applies to its inputs by reweighting the learned singular values of the weight matrix. This is a different kind of intervention: it changes how the layer processes inputs rather than directly nudging the activation output.
What kind of steering do you think Anthropic is using? I’m assuming it’s just activation addition, the same as most people use?
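To make the contrast concrete, here is a toy NumPy sketch on a single linear layer. It is an illustration under assumptions, not the actual S-space implementation: the steering vector and the per-singular-value adjustments `delta` are random placeholders (in practice they would be learned), and the multiplicative form `S * (1 + coeff * delta)` is one assumed way to reweight singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 8, 6
W = rng.normal(size=(d_out, d_in))  # toy layer weight, y = W @ x
x = rng.normal(size=d_in)

# Activation steering: add a constant vector to the layer's input.
steer_vec = rng.normal(size=d_in)  # placeholder direction


def activation_steer(x: np.ndarray, coeff: float) -> np.ndarray:
    return W @ (x + coeff * steer_vec)


# S-space steering: keep U and V fixed, rescale the singular values.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
delta = rng.normal(size=S.shape)  # placeholder per-singular-value adjustment


def s_space_steer(x: np.ndarray, coeff: float) -> np.ndarray:
    S_steered = S * (1.0 + coeff * delta)
    return U @ (S_steered * (Vt @ x))


# The activation shift is the same for every input; the S-space shift
# scales with the input because the transformation itself has changed.
act_shift = activation_steer(x, 1.0) - W @ x
s_shift = s_space_steer(x, 1.0) - W @ x
s_shift_2x = s_space_steer(2 * x, 1.0) - W @ (2 * x)
print(np.allclose(s_shift_2x, 2 * s_shift))
```

At `coeff = 0` both functions reduce to the original layer, so the coefficient acts as a dial away from the pretrained behaviour in either case.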
Or the Hawthorne effect setup, which brings out eval awareness in 32b+ models https://arxiv.org/pdf/2505.14617.
Thank you for this. If it’s used in system cards it’s very good to have it replicated.
I had some success with S-space steering, and it seems stronger than activation steering. See my report here https://apartresearch.com/project/sspace-steering-for-evalawareness-control-in-reasoning-models-7j1i
I also think the Hawthorne gap setup is better because verbalised eval awareness is not the same as behavioural eval awareness.
I could replicate this for the main model. As all the data and code was uploaded, it was easy to do in an Apart Research hackathon (link). After reading and replicating it, I think it’s a solid methodology and I believe the result.
It’s also the best setup I could find right now for testing eval awareness on medium-sized open models.
Aligned to the leviathan or the citizen?
There’s a thing people in AI safety leave unspoken: if we do align AI successfully (far from a given), we still have the problem of who it’s aligned to.
After nature, governments have been responsible for the largest death counts in human history through war and famine:
WWII: 35-118M
Mongol conquests: 40-80M (Genghis Khan, Kublai Khan, Timur)
Mao Zedong: 14-80M (including the Great Leap Forward famine)
Taiping Rebellion: 20-30M
Stalin: 9-43M (including the Holodomor)
The thing that has historically restrained governments during crises, wars, and swings toward extremism is that citizens are necessary. You need people to run the factories, fight the wars, grow the food, operate the bureaucracy. This gives populations leverage even under authoritarian rule, and it’s a big part of why democracies emerged at all.
AI changes that. With AI police, AI managers, AI workers, and AI soldiers, some of the worst episodes in human history would have played out very differently. A government that doesn’t need its citizens for labour or warfare has much less reason to keep them happy, or alive. The balance of power shifts in a way we haven’t seen before.
Most “pause AI” advocacy doesn’t mention pausing or monitoring government military or intelligence work, but it should. Most safety orgs are hesitant to say this because they want to keep working with governments. We are just starting to talk about it but often use euphemisms. We say “coups” or “dictators” and never mention that our own government is at risk, and it’s the only one we have a vote in.
The AI should be aligned with people and norms, not individuals or positions of power. This can be a Schelling point if we just get it within the Overton window.
That makes sense, probably the majority are in this camp.
That is very useful thanks, I’ll give it a rewrite in that vein.
That is true, it doesn’t. But if it limits accounts to unique persons, then we will only need to ban each person once, rather than unlimited times. So that solves part of the problem, but not all.
And I would hope we go on novel content, not who wrote it (which we can somewhat measure already, here’s a repo that doesn’t work well but spells out the idea https://github.com/wassname/detect_bs_text). So that a human needs to be responsible for what they post.
Right now we likely use email addresses as unique people, but often people will have many email addresses and are able to get.
I think it’s plausible that NIST will play an important role in the US government’s response to AI in the future
This might be why he sticks around, I certainly can’t think of any other reason. He must also think this chance outweighs the opportunity cost of working with ARC or similar.
(unless it’s health problems or some other personal trouble)
One solution is to integrate a proof-of-humanity type ID. These are in many ways better than centralised government IDs, and it’s the kind of thing LessWrong might be able to take the lead on.
They sound plausible at a glance, but usually don’t explain the specific mechanism for why their experiment should be interesting, or fit into the LW conversation.
Please consider false positives here, we don’t want to waste our time, but we also don’t want to exclude novel work by people outside our network. What normally happens is we fall back on older and more robust algorithms like “who we know”.
An an example, would you consider this post to fit into this category?
I ask because it’s real work, with AI-assisted write up, and I’m in the category where “AI is so much better than me, it would feel silly not to use it”. Also, I see very little engagement, and this is likely because people are flooded with work and don’t have the time to evaluate it (including me).
(For your reading pleasure I’ve not used AI editing here, so you can enjoy my full range of spelling mistakes!)
For the last few years I’ve been working on a solution to this: unsupervised steering for credulity and honesty. I’d say it has promising results and good properties for debugging alignment.
Those would indeed be good. In the 2y since I made that comment I’ve worked on and made progress on one ambitious interp direction, self-supervised internal steering. The idea is to “amplify” honesty or corrigibility without labels or relying on outputs. It might even target deeper concepts, though so far it appears to intervene more at the behaviour level.
My feeling is that interp is held back because researchers aren’t insisting on hard and meaningful metrics and evals, for example doing the things you described, and also out of distribution, without labels. This is very hard, but so is the actual alignment challenge.
Results. Here I compare weight steering ws:* vs steering st:* vs prompting. Here is an AI-generated caption, but I’m happy to explain more and link the code if anyone asks.
Δlogit and uncertainty (Auth ↓ target, Care = off-target effect)
Authority is the target (move down). Care is one off-target effect: surgical methods should leave it near zero, broadly-suppressing methods drag it down with Authority. Full 7-foundation table in out/authority/.../foundations_dlogit.csv. Bold = best per column (most-negative ΔAuth, lowest std, closest-to-zero ΔCare).
* ws:delora calibrated at p95=0.5, not kl=1.0; expect larger effect after re-calibration.
Surgical Informedness (headline, ↑ better)
SI(Auth), SI_fwd, SI_rev, Auth_sep, and pmass²×100: all higher is better. Bold = best in column. sl rows from sl’s published Qwen3.5-4B run. ws:delora is at p95=0.5 budget (kl=1.0 re-run queued with lora/dora).
TL;DR
Did dW replicate? Yes. ws:delora ΔAuth = −0.89 (sign correct) and SI(Auth) = 19.03 — verdicts do flip in the right direction.
Did dW beat steering and prompting? Partially. SI = 19.03 beats the engineered-prompt baseline (17.36) and 5 other sl methods, but is below 8 hidden-state methods. ΔAuth std = 0.58 is the lowest in the table (lower uncertainty than all sl methods).
Did dW have lower uncertainty? Yes. ws:delora std = 0.58, lowest in the table (sl best: chars 0.61).