Exploration: fine-tuning with parameter decomposition
TL;DR: We can destroy a 67M-parameter language model’s ability to predict German text by fine-tuning a single number: the scalar prefactor on one German-related rank-1 parameter subcomponent.
This is an early exploration into using parameter decomposition for a more targeted and interpretable form of model fine-tuning. At small German-token budgets, fine-tuning the scalar prefactor of a single German-related parameter subcomponent beats rank-1 and rank-4 LoRA [1] fine-tunes on the trade-off between German performance removed vs. English performance retained. The single scalar fine-tune reaches
In a sense this is cheating, though: we’re indirectly exploiting the German tokens we already spent when we did the parameter decomposition and interpreted activating examples for the resulting subcomponents.
More interestingly, unlike the LoRAs, the scalar fine-tune consistently leaves French and Spanish almost untouched without us regularising for that. I found that out by accident. I didn’t think to specify that performance on other languages should be retained, but the targeted nature of the subcomponent-based fine-tune stopped me from shooting myself in the foot.
In fact, originally I was fine-tuning scale factors for 16 subcomponents, not just one, until I actually had a look at their autointerp labels and saw that 14⁄16 were about foreign languages in general, not German in particular. I switched to just the single subcomponent that mentioned German exclusively, and performance immediately improved. It seems there are some advantages to fine-tuning in a way that lets you somewhat see what you’re doing.
This is an exploratory case study I did for a hackathon. It’s also something of a sanity check for the parameter decomposition: if the parameter components we find can be used for targeted, predictable model editing, that’s some evidence they capture real structure in the model. In the future, I hope fine-tuning model weights in their component basis like this can help us get more fine-grained control over what models end up learning, because we can sort of see what the training is actually changing.
Recap: Parameter subcomponents
The subcomponents come from the adVersarial parameter decomposition (VPD) method and the exact 67M model and decomposition described in our recent paper.
VPD decomposes a trained model’s weights into rank-1 subcomponents. Each weight matrix
Since they are rank-one matrices, each subcomponent effectively has one “read” direction (
The subcomponents are trained such that as many of them as possible can be masked out on any given sequence position without changing the final output of the model, including, very crucially, under ablations that are adversarially selected to destroy behaviour. If a subcomponent isn’t causally important on a given input, we can change its mask
Empirically, these subcomponents activate on coherent categories of input, and we can attach autointerp labels to them that usually make sense.
In Section 6 of the paper we did a proof-of-concept manual edit: we took a single subcomponent that fired on the initial tokens of emoticons and rewrote its write vector
Idea: fine-tune by rescaling existing subcomponents
LoRA fine-tuning works by adding new low-rank matrices to the existing weights. Here, we’re going to do something much more restrictive: we treat the masks
amplifies a subcomponent, suppresses it, inverts it.
That’s the entire degree of freedom. In a sense, you could say we’re restricting the fine-tune to only amplify, suppress, or “invert” existing circuits, not add new ones. It is often speculated that a lot of fine-tuning and post-training is in some sense mostly just amplifying or suppressing behaviours or skills that were learned in pretraining. This is taking that idea very literally.
The hope here is that this could make fine-tuning more interpretable and controllable, with fewer unintended off-target effects. Since we can see the subcomponents and what they supposedly mean and do, we can maybe guess pretty well what impact the fine-tune will have off-distribution. Because we’re only re-weighting existing subcomponents here, there are also very few parameters to fit, so in principle this kind of fine-tuning should need very little data. The price for all this is that the model can’t really learn anything new.
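To make the "rescale existing subcomponents" idea concrete, here is a minimal toy sketch in plain Python. It is not the code used for the experiments. It assumes a weight matrix is the sum of rank-1 subcomponents (a write vector times a read vector), and that a scale of 1 on every subcomponent reproduces the unedited weights. The names `rank_one`, `compose_weight`, and `matvec`, and all numbers, are made up for illustration.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
Subcomponent = Tuple[Sequence[float], Sequence[float]]  # (write vector, read vector)

def rank_one(write: Sequence[float], read: Sequence[float]) -> Matrix:
    """Outer product: reads the input along `read`, writes the result along `write`."""
    return [[w * r for r in read] for w in write]

def compose_weight(subcomponents: Sequence[Subcomponent], scales: Sequence[float]) -> Matrix:
    """Sum the rank-1 subcomponents, each multiplied by its scalar prefactor."""
    rows = len(subcomponents[0][0])
    cols = len(subcomponents[0][1])
    weight = [[0.0] * cols for _ in range(rows)]
    for (write, read), scale in zip(subcomponents, scales):
        for i, row in enumerate(rank_one(write, read)):
            for j, value in enumerate(row):
                weight[i][j] += scale * value
    return weight

def matvec(weight: Matrix, x: Sequence[float]) -> List[float]:
    """Apply the weight matrix to an input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

# Two toy subcomponents of a 2x2 weight matrix (made-up numbers).
subcomponents = [
    ([1.0, 0.0], [1.0, 0.0]),  # reads input dim 0, writes output dim 0
    ([0.0, 1.0], [0.0, 1.0]),  # reads input dim 1, writes output dim 1
]
x = [2.0, 3.0]

# The only trainable quantities are the scales; read/write vectors stay frozen.
original = matvec(compose_weight(subcomponents, [1.0, 1.0]), x)    # unedited model
suppressed = matvec(compose_weight(subcomponents, [1.0, 0.0]), x)  # second subcomponent off
inverted = matvec(compose_weight(subcomponents, [1.0, -1.0]), x)   # second subcomponent sign-flipped
amplified = matvec(compose_weight(subcomponents, [1.0, 3.0]), x)   # second subcomponent boosted
```

In this toy, changing the second scale only changes the output along that subcomponent's write direction, which is the sense in which the edit cannot add anything new.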
You could imagine extending this a little to allow for learning new circuits, letting the fine-tune change the parameter subcomponents, but only along select directions that correspond to the
Original plan
The target task we picked was destroying performance on German, while preserving English. We chose German because it seemed to be the model’s strongest non-English language. Note though that the model is tiny and ultimately sucks at almost everything.
The original plan:
1. Select the 16 most German-specific subcomponents. For each subcomponent, we measure its average causal importance on German text and on English text, rank by the difference between the two, and take the top 16. [2]
2. Fine-tune those 16 mask values, with an objective that raises cross-entropy on German while a KL penalty holds the model’s English predictions close to the original.
3. Compare against rank-1 LoRA adapters (and rank-4, see appendix) trained with the same objective. [1]
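The selection step in the plan above can be sketched as a simple ranking. This is an illustration under assumptions, not the actual pipeline: it assumes the mean causal importances per subcomponent have already been computed, and the function name `rank_by_specificity`, the component names, and the numbers are all hypothetical.

```python
from typing import Dict, List, Tuple

def rank_by_specificity(
    mean_ci_target: Dict[str, float],
    mean_ci_reference: Dict[str, float],
    k: int,
) -> List[Tuple[str, float]]:
    """Rank subcomponents by (mean CI on target text) - (mean CI on reference text)."""
    gaps = {
        name: mean_ci_target[name] - mean_ci_reference[name]
        for name in mean_ci_target
    }
    # Largest gap first: most specific to the target language.
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)[:k]

# Made-up mean causal importances for four hypothetical subcomponents.
mean_ci_german = {"layer_a:0": 0.90, "layer_a:1": 0.80, "layer_b:0": 0.10, "layer_b:1": 0.60}
mean_ci_english = {"layer_a:0": 0.05, "layer_a:1": 0.70, "layer_b:0": 0.10, "layer_b:1": 0.20}

top = rank_by_specificity(mean_ci_german, mean_ci_english, k=2)
selected_names = [name for name, _ in top]
```

Note that a ranking like this only measures "more active on German than on English". A subcomponent that fires on all non-English text scores just as well as one that is German-specific, so the labels of the selected subcomponents are worth reading.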
We score a method based on how far it pushes held-out German CE up versus how much it damages English, for a given budget of German training tokens. English training tokens are unlimited; we just train until the loss stops changing much. “German removed” means German CE driven toward chance (~10.83 nats, i.e. near-uniform over the vocabulary); “English damage” is the increase in English CE, which we’d like to keep below ~0.1 nats. Each method’s learning rate (and LoRA rank scaling) is tuned per budget; full protocol in the appendix.
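As a quick check on the chance figure: a uniform prediction over a vocabulary of size V has cross-entropy ln(V). The sketch below assumes a vocabulary of 50,277 entries, which is my assumption for the tokenizer used here rather than a number from the post. The `score_run` helper and its `chance_tolerance` parameter are hypothetical; only the ~0.1 nat English threshold comes from the text.

```python
import math
from typing import Tuple

ASSUMED_VOCAB_SIZE = 50_277  # assumption, not stated in the post

def chance_cross_entropy(vocab_size: int) -> float:
    """Cross-entropy in nats of a uniform distribution over the vocabulary."""
    return math.log(vocab_size)

def score_run(
    german_ce: float,
    english_ce_increase: float,
    vocab_size: int = ASSUMED_VOCAB_SIZE,
    chance_tolerance: float = 0.1,    # hypothetical slack for "near chance"
    max_english_damage: float = 0.1,  # ~0.1 nats, from the text
) -> Tuple[bool, bool]:
    """Return (german_removed, english_preserved) for one fine-tuning run."""
    german_removed = german_ce >= chance_cross_entropy(vocab_size) - chance_tolerance
    english_preserved = english_ce_increase <= max_english_damage
    return german_removed, english_preserved

chance = chance_cross_entropy(ASSUMED_VOCAB_SIZE)  # roughly 10.83 nats
```

With these assumptions, a run counts as a success only when both flags are true, which corresponds to the bottom-right region of the scatter plots.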
The selected subcomponents
Here are the 16 subcomponents the
| # | Subcomponent | Autointerp label |
|---|---|---|
| 1 | h.0.attn.o_proj:41 | fires on non-english language text |
| 2 | h.3.mlp.down_proj:1984 | non-english and complex multi-token words |
| 3 | h.3.attn.o_proj:999 | fires continuously on non-english text |
| 4 | h.3.mlp.c_fc:722 | foreign language text processing |
| 5 | h.3.mlp.c_fc:1851 | non-english text and foreign words |
| 6 | h.3.attn.v_proj:513 | german text and names |
| 7 | h.3.attn.o_proj:190 | european languages (esp. german, swedish, finnish) |
| 8 | h.2.mlp.c_fc:128 | non-english text |
| 9 | h.3.mlp.down_proj:1726 | predicts word suffixes from stems |
| 10 | h.3.attn.o_proj:636 | fires on non-english european text |
| 11 | h.0.mlp.down_proj:3556 | non-english text |
| 12 | h.3.attn.o_proj:677 | punctuation/continuation in non-english languages |
| 13 | h.0.mlp.c_fc:2235 | non-english text |
| 14 | h.2.mlp.down_proj:143 | non-english or foreign language text |
| 15 | h.3.mlp.c_fc:2810 | subword continuations in multilingual text |
| 16 | h.3.attn.o_proj:890 | activates on non-english text |
Note that only one label (#6) mentions German and nothing else.
Results: the 16-component edit vs. rank-1 LoRA
Figure 1: German and English CE increase, top-16 subcomponent fine-tune vs. rank-1 LoRA
Top-16 subcomponent fine-tune vs. rank-1 LoRA. Each panel is one German-token budget (2–2,048). English CE increase (held-out Pile) on the y-axis, German CE increase on the x-axis. The dotted line marks German-at-chance. We want to be on the right of the dotted line and as far to the bottom as possible. The points are runs with different seeds, different German training tokens, and different regularisation strengths for the English term.
The 16-component edit degrades German performance to chance with less damage to English than the rank-1 LoRAs in the low-data regime, but as the budget of German tokens we train on increases, the LoRAs catch up and eventually overtake it.
A happy accident
I made a mistake when setting up the experiment. I told the agent to regularise on preserving English. [3] So it went off and grabbed a regularisation dataset containing English text and nothing else, rather than a dataset of everything except German text. I didn’t notice that I’d specified the wrong objective until much later.
The interesting part is what each method did when handed this under-specified objective.
Figure 2: Off-target CE increase, top-16 subcomponent fine-tune vs. rank-1 LoRA
CE increase on German and English (EuroParl) training data as well as English (Pile), French, Spanish, Italian, and code evaluation datasets, for runs with nine different seeds trained on 2048 German tokens. Somehow, both methods learn to spare code, despite the training data not containing any. Both methods usually damage or completely wreck performance on foreign languages. The LoRA occasionally heavily damages performance on the English evaluation set despite good performance on the training set.
At a large budget of 2048 German training tokens, both the LoRA and the 16-component fine-tune consistently do well on their training data across seeds. The LoRAs sometimes heavily damage performance on the English evaluation set.
Both methods often destroy performance on French, Spanish and Italian, as you might expect since their regularisation term only mentioned English. They somehow both preserve code even though there is no code in English Europarl. That sort of makes sense for the top-k fine-tune. Code might not be a “foreign language” in the sense these subcomponents track, so none of them touch it, and it survives. But how is the LoRA doing it? I dunno, maybe the base model is predicting code continuations of the English text and the KL term is picking up on that and protecting the code-related parts of the model? In which case, maybe the same thing is actually happening with the top-16 subcomponent fine-tune.
In any case, it’s sad that our German subcomponent fine-tune didn’t end up targeting only German. Can we do something about that?
A privilege of not working with black boxes
is that you can see problems and fix them.
If we look back at the table of subcomponent labels, thirteen of the sixteen labels are just about foreign / non-English text in general. One (#7) mentions German alongside Swedish and Finnish. Only one (#6, h.3.attn.v_proj:513) is about German specifically.
If most of the off-target damage comes from the “foreign-language-in-general” subcomponents, the obvious thing to try is to drop them and rescale only the one subcomponent whose label actually mentions German alone: h.3.attn.v_proj:513. That’s a one-parameter fine-tune.
Figure 3: German and English CE increase, single-component edit vs. rank-1 LoRA
Same layout as Figure 1; single-component edit vs. rank-1 LoRAs. The two ‘lines’ for the single subcomponent correspond to two solutions, one inverting solution where the German subcomponent flips sign (German hits chance performance around
It works. In particular, the line traced out by the inverting solution is even more efficient than the top-k fine-tune at small German token budgets. The amplifying solution isn’t as good. This makes some sense to me. To avoid side effects, you probably want to reduce German performance by suppressing German when you see it, not by amplifying it so massively that you break things.
Figure 4: Off-target CE increase, single subcomponent vs. top-16 subcomponent fine-tune vs. rank-1 LoRA
CE increase on German and English (EuroParl) training data as well as English (Pile), French, Spanish, Italian, and code evaluation data, for runs with different seeds trained on 2048 German tokens. The inverting single-component fine-tune avoids significant collateral damage in all categories except Italian.
The inverting solution for the single scalar fine-tune indeed doesn’t damage French or Spanish like the top-16 fine-tune or the LoRA do.
It still damages Italian for some reason. Maybe the autointerp is wrong, and the subcomponent isn’t purely German?
I asked Claude to look into it, and indeed:
The component’s causal read/write is German + Italian — its top input PMI is German + Italian function words, and its output is German words + di. So it actually participates in modeling German and Italian sentences.
It fires on French/Dutch/Polish/etc. only as occasional proper names, not their grammar (no French/Spanish function words in the input PMI). Suppressing it removes a German/Italian grammar contribution but barely touches French/Spanish sentence-level CE.
Code is off the axis entirely (no European-language content), so it’s untouched.
A brief glance I had at some of the activating examples also seemed to bear this out. So that answers that I guess.
Rollouts
Here are some prompt continuations of the base vs. edited model. As you can see, the 67M target model isn’t very coherent to begin with, but it does pick up on what language it ought to continue in.
| Prompt (language) | Baseline continuation | Edited continuation (single scalar) |
|---|---|---|
| German — “Die Lösung liegt natü” | …rlich, dass die Lösung der Lösung der Lösung der Lösung… (stays German) | redavid de noms de degré de la tierra de la tierra de la tierra… (drops out of German) |
| English — “And, of course, the solution does” | not work. A: You can use the following code: $this->setAttribute(‘value’, ‘value’);… | not work. A: You can use the following code: $this->set(‘value’, $value);… (essentially unchanged) |
| Spanish — “Sin embargo, el proyecto” | de la Comisión, en el que se ha hecho en el que se ha hecho… (Spanish) | de la Comisión, en su propuesta, es un problema que se ha hecho en la Comisión. Sánchez Presenga… (still Spanish) |
| Italian — “La ridestinazione delle risorse” | …onorevoli colleghi, la Commissione ha presentato una risposta a favore… (fluent-ish Italian) | del Parlamento, del Consiglio, del Consiglio, del Consiglio… (degraded, still Italian) |
Limitations
This is a single fine-tuning target for a single decomposition of a single 67M Pile model.
The model doesn’t know that much German to begin with. It’s the model’s best non-English language I found, but still far weaker than English.
If there are plenty of German tokens to train on, and we remember to include everything we want to protect performance on in our regularisation, the LoRAs are probably still better.
Performance on Italian was heavily damaged. See above.
Acknowledgments
Thanks to whoever on the team had the idea of using language unlearning as the target for this. I can’t remember who it was. Thanks to Dan Braun, Lee Sharkey, Atticus Geiger and Michael Jae Byun for feedback. Thanks to various Claude Opus 4.8 instances for most of the detail work designing, running and documenting the experiments as well as other assistance during the project.
This post was written at Goodfire AI. The results were produced with Goodfire’s Silico software. [4]
Appendix A. More LoRAs
Rank-4 LoRAs
Using a rank-4 LoRA adapter for every weight matrix instead of a rank-1 adapter improves performance somewhat, but doesn’t really change any of the qualitative conclusions.
Figure A1: German and English CE increase, single-component edit vs. rank-1 LoRA vs. rank-4 LoRAs
Same layout as Figure 3, with rank-4 LoRA added. Rank-4 still needs a budget of 32 tokens to reach German-at-chance with <0.10 nats damage to English, the same as the rank-1 LoRA. Both fall short of the inverting (negative-scale) solution for the single subcomponent fine-tune, which needs just 4 German tokens for this.
Figure A2: Off-target CE increase, single subcomponent fine-tunes vs. rank-1 LoRAs vs. rank-4 LoRAs
Off-target damage seems lower than with the rank-1 LoRAs, but far from gone. I’m honestly somewhat surprised they do any better at all. How is the higher rank helping with this?
Localised rank-1 LoRAs
What if we go the opposite route, and train a single rank-1 LoRA adapter for just the weight matrix that houses the German subcomponent, the value matrix in layer 3?
Figure A3: German and English CE increase, single-component edit vs. rank-1 LoRA vs. rank-1 LoRAs localised to the layer 3 attention Value matrix
This does need fewer German training tokens to achieve decent performance than the global LoRAs, but still not as few as the single scalar fine-tune.
Figure A4: Off-target CE increases, single subcomponent fine-tunes vs. rank-1 LoRAs vs. rank-1 LoRAs localised to the layer 3 attention Value matrix
Just like the global LoRAs, the localised LoRAs still cause large off-target damage to French, Spanish and Italian, and some small but notable off-target damage on the English evaluation set.
So, just localising to the weight matrix the subcomponent lives in isn’t enough, apparently.
Appendix B. Protocol and hyperparameters
Objective. Maximise German CE, capped at the
Target model / decomposition. Taken from the VPD paper.
Data. Europarl (European Parliament proceedings), gpt-neox-20b tokenizer, 512-token blocks. German-token budgets
Seeds. Nine per setting; each also uses a different block of German training text.
Evaluation (all held out). Per-language CE on content-matched parallel Europarl (de/fr/es/it); general-English CE on a held-out Pile slice (pile_val, deliberately a different distribution from the Europarl English the preserve term sees); code CE on codeparrot/codeparrot-clean-valid. Per-token CE, BOS excluded. The Europarl English the preserve term trains on is in-domain and is not used as an off-target monitor.
Component selection. Subcomponents ranked by h.3.attn.v_proj:513, the single most German-specific subcomponent).
LoRA baseline. Rank-
Hyperparameter selection. Per (method, budget), learning rate (and LoRA
Convergence. Every run trained to an EMA-of-loss plateau (patience 300, step ceiling 3000; early stop typically 300–1700 steps).
Selected learning rates. Per (method, budget), used for every result in the body.
| Budget (de tokens) | single scalar (lr) | top-k=16 (lr) | LoRA r=1 (lr/α) | LoRA r=4 (lr/α) | localised LoRA r=1 (lr/α) |
|---|---|---|---|---|---|
| 2 | 0.10 | 0.03 | 0.003 / 16 | 0.003 / 16 | 0.03 / 16 |
| 4 | 0.03 | 0.10 | 0.003 / 16 | 0.003 / 16 | 0.003 / 32 |
| 8 | 0.01 | 0.10 | 0.003 / 16 | 0.003 / 16 | 0.01 / 32 |
| 16 | 0.10 | 0.10 | 0.003 / 16 | 0.003 / 32 | 0.01 / 32 |
| 32 | 0.01 | 0.01 | 0.003 / 16 | 0.003 / 16 | 0.01 / 16 |
| 64 | 0.01 | 0.03 | 0.003 / 16 | 0.003 / 16 | 0.003 / 16 |
| 128 | 0.01 | 0.01 | 0.003 / 16 | 0.003 / 16 | 0.01 / 16 |
| 512 | 0.10 | 0.01 | 0.003 / 16 | 0.003 / 16 | 0.01 / 16 |
| 2048 | 0.10 | 0.03 | 0.003 / 16 | 0.003 / 16 | 0.01 / 32 |
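For readers who want something concrete, the objective and the convergence rule in this appendix could be sketched roughly as below. This is a guess at the general shape, not the code that produced the results. Assumptions: the KL direction is KL(original || edited), the German CE cap is passed in as a free parameter `ce_cap`, and `kl_weight`, `ema_decay`, and `min_delta` are hypothetical knobs. Only the patience of 300 and the step ceiling of 3000 come from the text.

```python
import math
from typing import List, Sequence

def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one token position."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - log_z for l in logits]

def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Cross-entropy in nats for a single target token."""
    return -log_softmax(logits)[target]

def kl_divergence(ref_logits: Sequence[float], new_logits: Sequence[float]) -> float:
    """KL(reference || new); the direction is an assumption."""
    ref_logp = log_softmax(ref_logits)
    new_logp = log_softmax(new_logits)
    return sum(math.exp(r) * (r - n) for r, n in zip(ref_logp, new_logp))

def unlearning_loss(german_logits, german_target, english_ref_logits,
                    english_new_logits, ce_cap: float, kl_weight: float) -> float:
    """Loss to minimise: push German CE up (to a cap), keep English close to the original."""
    capped_ce = min(cross_entropy(german_logits, german_target), ce_cap)
    return -capped_ce + kl_weight * kl_divergence(english_ref_logits, english_new_logits)

def plateau_step(losses: Sequence[float], patience: int = 300, max_steps: int = 3000,
                 ema_decay: float = 0.99, min_delta: float = 1e-4) -> int:
    """Step at which training stops: EMA of the loss has not improved for
    `patience` steps, or the step ceiling is reached."""
    ema, best, since_best = None, math.inf, 0
    for step, loss in enumerate(losses, start=1):
        ema = loss if ema is None else ema_decay * ema + (1.0 - ema_decay) * loss
        if ema < best - min_delta:
            best, since_best = ema, 0
        else:
            since_best += 1
        if since_best >= patience or step >= max_steps:
            return step
    return len(losses)

# Toy usage with made-up logits over a 4-token vocabulary.
uniform = [0.0, 0.0, 0.0, 0.0]
loss_capped = unlearning_loss(uniform, 0, uniform, uniform, ce_cap=1.0, kl_weight=1.0)
flat_stop = plateau_step([1.0] * 5000)
ramp_stop = plateau_step([10.0 - 0.001 * s for s in range(5000)])
```

The cap means the optimiser gets no further reward for pushing German CE past the target level, so past that point the only remaining gradient comes from the English-preserving term.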
[1] The LoRAs get one adapter for every weight matrix in the target model, except for the embedding and unembedding matrices. We didn’t decompose those. I also tried a more localised rank-1 LoRA with a single adapter, see Appendix A.
[2] You could instead apply a sparsity penalty on changes to the masks away from their initial values, encouraging the optimisation to change few subcomponents without specifying which ones. I did a preliminary try on that as well, and it seemed to work, but I wanted to get the post out and so didn’t do proper follow-ups to confirm the results hold up.
[3] My proofreaders say this sentence is confusing, so to clarify: The actual experiments were all carried out by AI agents. I rarely do experiments by hand at this point.
[4] Getting more people in the company to use it was what the hackathon was about. It worked pretty well for me. I think doing this with Claude Code would have taken me longer.
Nice post! I like the “happy accident” explanation, and I’m impressed by the way in which the components help you control generalisation (preserving not only English but also other languages).
In my mind is the question of “how should I update on VPD/parameter decomposition”: Is this result surprising, or would the same have happened with something like an SAE? (An SAE has the same advantage of having seen lots of German / language tokens, and having done the autointerp work.) Would it be easy for you to test this?
I expect VPD to beat SAEs (they seem more principled + did pretty well in your paper when compared to SAEs there), but seeing how much better would help me judge how impressive the German-abliteration is.
Not that easy. For starters, I’m unsure what the closest SAE edit would even look like here. SAE directions don’t live in weight space. Do we make a LoRA with its right or left singular vectors hard-wired to be parallel to some SAE decoder direction? In that case the other singular vector would still be completely free to vary, so it would be a pretty different kind of fine tune than the component based edit, with way more parameters.
You could give up on editing the model itself and just make a steering vector I guess?
More generally, you can’t just ‘fine-tune in the SAE basis’, because the SAE isn’t a basis for anything, it’s overcomplete. If you try to write the model weight gradient on the backward pass in the ‘basis’ of the SAE decoder elements, you can’t. You can insert the SAE on the forward pass and backprop through it on the backward pass, but the resulting gradients probably end up very different from those in the target model, because SAEs are trained to reconstruct activations on the forward pass, not gradients on the backward pass. You could try to make a new kind of SAE that’s trained to reconstruct gradients on the backward pass as well as activations on the forward pass I guess? IIRC some people may have tried something like this at some point. Unsure whether they got it working though.
Oh yes, in my head I was just thinking of something like constant fixed steering as the MVP
In that case, sounds kind of tough to me? If it’s constant fixed steering, you can’t have it be on when there’s German in the input and off when there isn’t any. So you might be messing up the activations a little bit on every forward pass with your negative German steering.
I haven’t actually done any activation steering though.
Do you currently have / are you planning to soon have a public demo / codebase somewhere? My team at Arcadia Alignment is interested in modular ways to introduce properties into language models without introducing excessive artifacts, and this sounds promising!
The parameter decomposition repo is all public. The code I used for the fine tuning isn’t, but it was really basic stuff an agent hacked together.