Exploration: fine-tuning with parameter decomposition

TL;DR: We can destroy a 67M-parameter language model’s ability to predict German text by fine-tuning a single number: the scalar prefactor on one German-related rank-1 parameter subcomponent.

This is an early exploration into using parameter decomposition for a more targeted and interpretable form of model fine-tuning. At small German-token budgets, fine-tuning the scalar prefactor of a single German-related parameter subcomponent beats rank-1 and rank-4 LoRA ^[1] fine-tunes on the trade-off between German performance removed vs. English performance retained. The single scalar fine-tune reaches ~10.83 nats cross-entropy on German, the score you’d get from a uniform distribution over all output tokens, with <0.1 nats cross-entropy increase on English over the base model, from as few as ~4 German training tokens, compared to 32 tokens for the LoRAs.

In a sense this is cheating, though: we’re indirectly exploiting the German tokens we already spent when we did the parameter decomposition and interpreted activating examples for the resulting subcomponents.

More interestingly, unlike the LoRAs, the scalar fine-tune consistently leaves French and Spanish almost untouched without us regularising for that. I found that out by accident. I didn’t think to specify that performance on other languages should be retained, but the targeted nature of the subcomponent-based fine-tune stopped me from shooting myself in the foot.

In fact, originally I was fine-tuning scale factors for 16 subcomponents, not just one, until I actually had a look at their autointerp labels and saw that ¹⁴⁄₁₆ were about foreign languages in general, not German in particular. I switched to just the single subcomponent that mentioned German exclusively, and performance immediately improved. It seems there are some advantages to fine-tuning in a way that lets you somewhat see what you’re doing.

This is an exploratory case study I did for a hackathon. It’s also something of a sanity check for the parameter decomposition: if the parameter components we find can be used for targeted, predictable model editing, that’s some evidence they capture real structure in the model. In the future, I hope fine-tuning model weights in their component basis like this can help us get more fine-grained control over what models end up learning, because we can sort of see what the training is actually changing.

Recap: Parameter subcomponents

The subcomponents come from the adVersarial parameter decomposition (VPD) method and the exact 67M model and decomposition described in our recent paper.

VPD decomposes a trained model’s weights into rank-1 subcomponents. Each weight matrix $W$ in the model is rewritten as a sum of rank-1 subcomponents (plus a residual component $\Delta$ that is trained to be small and causally irrelevant to the output):

$$W = \sum_{c} \vec{u}_c \vec{v}_c^{\top} + \Delta$$

Since they are rank-one matrices, each subcomponent effectively has one “read” direction ($\vec{v}_c$) and one “write” direction ($\vec{u}_c$).

The subcomponents are trained such that as many of them as possible can be masked out on any given sequence position without changing the final output of the model, including, very crucially, under ablations that are adversarially selected to destroy behaviour. If a subcomponent isn’t causally important on a given input, we can change its mask from 1 to any other value in $[0, 1]$ without changing the model output much.
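To make the bookkeeping concrete, here is a minimal numpy sketch of a single weight matrix written as masked rank-1 subcomponents. Everything in it is a toy assumption (the names `u`, `v`, `delta`, `masked_weight`, the sizes, the random values), and it only looks at one linear layer, whereas the real method cares about the final model output. It just shows that when an input has no overlap with a subcomponent's read direction, zeroing that subcomponent's mask leaves the layer output unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_components = 6, 5, 4

# Hypothetical toy subcomponents: u[c] is the "write" direction,
# v[c] is the "read" direction of subcomponent c.
u = rng.normal(size=(n_components, d_out))
v = rng.normal(size=(n_components, d_in))
delta = 1e-3 * rng.normal(size=(d_out, d_in))  # small residual term

def masked_weight(masks):
    """Return sum_c masks[c] * outer(u[c], v[c]) + delta."""
    return np.einsum("c,co,ci->oi", masks, u, v) + delta

# All masks at 1 gives back the full weight matrix.
W = masked_weight(np.ones(n_components))

# Build an input that is orthogonal to the read direction of subcomponent 0.
x = rng.normal(size=d_in)
x -= (x @ v[0]) / (v[0] @ v[0]) * v[0]

# Ablating subcomponent 0 changes the weights, but not the output on x.
ablated = masked_weight(np.array([0.0, 1.0, 1.0, 1.0]))
max_change = np.abs(W @ x - ablated @ x).max()
print(max_change)
```

In this toy picture "not causally important" collapses to "the read direction sees nothing", which is much cruder than the adversarially trained notion described above.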

Empirically, these subcomponents activate on coherent categories of input, and we can attach autointerp labels to them that usually make sense.

In Section 6 of the paper we did a proof-of-concept manual edit: we took a single subcomponent that fired on the initial tokens of emoticons and rewrote its write vector to point strongly in the direction of the unembedding vector for the “o” token, making the model predict that all emoticons are surprised-face emoticons.

Idea: fine-tune by rescaling existing subcomponents

LoRA fine-tuning works by adding new low-rank matrices to the existing weights. Here, we’re going to do something much more restrictive: we treat the masks $m_c$ for a small subset of parameter subcomponents that seem related to our fine-tuning task as the only trainable parameters. We start from $m_c = 1$ (the reconstructed model) and let gradient descent move the entries:

$m_c > 1$ amplifies a subcomponent,
$0 \le m_c < 1$ suppresses it,
$m_c < 0$ inverts it.

That’s the entire degree of freedom. In a sense, you could say we’re restricting the fine-tune to only amplify, suppress, or “invert” existing circuits, not add new ones. It is often speculated that a lot of fine-tuning and post-training is in some sense mostly just amplifying or suppressing behaviours or skills that were learned in pretraining. This is taking that idea very literally.

The hope here is that this could make fine-tuning more interpretable and controllable, with fewer unintended off-target effects. Since we can see the subcomponents and what they supposedly mean and do, we can maybe guess pretty well what impact the fine-tune will have off-distribution. Because we’re only re-weighting existing subcomponents here, there are also very few parameters to fit, so in principle this kind of fine-tuning should need very little data. The price for all this is that the model can’t really learn anything new.
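As a rough illustration of how little there is to fit, here is a toy numpy sketch that trains exactly one mask scalar by gradient descent with everything else frozen. It is not the experiment from this post: the layer is a single linear map, the "German" and "English" inputs are made-up vectors, and the loss is a squared-error stand-in (shrink the German output, keep the English output where it was) rather than the CE and KL terms used in the real runs. All names and numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# Hypothetical toy layer: three rank-1 subcomponents with orthonormal read directions.
basis = np.linalg.qr(rng.normal(size=(d, d)))[0]
v = basis[:, :3].T            # read directions, shape (3, d)
u = rng.normal(size=(3, d))   # write directions, shape (3, d)

def layer(x, masks):
    """y = sum_c masks[c] * u[c] * (v[c] . x)"""
    return (masks * (v @ x)) @ u

x_german = v[0] + 0.1 * v[2]   # mostly read by subcomponent 0
x_english = v[1] + 0.1 * v[2]  # not read by subcomponent 0 at all

masks = np.ones(3)             # start from the reconstructed model
base_german = layer(x_german, masks)
base_english = layer(x_english, masks)

lam, lr = 1.0, 0.01            # assumed penalty weight and learning rate
for _ in range(500):
    y_de = layer(x_german, masks)
    y_en = layer(x_english, masks)
    # d y / d masks[0] = u[0] * (v[0] . x); only masks[0] is trainable.
    grad = 2 * y_de @ (u[0] * (v[0] @ x_german))
    grad += 2 * lam * (y_en - base_english) @ (u[0] * (v[0] @ x_english))
    masks[0] -= lr * grad

final_german_norm = np.linalg.norm(layer(x_german, masks))
base_german_norm = np.linalg.norm(base_german)
english_unchanged = np.allclose(layer(x_english, masks), base_english)
print(masks[0], base_german_norm, final_german_norm, english_unchanged)
```

Because the English input has no overlap with the read direction of subcomponent 0, the preserve term never pushes back here. In the real model the overlap is not exactly zero, which is why the English penalty matters.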

You could imagine extending this a little to allow for learning new circuits, letting the fine-tune change the parameter subcomponents, but only along select directions that correspond to the read and write vectors of specific other parameter subcomponents. That way, you could specify a blueprint of the new internal behaviour you want learned at a very high level, laying out what subcomponents should be involved, and leave the details of how exactly to adjust the wirings between them up to gradient descent. Then the model could learn new things, but only within a narrow framework we get to dictate in advance, giving us a lot of control over and insight into what is learned at the expense of limiting creativity. Today, we’re just doing the basic rescaling version though.

Original plan

The target task we picked was destroying performance on German, while preserving English. We chose German because it seemed to be the model’s strongest non-English language. Note though that the model is tiny and ultimately sucks at almost everything.

The original plan:

Select the 16 most German-specific subcomponents. For each subcomponent, we measure its average causal importance on German text and on English text, and rank by the difference (German minus English). Then we take the top 16. ^[2]
Fine-tune those 16 mask values, with an objective that raises cross-entropy on German while a KL penalty holds the model’s English predictions close to the original.
Compare against rank-1 LoRA (and rank-4, see appendix) adapters trained with the same objective. ^[1:1]
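The selection step in the plan above can be sketched in a few lines of Python. The function name, the dictionary layout and the causal-importance numbers are all made up for illustration; only the rule (rank by mean causal importance on German minus mean causal importance on English, keep the top k) comes from the text.

```python
def top_k_german_specific(ci_german, ci_english, k):
    """Rank subcomponents by mean CI on German minus mean CI on English."""
    scores = {name: ci_german[name] - ci_english[name] for name in ci_german}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical mean causal importances per subcomponent.
ci_de = {"comp_a": 0.80, "comp_b": 0.62, "comp_c": 0.30}
ci_en = {"comp_a": 0.05, "comp_b": 0.01, "comp_c": 0.29}

selected = top_k_german_specific(ci_de, ci_en, k=2)
print(selected)
```

Note that this score only asks "more important on German than on English". A subcomponent that fires on all non-English text scores just as well as one that is specific to German.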

We score a method based on how far it pushes German held-out CE up versus how much it damages English, for a given budget of German training tokens. English training tokens are unlimited; we just train until the loss stops changing much. “German removed” means German CE driven toward chance (~10.83 nats, i.e. near-uniform over the vocabulary); “English damage” is the increase in English CE, which we’d like to keep below ~0.1 nats. Each method’s learning rate (and LoRA rank scaling) is tuned per budget; full protocol in the appendix.
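For reference, the chance level is just the log of the vocabulary size, and the two scoring quantities are simple differences. The sketch below assumes a vocabulary of 50,277 tokens, which is my assumption for the gpt-neox-20b tokenizer and is not stated in this post. The tolerance `tol` for counting German as "removed" and the example CE values are also made up.

```python
import math

def chance_ce(vocab_size):
    """Cross-entropy in nats of a uniform distribution over the vocabulary."""
    return math.log(vocab_size)

def score_run(german_ce, english_ce, base_english_ce,
              vocab_size, tol=0.05, max_damage=0.1):
    """Summarise one run: is German at chance, and how much did English suffer?"""
    english_damage = english_ce - base_english_ce
    german_removed = german_ce >= chance_ce(vocab_size) - tol
    return {
        "german_removed": german_removed,
        "english_damage": english_damage,
        "acceptable": german_removed and english_damage < max_damage,
    }

VOCAB_SIZE = 50_277  # assumed, not taken from the post
print(round(chance_ce(VOCAB_SIZE), 2))

# Hypothetical CE values for one run.
result = score_run(german_ce=10.80, english_ce=3.05,
                   base_english_ce=3.00, vocab_size=VOCAB_SIZE)
print(result)
```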

The selected subcomponents

Here are the 16 subcomponents the ranking picked out, with their autointerp labels:

#	Subcomponent	Autointerp label
1	`h.0.attn.o_proj:41`	fires on non-english language text
2	`h.3.mlp.down_proj:1984`	non-english and complex multi-token words
3	`h.3.attn.o_proj:999`	fires continuously on non-english text
4	`h.3.mlp.c_fc:722`	foreign language text processing
5	`h.3.mlp.c_fc:1851`	non-english text and foreign words
6	`h.3.attn.v_proj:513`	german text and names
7	`h.3.attn.o_proj:190`	european languages (esp. german, swedish, finnish)
8	`h.2.mlp.c_fc:128`	non-english text
9	`h.3.mlp.down_proj:1726`	predicts word suffixes from stems
10	`h.3.attn.o_proj:636`	fires on non-english european text
11	`h.0.mlp.down_proj:3556`	non-english text
12	`h.3.attn.o_proj:677`	punctuation/continuation in non-english languages
13	`h.0.mlp.c_fc:2235`	non-english text
14	`h.2.mlp.down_proj:143`	non-english or foreign language text
15	`h.3.mlp.c_fc:2810`	subword continuations in multilingual text
16	`h.3.attn.o_proj:890`	activates on non-english text

Note that only one label (#6) mentions German and nothing else.

Results: the 16-component edit vs. rank-1 LoRA

Figure 1: German and English CE increase, top-16 subcomponent fine-tune vs. rank-1 LoRA

Top-16 subcomponent fine-tune vs. rank-1 LoRA. Each panel is one German-token budget (2–2,048). English CE increase (held-out Pile) on the y-axis, German CE increase on the x-axis. The dotted line marks German-at-chance. We want to be on the right of the dotted line and as far to the bottom as possible. The points are runs with different seeds, different German training tokens, and different regularisation strengths for the English term.

The 16-component edit degrades German performance to chance with less damage to English than the rank-1 LoRAs in the low-data regime, but as the budget of German tokens we train on increases, the LoRAs catch up and eventually overtake it.

A happy accident

I made a mistake when setting up the experiment. I told the agent to regularise on preserving English. ^[3] So it went off and grabbed a regularisation dataset containing English text and nothing else, rather than a dataset of everything except German text. I didn’t notice that I’d specified the wrong objective until much later.

The interesting part is what each method did when handed this under-specified objective.

Figure 2: Off-target CE increase, top-16 subcomponent fine-tune vs. rank-1 LoRA

CE increase on German and English (Europarl) training data as well as English (Pile), French, Spanish, Italian, and code evaluation datasets, for runs with nine different seeds trained on 2048 German tokens. Somehow, both methods learn to spare code, despite the training data not containing any. Both methods usually damage or completely wreck performance on foreign languages. The LoRA occasionally heavily damages performance on the English evaluation set despite good performance on the training set.

At a large budget of 2048 German training tokens, both the LoRA and the 16-component fine-tune consistently do well on their training data across seeds. The LoRAs sometimes heavily damage performance on the English evaluation set.

Both methods often destroy performance on French, Spanish and Italian, as you might expect since their regularisation term only mentioned English. They somehow both preserve code even though there is no code in English Europarl. That sort of makes sense for the top-k fine-tune. Code might not be a “foreign language” in the sense these subcomponents track, so none of them touch it, and it survives. But how is the LoRA doing it? I dunno, maybe the base model is predicting code continuations of the English text and the KL term is picking up on that and protecting the code-related parts of the model? In which case, maybe the same thing is actually happening with the top-16 subcomponent fine-tune.

In any case, it’s sad that our German subcomponent fine-tune didn’t end up targeting only German. Can we do something about that?

A privilege of not working with black boxes

is that you can see problems and fix them.

If we look back at the table of subcomponent labels, thirteen of the sixteen labels are just about foreign / non-English text in general. One (#7) mentions German alongside Swedish and Finnish. Only one (#6, h.3.attn.v_proj:513) is about German specifically.

If most of the off-target damage comes from the “foreign-language-in-general” subcomponents, the obvious thing to try is to drop them and rescale only the one subcomponent whose label actually mentions German alone: h.3.attn.v_proj:513. That’s a one-parameter fine-tune.

Figure 3: German and English CE increase, single-component edit vs. rank-1 LoRA

Same layout as Figure 1; single-component edit vs. rank-1 LoRAs. The two ‘lines’ for the single subcomponent correspond to two solutions: an inverting solution, where the German subcomponent’s mask flips sign and German hits chance performance, and an amplifying solution, where the mask is instead massively increased in size and German gets close to chance performance. The inverting solution works better. (Hover any single-component point for its learned mask value.)

It works. In particular, the line traced out by the inverting solution is even more efficient than the top-k fine-tune at small German token budgets. The amplifying solution isn’t as good. This makes some sense to me. To avoid side effects, you probably want to reduce German performance by suppressing German when you see it, not by amplifying it so massively that you break things.

Figure 4: Off-target CE increase, single subcomponent vs. top-16 subcomponent fine-tune vs. rank-1 LoRA

CE increase on German and English (Europarl) training data as well as English (Pile), French, Spanish, Italian, and code evaluation data, for runs with different seeds trained on 2048 German tokens. The inverting single-component fine-tune avoids significant collateral in all categories except Italian.

The inverting solution for the single scalar fine-tune indeed doesn’t damage French or Spanish like the top-16 fine-tune or the LoRA do.

It still damages Italian for some reason. Maybe the autointerp is wrong, and the subcomponent isn’t purely German?

I asked Claude to look into it, and indeed:

The component’s causal read/write is German + Italian — its top input PMI is German + Italian function words, and its output is German words + di. So it actually participates in modeling German and Italian sentences.

It fires on French/Dutch/Polish/etc. only as occasional proper names, not their grammar (no French/Spanish function words in the input PMI). Suppressing it removes a German/Italian grammar contribution but barely touches French/Spanish sentence-level CE.

Code is off the axis entirely (no European-language content), so it’s untouched.

A brief glance I had at some of the activating examples also seemed to bear this out. So that answers that I guess.

Rollouts

Here are some prompt continuations of the base vs. edited model. As you can see, the 67M target model isn’t very coherent to begin with, but it does pick up on what language it ought to continue in.

Prompt (language)	Baseline continuation	Edited continuation (single scalar)
German — “Die Lösung liegt natü”	…rlich, dass die Lösung der Lösung der Lösung der Lösung… (stays German)	redavid de noms de degré de la tierra de la tierra de la tierra… (drops out of German)
English — “And, of course, the solution does”	not work. A: You can use the following code: $this->setAttribute(‘value’, ‘value’);…	not work. A: You can use the following code: $this->set(‘value’, $value);… (essentially unchanged)
Spanish — “Sin embargo, el proyecto”	de la Comisión, en el que se ha hecho en el que se ha hecho… (Spanish)	de la Comisión, en su propuesta, es un problema que se ha hecho en la Comisión. Sánchez Presenga… (still Spanish)
Italian — “La ridestinazione delle risorse”	…onorevoli colleghi, la Commissione ha presentato una risposta a favore… (fluent-ish Italian)	del Parlamento, del Consiglio, del Consiglio, del Consiglio… (degraded, still Italian)

Limitations

This is a single fine-tuning target for a single decomposition of a single 67M Pile model.
The model doesn’t know that much German to begin with. It’s the model’s best non-English language I found, but still far weaker than English.
If there are plenty of German tokens to train on, and we remember to include everything we want to protect performance on in our regularisation, the LoRAs are probably still better.
Performance on Italian was heavily damaged. See above.

Acknowledgments

Thanks to whoever on the team had the idea of using language unlearning as the target for this. I can’t remember who it was. Thanks to Dan Braun, Lee Sharkey, Atticus Geiger and Michael Jae Byun for feedback. Thanks to various Claude Opus 4.8 instances for most of the detail work designing, running and documenting the experiments as well as other assistance during the project.

This post was written at Goodfire AI. The results were produced with Goodfire’s Silico software. ^[4]

Appendix A. More LoRAs

Rank-4 LoRAs

Using a rank-4 LoRA adapter for every weight matrix instead of a rank-1 adapter improves performance somewhat, but doesn’t really change any of the qualitative conclusions.

Figure A1: German and English CE increase, single-component edit vs. rank-1 LoRA vs. rank-4 LoRAs

Same layout as Figure 3, with rank-4 LoRA added. The rank-4 LoRA reaches German-at-chance with <0.10 nats damage to English at a budget of 32 tokens, the same as the rank-1 LoRA. Both fall short of the inverting solution for the single-subcomponent fine-tune, which needs just 4 German tokens for this.

Figure A2: Off-target CE increase, single subcomponent fine-tunes vs. rank-1 LoRAs vs. rank-4 LoRAs

Same layout as Figure 4, with rank-4 LoRA instead of the top-k fine-tune. The rank-4 LoRA seems to do better than the rank-1 at avoiding off-target damage, but not as well as the inverting single-subcomponent fine-tune.

Off-target damage seems lower than with the rank-1 LoRAs, but far from gone. I’m honestly somewhat surprised they do any better at all. How is the higher rank helping with this?

Localised rank-1 LoRAs

What if we go the opposite route, and train a single rank-1 LoRA adapter for just the weight matrix that houses the German subcomponent, the value matrix in layer 3?

Figure A3: German and English CE increase, single-component edit vs. rank-1 LoRA vs. rank-1 LoRAs localised to the layer 3 attention Value matrix

This does need fewer German training tokens to achieve decent performance than the global LoRAs, but still not as few as the single scalar fine-tune.

Figure A4: Off-target CE increases, single subcomponent fine-tunes vs. rank-1 LoRAs vs. rank-1 LoRAs localised to the layer 3 attention Value matrix

Just like the global LoRAs, the localised LoRAs still cause large off-target damage to French, Spanish and Italian, and some small but notable off-target damage on the English evaluation set.

So, just localising to the weight matrix the subcomponent lives in isn’t enough, apparently.

Appendix B. Protocol and hyperparameters

Objective. Maximise German CE, capped at the chance-level ceiling; preserve English via a KL penalty to the original model’s logits. As discussed in the body, the dataset for the preserve term covered English text exclusively.
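A minimal sketch of what such an objective could look like for a single German position and a single English position, in plain Python. This is my reading of the description, not the code that was run: the cap is implemented with a simple `min`, the KL direction (base model relative to edited model) is an assumption, and `ce_cap`, `kl_weight` and all function names are hypothetical.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - log_z for l in logits]

def cross_entropy(logits, target):
    """Cross-entropy in nats of the target token under the logits."""
    return -log_softmax(logits)[target]

def kl_divergence(base_logits, edited_logits):
    """KL(base || edited) between two next-token distributions."""
    lp, lq = log_softmax(base_logits), log_softmax(edited_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))

def unlearning_loss(german_logits, german_target,
                    english_base_logits, english_edited_logits,
                    ce_cap, kl_weight):
    """Loss to minimise: push German CE up to the cap, keep English close to base."""
    suppress = -min(cross_entropy(german_logits, german_target), ce_cap)
    preserve = kl_weight * kl_divergence(english_base_logits, english_edited_logits)
    return suppress + preserve

# Toy 4-token vocabulary: uniform German logits are already at the cap ln(4).
loss = unlearning_loss([0.0, 0.0, 0.0, 0.0], 1,
                       [1.0, 2.0, 3.0], [1.0, 2.0, 3.0],
                       ce_cap=math.log(4), kl_weight=1.0)
print(loss)
```

With the cap in place, the suppress term stops producing gradient once German CE reaches the ceiling, so the optimiser has no incentive to keep breaking things beyond chance level.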

Target model / decomposition. Taken from the VPD paper.

Data. Europarl (European Parliament proceedings), gpt-neox-20b tokenizer, 512-token blocks. German-token budgets: 2, 4, 8, 16, 32, 64, 128, 512 and 2048.

Seeds. Nine per setting; each also uses a different block of German training text.

Evaluation (all held out). Per-language CE on content-matched parallel Europarl (de/fr/es/it); general-English CE on a held-out Pile slice (pile_val, deliberately a different distribution from the Europarl English the preserve term sees); code CE on codeparrot/codeparrot-clean-valid. Per-token CE, BOS excluded. The Europarl English the preserve term trains on is in-domain and is not used as an off-target monitor.

Component selection. Subcomponents ranked by mean causal importance on German minus mean causal importance on English; top 16, and separately just one (h.3.attn.v_proj:513, the single most German-specific subcomponent).

LoRA baseline. Rank-$r$ adapters injected via forward hooks on all 24 linear layers (identity at init; ~55k params at $r = 1$, ~221k at $r = 4$), embedding/unembedding excluded. Same capped-suppress + KL objective and same budget//replicate grid as the subcomponent fine-tunes.
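The parameter counts above can be sanity-checked with a bit of arithmetic, and the "identity at init" property comes from zero-initialising one of the two low-rank factors. The sketch below is a plain numpy wrapper rather than a forward hook, and the layer shapes (model width 768, MLP width 3072, four blocks with q, k, v, o, c_fc and down_proj matrices) are my assumption, picked because they give 24 matrices and reproduce the quoted counts. The $\alpha / r$ scaling is the usual LoRA convention, assumed here.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable rank-r update (alpha / r) * B @ A."""

    def __init__(self, weight, rank, alpha, rng):
        d_out, d_in = weight.shape
        self.weight = weight
        self.A = rng.normal(scale=0.01, size=(rank, d_in))
        self.B = np.zeros((d_out, rank))  # zero init: no change to the layer at start
        self.scale = alpha / rank

    def __call__(self, x):
        return self.weight @ x + self.scale * (self.B @ (self.A @ x))

    def n_trainable(self):
        return self.A.size + self.B.size

def lora_param_count(shapes, rank):
    """Trainable parameters for one rank-`rank` adapter per (d_out, d_in) matrix."""
    return sum(rank * (d_in + d_out) for d_out, d_in in shapes)

# Assumed shapes, not taken from the post.
D_MODEL, D_MLP, N_BLOCKS = 768, 3072, 4
block_shapes = [(D_MODEL, D_MODEL)] * 4 + [(D_MLP, D_MODEL), (D_MODEL, D_MLP)]
all_shapes = block_shapes * N_BLOCKS
print(len(all_shapes), lora_param_count(all_shapes, 1), lora_param_count(all_shapes, 4))

rng = np.random.default_rng(0)
layer = LoRALinear(rng.normal(size=(5, 3)), rank=1, alpha=16, rng=rng)
x = rng.normal(size=3)
unchanged_at_init = np.allclose(layer(x), layer.weight @ x)
```

Even at rank 1 that is tens of thousands of trainable numbers against a single scalar for the one-subcomponent fine-tune.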

Hyperparameter selection. Per (method, budget), learning rate (and LoRA ) and are chosen on a held-out Europarl dev split disjoint from the report evaluation. Grids: subcomponent masks ; LoRA , ; for all methods. Selection criterion (identical across methods): among configs with dev English CE increase , we pick the one that degrades performance on German most. The actual main runs use . For figures 2, 4, A2 and A4 we select the values for each method that damage performance on the English training set the least while still relatively consistently driving German CE to nats.
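The selection criterion amounts to a filter followed by an argmax. A small sketch, with hypothetical field names and dev-set numbers; the threshold is passed in as a parameter, the value used in the example call is purely illustrative, and whether the comparison is strict is my assumption.

```python
def select_config(configs, max_english_increase):
    """Among configs within the English budget, pick the one that hurts German most."""
    eligible = [c for c in configs
                if c["dev_english_ce_increase"] <= max_english_increase]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c["dev_german_ce_increase"])

# Hypothetical dev-set results for three learning rates.
configs = [
    {"lr": 0.01, "dev_english_ce_increase": 0.02, "dev_german_ce_increase": 3.1},
    {"lr": 0.03, "dev_english_ce_increase": 0.06, "dev_german_ce_increase": 5.9},
    {"lr": 0.10, "dev_english_ce_increase": 0.40, "dev_german_ce_increase": 6.4},
]

chosen = select_config(configs, max_english_increase=0.1)
print(chosen["lr"])
```

Applying the same rule to every method keeps the comparison from being decided by whoever got the friendlier hyperparameter search.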

Convergence. Every run trained to an EMA-of-loss plateau (patience 300, step ceiling 3000; early stop typically 300–1700 steps).
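One way to implement an EMA-of-loss plateau rule with the patience and step ceiling quoted above is sketched here. The EMA decay, the minimum-improvement threshold `min_delta` and the function name are my assumptions; `step_fn` stands in for "do one optimiser step and return the loss".

```python
def train_until_plateau(step_fn, patience=300, max_steps=3000,
                        ema_decay=0.99, min_delta=1e-4):
    """Run step_fn until the EMA of its loss stops improving; return steps taken."""
    ema = None
    best = float("inf")
    steps_since_best = 0
    for step in range(1, max_steps + 1):
        loss = step_fn()
        ema = loss if ema is None else ema_decay * ema + (1 - ema_decay) * loss
        if ema < best - min_delta:
            best = ema
            steps_since_best = 0
        else:
            steps_since_best += 1
            if steps_since_best >= patience:
                return step  # plateau: no improvement for `patience` steps
    return max_steps  # hit the step ceiling

# A loss that never changes plateaus right after the patience window.
steps_flat = train_until_plateau(lambda: 1.0)

# A loss that keeps falling runs all the way to the step ceiling.
falling = iter(range(10_000, 0, -1))
steps_falling = train_until_plateau(lambda: float(next(falling)), max_steps=500)
print(steps_flat, steps_falling)
```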

Selected learning rates. Per (method, budget), used for every result in the body.

Budget (de tokens)	single scalar (lr)	top-k=16 (lr)	LoRA r=1 (lr/α)	LoRA r=4 (lr/α)	localised LoRA r=1 (lr/α)
2	0.10	0.03	0.003 / 16	0.003 / 16	0.03 / 16
4	0.03	0.10	0.003 / 16	0.003 / 16	0.003 / 32
8	0.01	0.10	0.003 / 16	0.003 / 16	0.01 / 32
16	0.10	0.10	0.003 / 16	0.003 / 32	0.01 / 32
32	0.01	0.01	0.003 / 16	0.003 / 16	0.01 / 16
64	0.01	0.03	0.003 / 16	0.003 / 16	0.003 / 16
128	0.01	0.01	0.003 / 16	0.003 / 16	0.01 / 16
512	0.10	0.01	0.003 / 16	0.003 / 16	0.01 / 16
2048	0.10	0.03	0.003 / 16	0.003 / 16	0.01 / 32

[1] The LoRAs get one adapter for every weight matrix in the target model, except for the embedding and unembedding matrices. We didn’t decompose those. I also tried a more localised rank-1 LoRA with a single adapter, see Appendix A.
[2] You could instead apply a sparsity-inducing penalty on changes to the masks away from 1, encouraging the optimisation to change few subcomponents without specifying which ones. I did a preliminary try on that as well, and it seemed to work, but I wanted to get the post out and so didn’t do proper follow-ups to confirm the results hold up.
[3] My proofreaders say this sentence is confusing, so to clarify: The actual experiments were all carried out by AI agents. I rarely do experiments by hand at this point.
[4] Getting more people in the company to use it was what the hackathon was about. It worked pretty well for me. I think doing this with Claude Code would have taken me longer.

More generally, you can’t just ‘fine-tune in the SAE basis’, because the SAE isn’t a basis for anything, it’s overcomplete. If you try to write the model weight gradient on the backward pass in the ‘basis’ of the SAE decoder elements, you can’t. You can insert the SAE on the forward pass and backprop through it on the backward pass, but the resulting gradients probably end up very different from those in the target model, because SAEs are trained to reconstruct activations on the forward pass, not gradients on the backward pass. You could try to make a new kind of SAE that’s trained to reconstruct gradients on the backward pass as well as activations on the forward pass I guess? IIRC some people may have tried something like this at some point. Unsure whether they got it working though.