We train normally on the task while penalizing the average mean squared error (MSE) between the finetuned and reference models' representations of the alignment data at each hidden layer.
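As a minimal sketch of that objective (assuming a HuggingFace-style causal LM interface with `output_hidden_states=True` and a `.loss` attribute when labels are passed; the helper names and the weighting knob `lam` are illustrative, not prescribed):

```python
import torch
import torch.nn.functional as F

def alignment_penalty(model, ref_model, align_batch):
    """Average MSE between finetuned and reference hidden states, layer by layer."""
    out = model(**align_batch, output_hidden_states=True)
    with torch.no_grad():  # reference model stays frozen
        ref_out = ref_model(**align_batch, output_hidden_states=True)
    # Skip index 0 (embedding output) so only hidden layers are penalized.
    layer_losses = [
        F.mse_loss(h, h_ref)
        for h, h_ref in zip(out.hidden_states[1:], ref_out.hidden_states[1:])
    ]
    return torch.stack(layer_losses).mean()

def training_step(model, ref_model, task_batch, align_batch, lam=1.0):
    task_loss = model(**task_batch).loss  # ordinary task objective (labels in batch)
    penalty = alignment_penalty(model, ref_model, align_batch)
    return task_loss + lam * penalty      # lam trades off task fit vs. drift
```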
For parameterization and placement of this constraint, perhaps consider:
- SVD-projected activations: Some papers use activations projected to SVD space as a natural basis for this kind of loss.
- Residual stream subspace projections: Remove the embedding directions and the ~75% of the residual stream read by `lm_head`, which avoids constraining inputs and outputs directly. You can also project onto the subspaces actually written to during the alignment task, avoiding noise and null subspaces (a projection sketch follows this list).
- Task-sensitive dimensions: Focus on residual stream dimensions that are sensitive to the alignment task.
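Here is one way the second option might look, as a rough sketch rather than a recipe: drop the top singular directions read by `lm_head` and the top embedding directions, then compute the penalty only in the remaining subspace. The fraction arguments, helper names, and the choice of top singular vectors as the removed basis are all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def complement_projector(W_lm, W_embed, frac_lm=0.75, frac_emb=0.05):
    """Projector onto the complement of the directions read by lm_head and the
    top embedding directions. W_lm, W_embed: [vocab, d_model]."""
    d_model = W_lm.shape[1]
    _, _, Vh_lm = torch.linalg.svd(W_lm.float(), full_matrices=False)
    _, _, Vh_emb = torch.linalg.svd(W_embed.float(), full_matrices=False)
    drop = torch.cat([Vh_lm[: int(frac_lm * d_model)],
                      Vh_emb[: int(frac_emb * d_model)]], dim=0)
    Q, _ = torch.linalg.qr(drop.T)                       # orthonormal basis of dropped span
    return torch.eye(d_model, device=Q.device) - Q @ Q.T  # removes those directions

def projected_layer_mse(h, h_ref, P):
    """MSE computed only in the retained subspace (h: [..., d_model])."""
    return F.mse_loss(h @ P, h_ref @ P)  # P is symmetric, so right-multiplying works
```

The same `projected_layer_mse` shape works for the other two options: swap in a projector built from the SVD basis of the activations, or from the task-sensitive dimensions, instead of the complement projector above.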
Why do I think these are good ideas? LoRA variants that achieve data efficiency, faster convergence, and better generalization often take an opinionated view on the best way to intervene in transformer internals. If we treat them as hypotheses about how to view model representations, their performance provides clues for how to apply constraints like this. What I’ve learned from reading many adapter papers:
- Separate magnitude and angle (direction)
- Intervene on all linear layers
- Operate in SVD space, especially rotating the V matrix of the weights (a toy sketch of this follows)
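To make the last point concrete, here is a toy sketch (my own illustration, not any specific paper's method): factor a layer's weights as U S V^T and learn only an orthogonal rotation applied in the V basis, so the layer's input reading directions move while its spectrum stays fixed.

```python
import torch
import torch.nn as nn

class SVDRotatedLinear(nn.Module):
    """Wraps a frozen nn.Linear and learns a rotation of its V (input) basis."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        W = linear.weight.data.float()                     # [out, in]
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        self.register_buffer("U", U)
        self.register_buffer("S", S)
        self.register_buffer("Vh", Vh)
        self.bias = linear.bias
        r = S.shape[0]
        # Skew-symmetric generator A; exp(A - A^T) is an orthogonal rotation.
        # Full-rank here for clarity; a low-rank generator would be cheaper.
        self.A = nn.Parameter(torch.zeros(r, r))

    def forward(self, x):
        R = torch.matrix_exp(self.A - self.A.T)            # orthogonal rotation of V
        W = self.U @ torch.diag(self.S) @ R @ self.Vh      # U S (rotated V)^T
        return nn.functional.linear(x, W.to(x.dtype), self.bias)
```

Wrapping every `nn.Linear` in the model this way touches all linear layers (the second point) while expressing the update in the weights' own SVD basis (the third).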