What’s going on with interpretability these days?
I found the whole monosemantic sparse autoencoder idea interesting, but this was 2023 and it’s now 2026.
SAEs (sparse autoencoders) have had several problems over the years (eg feature splitting, cross-layer features, non-causal features), along with many proposed fixes for those issues. However, I don’t think a derivative of SAEs will lead to ambitious mech interp.
The Apollo (now Goodfire) folks (Lee Sharkey, Lucius Bushnaq, and Dan Braun) have worked on Parameter Decomposition (PD)^[1]^, a weight-based approach intended to improve over SAEs in a few ways:
- make cross-layer features a natural object (ie just define a weight-“mechanism” over multiple layers)
- faithfulness to the original computation: if all your “mechanisms” (their term for features) sum to the original model, then the decomposition is faithful
- multi-dimensional features are also a natural object
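To make the faithfulness point concrete, here’s a minimal toy sketch. It uses SVD components as stand-ins for learned mechanisms (the actual APD/SPD methods learn their decompositions; this is just an illustration of the sum-to-the-original criterion):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))  # toy weight matrix standing in for a model layer

# Decompose W into rank-1 "mechanisms" via SVD (a stand-in for a learned
# parameter decomposition; APD/SPD learn these components instead).
U, S, Vt = np.linalg.svd(W)
mechanisms = [S[i] * np.outer(U[:, i], Vt[i]) for i in range(4)]

# Faithfulness: the mechanisms sum back to the original weights, so the
# decomposed model computes exactly what the original model does.
assert np.allclose(sum(mechanisms), W)
```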
I’m currently excited about tensor-transformers, which are more interpretable by design (eg you can apply linear algebra in a principled way, since a tensor is a generalization of a matrix). Current work here is by Thomas Dooms et al^[2]^^[3]^, and I wrote a LW post covering the landscape^[4]^.
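As a flavor of what “apply linear algebra to the weights” means here, a bilinear MLP (the building block in ref ^[2]^) is exactly a third-order tensor; this toy numpy sketch (sizes and values are illustrative) shows the equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
W, V = rng.normal(size=(d, d)), rng.normal(size=(d, d))

def bilinear_mlp(x):
    # Bilinear MLP: elementwise product of two linear maps, no nonlinearity.
    return (W @ x) * (V @ x)

# The same layer written as a third-order tensor: B[k] is the interaction
# matrix for output k, so output_k = x^T B[k] x.  Having the layer in this
# form is what enables weight-based analysis (eigendecomposing each B[k], etc).
B = np.einsum('ki,kj->kij', W, V)

x = rng.normal(size=d)
assert np.allclose(bilinear_mlp(x), np.einsum('i,kij,j->k', x, B, x))
```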
Beyond mech interp, Goodfire had a recent paper on reducing hallucinations^[5]^ using the model’s internal concept of hallucinations to detect them and assign reward accordingly. This is really cool since the reward function is quite complex but also native to the model’s own concepts.
[disclaimer: currently just on my phone, so had Claude add links. Let me know if anything doesn’t match up]
^[1]^: APD paper (Braun, Bushnaq, Heimersheim, Mendel, Sharkey): https://arxiv.org/abs/2501.14926; SPD followup: https://www.goodfire.ai/research/stochastic-param-decomp
^[2]^: Bilinear MLPs Enable Weight-Based Mech Interp (Pearce, Dooms, Rigg, Oramas, Sharkey): https://arxiv.org/abs/2410.08417
^[3]^: Compositionality Unlocks Deep Interpretable Models (Dooms, Gauderis, Wiggins, Oramas): https://arxiv.org/abs/2504.02667
^[4]^: Tensor-Transformer Variants are Surprisingly Performant: https://www.lesswrong.com/posts/hp9bvkiN3RzHgP9cq/
^[5]^: RLFR: Reinforcement Learning from Feature Rewards: https://www.goodfire.ai/research/rlfr
Pseudo-flat tax formula:
Assume utility is logarithmic in income, and the goal is to set the experienced tax burden to be constant.
Then the average tax rate, where a is a parameter controlling the experienced tax burden and z is the break-even point, is as follows:
f(x) = 1 − (x/z)^(a−1)

x is the input income, and f(x) is the average tax rate.
Fans of math exercises related to toy models of taxation might enjoy this old post of mine.
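A quick numeric check of the formula (the values of z and a below are illustrative): post-tax income works out to z·(x/z)^a, so measured as log income relative to the break-even point z, everyone keeps the same fraction a of their pre-tax utility.

```python
import math

def avg_tax_rate(x, z, a):
    """Average tax rate f(x) = 1 - (x/z)**(a - 1).
    z is the break-even income (zero tax); a in (0, 1) is the fraction of
    z-relative log utility the taxpayer keeps."""
    return 1 - (x / z) ** (a - 1)

z, a = 30_000.0, 0.8
for x in [60_000.0, 120_000.0, 500_000.0]:
    post_tax = x * (1 - avg_tax_rate(x, z, a))
    # Utility measured in log income relative to the break-even point z:
    pre_u = math.log(x / z)
    post_u = math.log(post_tax / z)
    # The experienced burden is constant: post-tax utility is always a * pre-tax.
    assert abs(post_u - a * pre_u) < 1e-9
```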
What is x and why isn’t it cancelling?
x is the initial income, and I forgot to cancel it. Good point.
Turns out, it’s far simpler than I had it as.
Can a eat that −1?
It could, but a represents the fraction of utility remaining.
Maybe the more natural thing would be to have a be the effective tax rate, and have it be (z/x)^a.