What’s going on with interpretability these days?
I found the whole monosemantic sparse autoencoder idea interesting, but this was 2023 and it’s now 2026.
SAEs (sparse autoencoders) have had several problems over the years (eg feature splitting, cross-layer features, non-causal features), along with many proposed fixes for those issues. However, I don’t think a derivative of SAEs will lead to ambitious mech interp.
The Apollo (now Goodfire) folks (Lee Sharkey, Lucius Bushnaq, and Dan Braun) have worked on Parameter Decomposition (PD)^[1]^, a weight-based approach intended to improve over SAEs in a few ways:
- make cross-layer features a natural object (ie just define a weight-“mechanism” over multiple layers)
- faithfulness to the original computation: if all your “mechanisms” (their term for features) sum to the original model, then the decomposition is faithful
- multi-dimensional features are also a natural object
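To make the faithfulness point concrete, here’s a minimal toy sketch. It uses SVD components as stand-ins for learned mechanisms (the actual APD/SPD methods learn their decompositions; this is just an illustration of the sum-to-the-original criterion):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))  # toy weight matrix standing in for a model layer

# Decompose W into rank-1 "mechanisms" via SVD (a stand-in for a learned
# parameter decomposition; APD/SPD learn these components instead).
U, S, Vt = np.linalg.svd(W)
mechanisms = [S[i] * np.outer(U[:, i], Vt[i]) for i in range(4)]

# Faithfulness: the mechanisms sum back to the original weights, so the
# decomposed model computes exactly what the original model does.
assert np.allclose(sum(mechanisms), W)
```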
I’m currently excited about tensor-transformers, which are more interpretable by design (eg you can apply linear algebra in a principled way, since a tensor is a generalization of a matrix). Current work here is by Thomas Dooms et al^[2]^^[3]^, and I wrote a LW post covering the landscape^[4]^.
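As a flavor of what “apply linear algebra to the weights” means here, a bilinear MLP (the building block in ref ^[2]^) is exactly a third-order tensor; this toy numpy sketch (sizes and values are illustrative) shows the equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
W, V = rng.normal(size=(d, d)), rng.normal(size=(d, d))

def bilinear_mlp(x):
    # Bilinear MLP: elementwise product of two linear maps, no nonlinearity.
    return (W @ x) * (V @ x)

# The same layer written as a third-order tensor: B[k] is the interaction
# matrix for output k, so output_k = x^T B[k] x.  Having the layer in this
# form is what enables weight-based analysis (eigendecomposing each B[k], etc).
B = np.einsum('ki,kj->kij', W, V)

x = rng.normal(size=d)
assert np.allclose(bilinear_mlp(x), np.einsum('i,kij,j->k', x, B, x))
```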
Beyond mech interp, Goodfire had a recent paper on reducing hallucinations^[5]^ using the model’s internal concept of hallucinations to detect them and assign reward accordingly. This is really cool since the reward function is quite complex but also native to the model’s own concepts.
[disclaimer: currently just on my phone, so had Claude add links. Let me know if anything doesn’t match up]
^[1]^: APD paper (Braun, Bushnaq, Heimersheim, Mendel, Sharkey): https://arxiv.org/abs/2501.14926; SPD followup: https://www.goodfire.ai/research/stochastic-param-decomp
^[2]^: Bilinear MLPs Enable Weight-Based Mech Interp (Pearce, Dooms, Rigg, Oramas, Sharkey): https://arxiv.org/abs/2410.08417
^[3]^: Compositionality Unlocks Deep Interpretable Models (Dooms, Gauderis, Wiggins, Oramas): https://arxiv.org/abs/2504.02667
^[4]^: Tensor-Transformer Variants are Surprisingly Performant: https://www.lesswrong.com/posts/hp9bvkiN3RzHgP9cq/
^[5]^: RLFR: Reinforcement Learning from Feature Rewards: https://www.goodfire.ai/research/rlfr
Pseudo-flat tax formula:
Assume utility is logarithmic in income, and the goal is to set the experienced tax burden to be constant.
Then the average tax rate, where a is a parameter controlling the experienced tax burden and z is the break-even point, is as follows:
f(x) = 1 − (x/z)^(a−1)

x is the input income, and f(x) is the average tax rate.
Fans of math exercises related to toy models of taxation might enjoy this old post of mine.
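A quick numeric check of the formula (the values of z and a below are illustrative): post-tax income works out to z·(x/z)^a, so measured as log income relative to the break-even point z, everyone keeps the same fraction a of their pre-tax utility.

```python
import math

def avg_tax_rate(x, z, a):
    """Average tax rate f(x) = 1 - (x/z)**(a - 1).
    z is the break-even income (zero tax); a in (0, 1) is the fraction of
    z-relative log utility the taxpayer keeps."""
    return 1 - (x / z) ** (a - 1)

z, a = 30_000.0, 0.8
for x in [60_000.0, 120_000.0, 500_000.0]:
    post_tax = x * (1 - avg_tax_rate(x, z, a))
    # Utility measured in log income relative to the break-even point z:
    pre_u = math.log(x / z)
    post_u = math.log(post_tax / z)
    # The experienced burden is constant: post-tax utility is always a * pre-tax.
    assert abs(post_u - a * pre_u) < 1e-9
```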
What is x and why isn’t it cancelling?
x is the initial income, and I forgot to cancel it. Good point.
Turns out, it’s far simpler than I had it as.
Can a eat that −1?
It could, but a represents the fraction of utility remaining.
Maybe the more natural thing would be to have a be the effective tax rate, and have it be (z/x)^a.