this feels like a subtweet of our recent paper on circuit sparsity. I would have preferred a direct response to our paper (or any other specific paper/post/person), rather than a dialogue against a hypothetical interlocutor.
I think this post is unfairly dismissive of the idea that we can guess aspects of the true ontology and iterate empirically towards it. it makes it sound as if you have to guess a lot of things right about the true ontology before you can make any empirical progress at all. this is a reasonable view of the world, but I think evidence so far rules out the strongest possible version of this claim.
SAEs are basically making the guess that the true ontology should activate kinda sparsely. this is clearly not enough to pin down the true ontology, and obviously at some point activation sparsity stops being beneficial and starts hurting. but SAE features seem closer to the true ontology than the neurons are, even if they are imperfect. this should be surprising if you think that you need to be really correct about the true ontology before you can make any progress! making the activations sparse is this kind of crude intervention, and you can imagine a world where SAEs don’t find anything interesting at all because it’s much easier to just find random sparse garbage, and so you need more constraints before you pin down something even vaguely reasonable. but we clearly don’t live in that world.
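to make the "activation sparsity as a guess" concrete, here's a minimal sketch of the objective an SAE optimizes: reconstruct the activations while an L1 term pushes the feature activations toward sparsity. all sizes, names, and coefficients here are illustrative, not from any particular paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 8, 32  # hypothetical dimensions, for illustration only

W_enc = rng.normal(size=(d_model, d_sae)) * 0.1
W_dec = rng.normal(size=(d_sae, d_model)) * 0.1
b_enc = np.zeros(d_sae)

def sae_loss(x, l1_coeff=1e-3):
    """Reconstruction error plus an L1 penalty on feature activations.
    The L1 term encodes the guess that the true features activate sparsely."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU feature activations
    x_hat = f @ W_dec                        # reconstruction of the input
    recon = np.mean((x - x_hat) ** 2)
    sparsity = l1_coeff * np.abs(f).sum(axis=-1).mean()
    return recon + sparsity, f

x = rng.normal(size=(4, d_model))
loss, feats = sae_loss(x)
```

the point is how weak this constraint is: nothing in the loss says anything about what the features mean, only that few of them fire at once.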
our circuit sparsity work adds an additional constraint: we also enforce that the interactions between features are sparse. (I think of the part where we accomplish this by training new models from scratch as an unfortunate side effect; it just happens to be the best way to enforce this constraint.) this is another kind of crude intervention, but our main finding is that it again gets us slightly closer to the true concepts; circuits that used to require a giant pile of SAE features connected in an ungodly way can now be expressed simply. this again seems to suggest that we have gotten closer to the true features.
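the extra constraint can be sketched the same way: penalize the weights connecting features in adjacent layers, so each feature ends up reading from only a handful of upstream features. this is a toy illustration of the idea, not the actual training objective from our paper; the names and thresholds are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
n_up, n_down = 16, 16  # hypothetical feature counts in adjacent layers
W = rng.normal(size=(n_up, n_down)) * 0.1

def interaction_l1(W, coeff=1e-2):
    """L1 penalty on feature-to-feature weights. During training this
    drives most entries toward zero, sparsifying the interaction graph."""
    return coeff * np.abs(W).sum()

def circuit_edges(W, thresh=0.05):
    """Count surviving edges above a threshold: a crude proxy for how
    simply a circuit can be expressed once interactions are sparse."""
    return int((np.abs(W) > thresh).sum())
```

with only the activation-sparsity constraint, `circuit_edges` would typically stay large; the interaction penalty is what lets the same computation be written down with far fewer edges.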
if you believe in natural abstractions, then it should at least be worth trying to dig down this path and slowly add more constraints, seeing whether it makes the model nicer or less nice, and iterating.
It isn’t. This post has been in my drafts for ages and I just got around to slapping a passable coat of paint on it and shipping it.
Aside: This is why subtweeting is bad. It makes people paranoid that people are subtweeting them when they aren’t.