I have some empirical observations to lend here. I recently spent a few months optimizing a DNA language model for intrinsic interpretability.
There were, as I had hoped, many neurons corresponding neatly to interpretable concepts. This was enough for my purposes: I was trying to build a tool, not solve interpretability or alignment. Random sequences are riddled with functional promoters and other motifs, and we synthetic biologists didn’t have anything like a universal debugger, nor a universal annotator for poorly studied species—even a flawed tool would be a major step forward.
The best activation function (by my arbitrary judgment, sifting endlessly through neurons) was a combination of continuous approximations to the activation functions in Deep L0 Encoders, further constrained to be nonnegative and unit norm. I arrived at it through several months of trial and error and only realized the connection after the fact. Note that no sparsity penalties were added to the loss, and the model trained just fine.
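For concreteness, here is a minimal PyTorch sketch of the kind of activation this describes, assuming a linear-ramp continuous relaxation of hard thresholding (the signature operation in Deep L0 Encoders). The exact relaxation, the threshold values, and the module name are my assumptions, not the actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThresholdedUnitNormActivation(nn.Module):
    """Hypothetical sketch: a continuous relaxation of hard thresholding,
    clamped to be nonnegative, then normalized to unit L2 norm across
    the channel dimension. Details are assumptions, not the author's code."""

    def __init__(self, theta: float = 0.5, delta: float = 0.1):
        super().__init__()
        self.theta = theta  # threshold below which activations are zeroed
        self.delta = delta  # width of the ramp joining zero to the identity

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Nonnegativity: only positive pre-activations survive.
        x = F.relu(x)
        # Continuous approximation of hard thresholding: zero below theta,
        # identity above theta + delta, linear ramp in between.
        ramp = (self.theta + self.delta) * (x - self.theta) / self.delta
        x = torch.where(
            x <= self.theta,
            torch.zeros_like(x),
            torch.where(x >= self.theta + self.delta, x, ramp),
        )
        # Unit-norm constraint over channels (last dim assumed to be channels).
        return F.normalize(x, p=2, dim=-1, eps=1e-8)
```

The ordering (ReLU, then threshold, then normalize) is one plausible composition; unit-normalizing across channels makes every position's activation vector directly comparable, which is what makes the Gram lens below meaningful.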
While it was often easy to interpret many neurons post hoc, I could never have guessed beforehand what the (superficially apparent) ontology would be. For instance, CRP and FNR are two 22-base-pair palindromic motifs; I had hoped to find a “CRP neuron” and an “FNR neuron,” but instead found a group of neurons, each active at one position within these palindromes. AI-for-bio people love to use linear probes to establish the “presence of a concept” in their models; I now feel that this is bogus. The model modeled CRP fine; it just had no use for a single direction spanning the whole motif.
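For reference, this is the kind of probe I mean; the pooling choice, shapes, and names are illustrative assumptions. The trouble is that a probe like this can score well whether the model has one “CRP direction” or 22 position-specific neurons, so its success says little about the model's actual ontology.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical linear probe for "CRP site present in this sequence".
# acts: (n_seqs, seq_len, n_neurons) activations; labels: (n_seqs,) in {0, 1}.
def probe_for_concept(acts: np.ndarray, labels: np.ndarray) -> float:
    pooled = acts.max(axis=1)  # max-pool over positions: one vector per sequence
    clf = LogisticRegression(max_iter=1000).fit(pooled, labels)
    # High accuracy is typically reported as "the concept is linearly represented."
    return float(clf.score(pooled, labels))
```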
However, the most helpful tool was visualizing the pairwise similarities between the activations at different sequence positions (i.e., their Gram matrix). How similar two activations were was often driven primarily by their offset along the sequence, unless the “feature” being represented was periodic in nature, like a beta-barrel. I don’t think that my more-interpretable activations, nor SAEs, nor any obvious-to-me kind of weight or activation sparsity technique, could have made this pattern much clearer with ~any degree of effort. (At least, I have no clue how I would counterfactually have spotted it.)
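Concretely, the Gram lens as I'm describing it is just this; the shapes and the cosine normalization are my choices for the sketch.

```python
import numpy as np
import matplotlib.pyplot as plt

def gram_lens(acts: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarities between per-position activation vectors.
    acts: (seq_len, n_neurons) activations for a single sequence."""
    normed = acts / (np.linalg.norm(acts, axis=1, keepdims=True) + 1e-8)
    return normed @ normed.T  # (seq_len, seq_len) similarity image

sims = gram_lens(np.random.randn(512, 256))  # stand-in activations
plt.imshow(sims, cmap="RdBu_r", vmin=-1, vmax=1)
plt.xlabel("position"); plt.ylabel("position")
plt.colorbar(label="cosine similarity")
plt.show()
```

Offset-dependent similarity shows up as banding parallel to the diagonal; a periodic feature shows up as off-diagonal stripes spaced at the period.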
I’d call this an empirical win for the thesis that unless you have a way to get some level of insight into how the activations are structured without presuming that structure, your method ain’t gonna have feedback loops.
(Interestingly, the Gram-lens images for a given protein from my small convolutional bacterial DNA model were obviously visually similar to those from a much more heavily trained all-of-life protein Transformer, including the offset-dependent similarity.)
There is certainly still structure I can’t see. The final iteration of the model is reverse-complement-equivariant (RC-equivariant) by design. RC-equivariant models trained far more quickly than unconstrained ones, but whereas unconstrained models learned many invariant features, equivariant ones never appeared to. A partial RC-equivariance, learned in an unconstrained model, would not be made clearer by sparse activations or by the Gram matrices (the paired directions are orthogonal). I’m unsure what kind of tool would reveal this kind of equivariance if you weren’t already looking for it.
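To make the difficulty concrete, here is a hedged sketch of what a check for RC structure even looks like; the one-hot layout, the model interface, and the choice of feature-space map are all my assumptions. The catch is the last one: in an unconstrained model, the linear map pairing RC-partner features is unknown (and, as noted, the paired directions are orthogonal), so a test like this only catches the trivial, invariant special case.

```python
import torch

def reverse_complement(x: torch.Tensor) -> torch.Tensor:
    # x: one-hot DNA, shape (batch, 4, seq_len), channels ordered A, C, G, T.
    # Flipping the channel axis swaps A<->T and C<->G (the complement);
    # flipping the position axis reverses the sequence.
    return x.flip(dims=[1, 2])

@torch.no_grad()
def rc_gap(model, x: torch.Tensor) -> float:
    # Exact RC-equivariance means model(RC(x)) = P(reverse(model(x))) for some
    # fixed feature-space map P. We can only test a P we already know; here
    # P = identity, which detects only RC-*invariant* per-position features.
    out, out_rc = model(x), model(reverse_complement(x))
    return (out_rc - out.flip(dims=[-1])).abs().max().item()
```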