The model was trained to predict the next DNA token. Nobody told it about evolution or gave it a phylogenetic tree as a training signal. But the model needed to understand evolutionary relationships in order to predict DNA well, and so it built a structured geometric representation of those relationships as part of its internal computation, and the representation was good enough that you could extract it with interpretability tools and compare it meaningfully to the ground truth.
I haven’t read Goodfire’s post and I’ve thought about this for like 3 minutes[1], but my guess is that this is a vast exaggeration of the result, with the correct claim being much closer to “it’s good to have similar DNAs assigned similar representations, and one can read off phylogenetic distance from similarity”. Like, my guess is that it’s almost entirely not about the model understanding evolutionary relationships or using representations of evolutionary relationships.
E.g. in a toy version of evolution where the only thing that can happen is single nucleotide changes, you could read off phylogenetic distance really well from the distance between completely naive embedding vectors, constructed by just having one coordinate of the vector for each position in the genome, with its value a distinct number for A/T/G/C respectively. It’d be a vast overstatement to say this model understands evolutionary relationships or uses them in its internal computation.
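A minimal numpy sketch of this toy model (all names and sizes are made up for illustration): evolve a single lineage by single-nucleotide changes only, embed each genome with the naive one-coordinate-per-position encoding, and check how well embedding distance tracks the number of generations separating two genomes.

```python
import numpy as np

rng = np.random.default_rng(0)
GENOME_LEN, STEPS, MUTS_PER_STEP = 2000, 20, 5  # arbitrary toy sizes

def mutate(genome):
    """Apply single-nucleotide changes only, as in the toy model."""
    child = genome.copy()
    sites = rng.choice(GENOME_LEN, size=MUTS_PER_STEP, replace=False)
    child[sites] = rng.integers(0, 4, size=MUTS_PER_STEP)  # 0..3 encode A/T/G/C
    return child

# One lineage: genome i is i generations away from the root.
lineage = [rng.integers(0, 4, size=GENOME_LEN)]
for _ in range(STEPS):
    lineage.append(mutate(lineage[-1]))

# Naive embedding: one coordinate per genome position, value = base code.
# Compare phylogenetic (generation) distance with embedding mismatch distance.
tree_dist, embed_dist = [], []
for i in range(len(lineage)):
    for j in range(i + 1, len(lineage)):
        tree_dist.append(j - i)  # phylogenetic distance in the toy model
        embed_dist.append(int((lineage[i] != lineage[j]).sum()))  # mismatches

corr = float(np.corrcoef(tree_dist, embed_dist)[0, 1])
```

In this toy setting `corr` comes out near 1: phylogenetic distance is readable straight off distances between embeddings that were built with no notion of evolution at all.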
addendum: I skimmed Goodfire’s post now and saw they have a section on “disentangling phylogenetics from sequence similarity”. But imo what they do there completely fails to rule out this simple reason for phylogeny being readable from representation vectors. It probably would kill the very naive construction I gave, but it doesn’t kill reasonable, mildly more sophisticated variants.
p.s.: Another thing in the same vein: “you can fit a pretty high-accuracy linear probe for X on a model’s activations” is much weaker than e.g. “the model represents X as a basic variable”, “the computation implemented in the weights uses X”, “the model thinks in terms of X”, “the model understands X”, or even “the model computes X”. (Consider for example how various properties could be probed well from RAM contents while a computer game is running that aren’t remotely computed by the source code.)

[1] though I’ve thought much more about related questions in the past
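A toy illustration of the p.s. (everything here is made up; no real game or RAM layout is modeled): the “program” only ever stores a random linear scattering of its internal variables, and never computes the probed property anywhere, yet a least-squares linear probe fit on raw snapshots recovers that property with high accuracy.

```python
import numpy as np

rng = np.random.default_rng(1)
N_VARS, N_RAM, N_SNAPSHOTS = 8, 64, 2000  # arbitrary toy sizes

# Variables the toy program actually updates each frame.
hidden = rng.normal(size=(N_SNAPSHOTS, N_VARS))
# RAM snapshot = fixed random linear "memory layout" of those variables, plus noise.
layout = rng.normal(size=(N_VARS, N_RAM))
ram = hidden @ layout + 0.1 * rng.normal(size=(N_SNAPSHOTS, N_RAM))

# A property the program never computes anywhere: is variable 0 positive?
label = hidden[:, 0] > 0

# Least-squares linear probe on raw RAM snapshots (in-sample accuracy).
w, *_ = np.linalg.lstsq(ram, np.where(label, 1.0, -1.0), rcond=None)
accuracy = float(((ram @ w > 0) == label).mean())
```

The probe is near-perfect even though no line of the “program” ever computed the label, which is exactly why a good linear probe on its own is weak evidence about what a system computes.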
it’s good to have similar DNAs assigned similar representations, and one can read off phylogenetic distance from similarity
You’re bang-on here. The averaging method would prevent reading off raw sequence similarity, but it wouldn’t remove genome-level characters such as k-mer spectra, GC content, or average linkage disequilibrium, which are correlated with phylogeny but do not determine phylogenetic placement. This point is made in the second half of the Goodfire article, yet the article doesn’t note that it contradicts the first half.

I would be very surprised if Evo etc. faithfully represented a phylogeny, because phylogenies do not contain much information about genomic characteristics except at very broad scales. A better test would be to build a phylogeny from the representations and compare the trees themselves, rather than comparing distances in representation space with phylogenetic distances. Phylogenies are used for their topology and branch lengths, of which phylogenetic distance is a poor summary.
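For concreteness, two of the genome-level characters mentioned above can be computed in a few lines (a hedged sketch; the sequences are arbitrary toy strings, and `gc_content`/`kmer_spectrum` are illustrative helpers, not any standard library API):

```python
import numpy as np
from collections import Counter
from itertools import product

def gc_content(genome: str) -> float:
    """Fraction of G/C bases -- a simple genome-level character."""
    return sum(b in "GC" for b in genome) / len(genome)

def kmer_spectrum(genome: str, k: int = 3) -> np.ndarray:
    """Normalized k-mer counts in fixed lexicographic order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    vec = np.array([counts[m] for m in kmers], dtype=float)
    return vec / vec.sum()

a = "ATGCGCGTATGCGCTTAGGC" * 5  # GC-rich toy genome
b = "ATGAATTTATAAATTTTAGA" * 5  # AT-rich toy genome

# Such summary statistics survive averaging over positions, so they remain
# readable from averaged representations even when raw sequence overlap doesn't.
dist = float(np.abs(kmer_spectrum(a) - kmer_spectrum(b)).sum())
```

Characters like these correlate with phylogeny (related genomes tend to have similar spectra and GC content) without pinning down phylogenetic placement, which is the confound at issue.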
I adjusted the text about the Goodfire discovery a bit, so it hopefully addresses your main concern, which is likely valid.
But more generally, the entire point of that discovery is that the manifold is compact and can be detected, not an abstract claim that the model somehow represents distances. I think it is evidence that the model does use these representations. In any case, it’s a finding about how foundation models organize features geometrically, and it has methodological implications for interpretability under any interpretation of what the features themselves represent.
Regardless, the Goodfire result is in my post more for storytelling than as load-bearing evidence.
But more generally, the entire point of that discovery is that the manifold is compact and can be detected, not an abstract claim that the model somehow represents distances. I think it is evidence that the model does use these representations. In any case, it’s a finding about how foundation models organize features geometrically, and it has methodological implications for interpretability under any interpretation of what the features themselves represent.
I agree this is evidence that the model uses these representations[1], I just think it is weak evidence, because you’d expect the same thing to show up even if the model doesn’t use these representations. (And given that it is weak evidence, we roughly speaking have to ignore it and stick with our priors from other considerations when judging whether the model uses these representations.) I don’t just think that the manifold existing isn’t that meaningful; I also think the manifold being “compact”[2] or detectable isn’t that meaningful either. In my toy naive DNA embedding example, I think one can read off phylogenetic distances from projections of the points onto a completely random low-dimensional subspace.[3]

When I’m playing a sufficiently high-dimensional video game, I’d guess that there is a simple probe on RAM contents which distinguishes with high accuracy whether I’m really tired vs not at all tired. I’m not that familiar with how RAM positions are typically allocated to variables, but I’d pretty strongly guess that you can usually do this with a linear probe, i.e. that there is a one-dimensional subspace of my video game program’s activations which captures this property. I would also bet that such a subspace is findable by gradient descent from a random seed.[4] And yet: the naive DNA representation doesn’t use phylogeny at all, and my video game program doesn’t think about whether I’m tired at all.

[1] well, because if the method gave a result worse than this, that would be evidence that the model doesn’t use these representations

[2] btw I’m guessing you’re just pointing at the manifold being a low-dimensional subspace?
[3] Roughly speaking, as long as one can read off phylogenetic distances from distances in the big space of dimension d, one will also be able to read them off in a random subspace of dimension on the order of log n (for n data points), by Johnson-Lindenstrauss. One could claim that the dimension of the subspace one finds in practice is smaller than what one would expect from mechanisms of the sort I’m proposing, but I strongly doubt anyone in this epistemic pipeline has thought this sort of thing through.
[4] It isn’t quite linear regression, but it’s close to it. And linear regression is a convex optimization problem, so (roughly speaking) gradient descent works.
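Both the Johnson-Lindenstrauss point and the gradient-descent point can be checked numerically. A minimal numpy sketch (all sizes arbitrary): project random high-dimensional points onto a random low-dimensional subspace, confirm that pairwise distances are roughly preserved, then recover a linear probe on the projected points by plain gradient descent, relying on the convexity of the least-squares objective.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 60, 4000, 300  # points, ambient dimension, random-subspace dimension

X = rng.normal(size=(n, d))               # points in the big space
P = rng.normal(size=(d, k)) / np.sqrt(k)  # random projection to k dims
Y = X @ P

def pairwise(Z):
    """All pairwise Euclidean distances via the squared-norm identity."""
    sq = (Z ** 2).sum(1)
    d2 = sq[:, None] + sq[None, :] - 2 * Z @ Z.T
    return np.sqrt(np.maximum(d2, 0.0))

iu = np.triu_indices(n, 1)
ratios = pairwise(Y)[iu] / pairwise(X)[iu]
max_distortion = float(np.abs(ratios - 1).max())  # JL: small for k ~ log n

# Convex probe fitting: recover coordinate X[:, 0] from the projection Y
# by plain gradient descent on the least-squares loss.
target = X[:, 0]
w = np.zeros(k)
lr = 1.0 / np.linalg.eigvalsh(Y.T @ Y / n).max()  # step below 1/L -> convergence
for _ in range(2000):
    w -= lr * Y.T @ (Y @ w - target) / n
probe_r2 = float(1 - ((Y @ w - target) ** 2).mean() / target.var())
```

Distances survive the random projection nearly unchanged, and gradient descent from a zero initialization finds an essentially exact linear probe, consistent with both footnotes.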