Extracting Performant Algorithms Using Mechanistic Interpretability

A Prequel: The Tree of Life Inside a DNA Language Model

Last year, researchers at Goodfire AI took Evo 2, a genomic foundation model, and found, quite literally, the evolutionary tree of life inside. The phylogenetic relationships between thousands of species were encoded as a curved manifold in the model’s internal activations, with geodesic distances along that manifold tracking actual evolutionary branch lengths. Bacteria that diverged hundreds of millions of years ago were far apart on the manifold, and closely related species were nearby.

The model was trained to predict the next DNA token. Nobody told it about evolution or gave it a phylogenetic tree as a training signal. But the model needed to encode evolutionary relationships in order to predict DNA well, and so it built a structured geometric representation of those relationships as part of its internal computation, and the representation was good enough that you could extract it with interpretability tools and compare it meaningfully to the ground truth.

I saw this and decided to apply the same approach to a different class of biological foundation models: those trained on single-cell data.

If Evo 2 learned the tree of life from raw DNA, what did scGPT learn about how human cells develop?

Finding the Manifold

For those unfamiliar with the biology side: scGPT is a transformer model trained on millions of single-cell gene expression profiles. Each cell in your body expresses thousands of genes at varying levels, and a single-cell RNA sequencing experiment measures those expression levels for potentially hundreds of thousands of individual cells simultaneously. scGPT was pre-trained on this kind of data in a generative fashion, learning to predict masked gene expression values from context.

The question I wanted to answer was: does scGPT encode, somewhere in its attention tensor, a compact geometric representation of some biological processes? And if so, can I find it without knowing in advance exactly where to look?

To attack this systematically, I used a two-phase research loop driven by an AI executor-reviewer pair operating under pre-registered quality gates. Phase 1 was a broad hypothesis search: the loop explored a large combinatorial space of candidate manifold hypotheses by varying the biological target (developmental ordering, regulatory structure, communication geometry), the featurization strategy (attention drift, raw embeddings, mixed operators), and the geometric fitting method (Isomap, geodesic MDS, a technique called Locally Euclidean Transformations), all applied across the full 12-layer × 8-head scGPT attention tensor, which means 96 individual attention units to screen.
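To make the screening step concrete, here is a minimal numpy sketch of what one pass over the 12 × 8 attention grid looks like. Everything in it is illustrative: random features with a planted ordering stand in for real attention-derived features, and a first-principal-direction fit stands in for the Isomap/geodesic fitting methods the actual loop used.

```python
import numpy as np

rng = np.random.default_rng(0)

def unit_features(layer, head, n_cells=150, dim=32):
    """Stand-in for attention-derived features of one (layer, head) unit.

    Here: random features with a planted 1-D developmental ordering."""
    t = np.linspace(0.0, 1.0, n_cells)
    X = rng.normal(size=(n_cells, dim))
    X[:, 0] += 6.0 * t  # planted progression signal
    return X, t

def score_unit(layer, head):
    """Fit a 1-D linear manifold (first principal direction) and score how
    well position along it tracks the ordering t."""
    X, t = unit_features(layer, head)
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    coord = Xc @ Vt[0]
    return abs(np.corrcoef(coord, t)[0, 1])

# Screen the full 12-layer x 8-head grid (96 units) and rank by score.
scores = {(l, h): score_unit(l, h) for l in range(12) for h in range(8)}
best_unit = max(scores, key=scores.get)
```

In the real pipeline the scoring is far richer (geodesic structure, controls, quality gates), but the outer loop has this shape: featurize each unit, fit a low-dimensional geometry, score it, rank the 96 candidates.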

What came out of Phase 1 was a robust positive hit: hypothesis H65, which identified a compact, roughly 8-to-10-dimensional manifold in specific attention heads where positions along the manifold corresponded to how far cells had progressed through hematopoietic differentiation. Stem cells clustered at one end; terminally differentiated blood cell types (T cells, B cells, monocytes, macrophages) spread out along distinct branches at the other end; and the branching topology matched the known developmental hierarchy with statistically significant branch structure that held up under stringent controls.

Then I switched to Phase 2, a more manual investigation: methodological closure tests, confidence intervals, structured holdouts, and external validation. I validated the manifold on a non-overlapping panel from Tabula Sapiens and then confirmed it via frozen-head zero-shot transfer to an entirely independent multi-donor immune panel. You can explore this manifold yourself and compare different extraction variants in an interactive 3D viewer.

But Does the Extracted Algorithm Actually Work?

I think finding a biologically meaningful manifold inside a foundation model is, on its own, cool. But the question I actually cared about was: can you take this geometric object out of the model and use it as a standalone method that does useful work?

To find out, I developed a three-stage extraction pipeline:

  1. I directly exported the frozen attention weight matrices from the relevant heads, with no retraining, just literally reading out the learned linear operator.

  2. I attached a lightweight learned adaptor that projects the raw attention output into the manifold’s coordinate system.

  3. And then I added a task-specific readout head (for classification or pseudotime prediction).
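The three stages can be sketched in a few lines of numpy. This is a toy: random matrices stand in for the exported scGPT weights, an unsupervised SVD projection stands in for the learned adaptor, and all shapes and the least-squares fit are illustrative assumptions, not the paper's actual code.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cells, d_model, d_head, d_manifold = 300, 64, 16, 10

# Stage 1: frozen attention weights, exported as-is and never trained here.
W_frozen = rng.normal(size=(d_model, d_head))

X = rng.normal(size=(n_cells, d_model))  # cell representations
y = X @ rng.normal(size=d_model)         # synthetic pseudotime target

H = X @ W_frozen                         # read out the frozen operator

# Stage 2: lightweight adaptor into a 10-D manifold coordinate system.
# (Here an unsupervised SVD projection; the paper uses a small learned module.)
Hc = H - H.mean(axis=0)
_, _, Vt = np.linalg.svd(Hc, full_matrices=False)
adaptor = Vt[:d_manifold].T
Z = Hc @ adaptor                         # manifold coordinates

# Stage 3: task-specific readout, fit by least squares on the coordinates.
w, *_ = np.linalg.lstsq(Z, y - y.mean(), rcond=None)
pred = Z @ w + y.mean()
```

The point of the structure is visible even in the toy: all the expressive machinery sits in `W_frozen`, which is read out and never updated, while the trainable pieces downstream of it are tiny.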

The key property of this pipeline is that the heavy lifting, the actual biological knowledge, comes entirely from the frozen attention weights that scGPT learned during pre-training. The adaptor and readout are small and cheap to train, and they never touch the original dataset the model was pre-trained on. What you end up with is a standalone algorithm you can ship as a file and run independently of scGPT.

So: how does it perform?

I benchmarked the extracted algorithm against a lineup of established methods that biologists actually use in practice: scVI (a deep generative model for single-cell data), Palantir (a pseudotime method based on diffusion maps and Markov chains), Diffusion Pseudotime (the Scanpy implementation), CellTypist (a logistic-regression-based cell type classifier trained on a large reference atlas), PCA, and raw-expression baselines. These are the standard tools in the single-cell bioinformatics toolkit, developed and refined by domain experts over years.

On pseudotime-depth ordering, which measures how well a method recovers the true developmental progression from stem cells to mature blood cells, the extracted algorithm appeared to be the best, significantly outperforming every tested alternative in paired split-level statistics. On classification (distinguishing cell types), the picture was less clear-cut but still strong: the extracted head led on branch balanced accuracy and on key subtype discrimination tasks such as CD4/CD8 T cell separation and monocyte/macrophage distinction. On some stage-level and branch-level macro-F1 metrics, diffusion-style baselines or raw expression had the edge, so this is not a clean sweep, but the extracted algorithm is solidly in the top tier across the board, and dominant on the most biologically meaningful endpoint.

Now, you might reasonably ask: is this just the result of having a fancier probe? Maybe any sufficiently flexible function fitted on top of scGPT’s embeddings would do equally well, and the “manifold discovery” part is not contributing anything real. I tested this. A 3-layer MLP with 175,000 trainable parameters, fitted on frozen scGPT average-pooled embeddings, was significantly worse than the extracted 10-dimensional head on 6 out of 8 classification endpoints. And the extracted head accomplished this while being 34.5 times faster to evaluate across a full 12-split campaign, with roughly 1,000 times fewer trainable parameters.
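For intuition on the size gap, here is back-of-envelope parameter arithmetic. The embedding width, hidden widths, and class count below are assumptions chosen to land near the quoted 175,000-parameter probe; the actual architectures may differ.

```python
def n_params(dims):
    """Trainable parameters of a dense net with the given layer widths
    (weights + biases per layer)."""
    return sum(a * b + b for a, b in zip(dims, dims[1:]))

# A 3-layer MLP on pooled transformer embeddings (assumed 512-d, 8 classes)...
mlp = n_params([512, 256, 128, 8])   # ~165k, same ballpark as the 175k probe
# ...versus a linear readout on 10 manifold coordinates.
head = n_params([10, 8])             # 88 parameters
ratio = mlp // head                  # roughly three orders of magnitude apart
```

Almost all of the MLP's budget goes into its first layer; a readout that operates on 10 well-chosen coordinates simply has nowhere to spend parameters.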

Let me restate this: the geometric structure that mechanistic interpretability found inside scGPT’s attention heads, when extracted and used directly, outperforms the standard approach of slapping an MLP on top of the model’s embeddings. The interpretability-derived method is simultaneously more accurate, faster, and smaller.

How Small Can You Go Though?

Once you have an extracted algorithm that works, the natural next question is how much of it you actually need. Compression is interesting for practical reasons, but it is even more interesting for interpretability reasons, because the further you compress an algorithm while preserving its performance, the closer you get to understanding what it is actually doing.

The initial extracted operator pooled three attention heads from scGPT and weighed 17.5 MB. Not large by modern standards, but not trivially inspectable either. The first compression step was to ask: do we really need all three heads, or does a single one carry the essential geometry? I scanned all 96 attention units in scGPT’s tensor and found that a single unit, Layer 2, Head 5, carried substantial transferable developmental geometry on its own. The compact operator built from this single head weighed 5.9 MB and showed almost no loss compared to the three-head version on the benchmark suite.

The second compression step was more aggressive: truncated SVD on the single-head operator. This factors the weight matrix into low-rank components and throws away everything below a chosen rank threshold. At rank 64, the resulting surrogate shrinks to 0.73 MB, which is already quite tiny, and it still beats the frozen scGPT average-pool + MLP baseline on all eight pooled classification endpoints. It does incur statistically significant losses versus the dense single-head operator on 5 out of 8 endpoints, so this is not free compression. But the rank-64 version is still a better algorithm than the standard probing approach, at a fraction of a megabyte.
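The compression step itself is standard truncated SVD. A minimal sketch, with a random matrix standing in for the single-head operator and illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(512, 512))  # stand-in for the dense single-head operator

def truncated_factors(W, rank):
    """Factor W into two thin matrices keeping only the top-`rank` components."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :rank] * s[:rank], Vt[:rank]

L64, R64 = truncated_factors(W, 64)
W_rank64 = L64 @ R64             # low-rank surrogate of the operator

# Storage: 512*512 = 262,144 floats dense vs 2*512*64 = 65,536 for the factors.
```

Storing the two thin factors instead of the dense matrix is where the megabytes go, and the SVD guarantees this is the best possible rank-64 approximation in the least-squares sense.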

Now the interpretability payoff arrives. I ran a factor ablation audit on the rank-64 surrogate: systematically remove each of the 64 factors one at a time, measure how much performance drops, and rank the factors by necessity. Just four of the 64 factors accounted for 66% of the total pooled ablation impact. And when I examined what those four factors corresponded to biologically, they resolved into explicit hematopoietic gene programs.
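The audit logic is simple to sketch. Here with 16 synthetic rank-1 factors (the actual surrogate has 64) and a toy classification task whose labels are defined by the full surrogate's own output, so ablation impact is never negative; everything in this block is illustrative, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(3)
n_cells, d, n_factors = 400, 32, 16

# Synthetic rank-1 factors with two dominant singular values.
s = np.array([10.0, 5.0] + [0.5] * (n_factors - 2))
factors = [s[j] * np.outer(rng.normal(size=d), rng.normal(size=d))
           for j in range(n_factors)]

X = rng.normal(size=(n_cells, d))
readout = rng.normal(size=d)
labels = (X @ sum(factors) @ readout) > 0  # labels from the full surrogate

def accuracy(ablate=None):
    """Accuracy of the surrogate with one factor zeroed out."""
    W = sum(f for j, f in enumerate(factors) if j != ablate)
    return ((X @ W @ readout > 0) == labels).mean()

# Impact of each factor = how much accuracy drops when it is removed.
impact = {j: accuracy() - accuracy(ablate=j) for j in range(n_factors)}
ranked = sorted(impact, key=impact.get, reverse=True)
```

Ranking factors by ablation impact is what surfaces the handful that carry most of the algorithm, which is exactly the step that made the four dominant factors stand out.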

So, Mechanistic Interpretability is Becoming Dual Use

Let’s step back from the specific results for a moment and consider the high-level lesson here.

The very property that makes this result interesting is also the property that makes me cautious about applying the same techniques to large language models, because the argument runs in both directions. If you can extract an algorithm that a model uses to do something well, you can potentially also improve how the model does that thing: by identifying inefficient components, scaling the relevant circuits, composing extracted subroutines in new ways, or replacing the fuzzy learned version with a cleaner extracted one and freeing up capacity for the model to learn something else. Mechanistic interpretability, in other words, is becoming a capability amplification tool. This has long been a theoretical concern; it now looks like a practical one.

Consider a few scenarios. You identify the circuit in a language model responsible for multi-step planning, extract it, find that it is operating at low rank with substantial redundancy, and publish a paper showing how to compress it. Now anyone training the next generation of models can initialize that circuit more efficiently, or allocate more capacity to the components that matter. Or: you discover that a model’s chain-of-thought reasoning relies on a specific attention pattern that routes information through intermediate tokens in a predictable way, and you publish a detailed mechanistic account of how this works. Now someone building an inference-time scaling pipeline can optimize that routing directly rather than relying on the model to rediscover it from scratch.

This is one of the reasons why I have deliberately chosen to focus my interpretability work on biological foundation models. Although I agree that advancing biology carries its own risks, I believe that, given the current AI risk landscape, we need to push biology forward as fast as we can, and that is what I am trying to do.

Mechanistic Interpretability for Novel Knowledge Discovery

On a more positive and general note: on top of being an auditing/monitoring tool, mechanistic interpretability can be a knowledge discovery tool. Consider:

  1. The model learned something about hematopoiesis that existing bioinformatics methods had not fully captured, at least not in the same compact form.

  2. The interpretability pipeline found a representation that, when extracted and deployed as a standalone algorithm, outperformed established tools on the most biologically meaningful benchmarks.

  3. The knowledge extracted from the model’s internals was new in the operationally relevant sense: nobody had this particular algorithm before, and it works better than what people were using.

Join In

If you like mechanistic interpretability, I encourage you to consider switching from LLMs to biological foundation models.

The work I described here is part of a broader research program on mechanistic interpretability of biological foundation models. Earlier I published a comprehensive stress-test of attention-based interpretability methods on scGPT and Geneformer. In parallel, I developed a sparse autoencoder atlas covering 107,000+ features across all layers of both Geneformer and scGPT. The hematopoietic manifold paper is the latest piece.

There is a lot more to do here, both in terms of applying these methods to other biological systems and developmental processes, and in terms of developing better unsupervised techniques for manifold discovery that could scale beyond what the current semi-supervised approach allows. I think this is one of the best places in the current research landscape to do interpretability work that is simultaneously methodologically interesting, practically useful for biomedicine (and yes, human intelligence amplification), and safe with respect to the capability externalities that worry me about LLM interpretability.

You can find more about the research program, ongoing projects, and ways to get involved at biodynai.com. The full paper with all supplementary materials is on arXiv, and the interactive 3D manifold viewer is at biodyn-ai.github.io/hema-manifold.