Extracting Performant Algorithms Using Mechanistic Interpretability

The Tree of Life Inside a DNA Language Model

Earlier this year, researchers at Goodfire AI took Evo 2, a genomic foundation model, and found the evolutionary tree of life encoded in it. Not metaphorically. The phylogenetic relationships between thousands of species lived as a curved manifold in the model’s internal activations, with geodesic distances along that manifold tracking actual evolutionary branch lengths. Bacteria that diverged hundreds of millions of years ago sat far apart on the manifold; closely related species sat nearby.

Nobody told Evo 2 about evolution or gave it a phylogenetic tree as a training signal. The model was trained to predict the next DNA token, and because understanding evolutionary relationships turned out to be useful for predicting DNA well, the model built a structured geometric representation of those relationships as part of its internal computation. The representation was good enough that you could extract it with interpretability tools and compare it meaningfully to the ground truth.

I saw this and thought: okay, what happens if I do the same thing to single-cell foundation models?

If Evo 2 learned the tree of life from raw DNA, what did scGPT learn about how human cells develop?

Finding the Manifold

For those unfamiliar with the biology: scGPT is a transformer trained on millions of single-cell gene expression profiles. Each cell in your body expresses thousands of genes at varying levels, and a single-cell RNA sequencing experiment measures those expression levels for potentially hundreds of thousands of individual cells simultaneously. scGPT was pre-trained on this data in a generative fashion, learning to predict masked gene expression values from context.

The question I wanted to answer: does scGPT encode, somewhere in its attention tensor, a compact geometric representation of some biological process? And if so, can I find it without knowing in advance exactly where to look?

I attacked this with a two-phase research loop driven by an AI executor-reviewer pair operating under pre-registered quality gates. Phase 1 was a broad hypothesis search: the loop explored a large combinatorial space of candidate manifold hypotheses by varying the biological target (developmental ordering, regulatory structure, communication geometry), the featurization strategy (attention drift, raw embeddings, mixed operators), and the geometric fitting method (Isomap, geodesic MDS, Locally Euclidean Transformations), all applied across the full 12-layer × 8-head scGPT attention tensor. That is 96 individual attention units to screen.
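
To make the screening step concrete, here is a minimal sketch of one scoring pass in the style described above: fit a manifold to per-cell features from one attention unit and score how well the leading coordinate tracks a candidate biological ordering. Everything here is an illustrative assumption, not the actual pipeline: the data is synthetic, only a 2×2 subset of units is scanned to keep it fast, and a single absolute Spearman correlation stands in for the real quality gates.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.manifold import Isomap

rng = np.random.default_rng(0)
n_cells, feat_dim = 200, 32

# Stand-in for attention-derived per-cell features, keyed by (layer, head).
# The real scan covers the full 12 layers x 8 heads = 96 units.
features = {(layer, head): rng.normal(size=(n_cells, feat_dim))
            for layer in range(2) for head in range(2)}

# Stand-in for a candidate target, e.g. known differentiation depth per cell.
depth = rng.uniform(size=n_cells)

def score_unit(X, target, n_components=8):
    """Fit Isomap and score |Spearman rho| between the first manifold
    coordinate and the candidate ordering. A generous neighbor count
    keeps the neighborhood graph connected on this synthetic data."""
    coords = Isomap(n_components=n_components, n_neighbors=20).fit_transform(X)
    rho, _ = spearmanr(coords[:, 0], target)
    return abs(rho)

scores = {unit: score_unit(X, depth) for unit, X in features.items()}
best_unit = max(scores, key=scores.get)
print(best_unit, round(scores[best_unit], 3))
```

On synthetic noise the winning unit is meaningless; the point is only the shape of the loop: featurize, fit, score, rank, repeat across the tensor.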

What came out of Phase 1 was a robust positive hit: hypothesis H65, which identified a compact, roughly 8-to-10-dimensional manifold in specific attention heads where positions along the manifold corresponded to how far cells had progressed through hematopoietic differentiation. Stem cells clustered at one end. Terminally differentiated blood cell types (T cells, B cells, monocytes, macrophages) spread out along distinct branches at the other end. The branching topology matched the known developmental hierarchy with statistically significant branch structure that held up under stringent controls.

Phase 2 was more manual: methodological closure tests, confidence intervals, structured holdouts, and external validation. I validated the manifold on a non-overlapping panel from Tabula Sapiens and then confirmed it via frozen-head zero-shot transfer to an entirely independent multi-donor immune panel. You can explore this manifold yourself and compare different extraction variants in an interactive 3D viewer.

But Does the Extracted Algorithm Actually Work?

Finding a biologically meaningful manifold inside a foundation model is cool. I want to be clear that I do think it is genuinely cool. But the question I actually cared about was whether you can take this geometric object out of the model and use it as a standalone method that does useful work in the world.

I developed a three-stage extraction pipeline. First, I directly exported the frozen attention weight matrices from the relevant heads, with no retraining, just literally reading out the learned linear operator. Second, I attached a lightweight learned adaptor that projects the raw attention output into the manifold’s coordinate system. Third, I added a task-specific readout head for classification or pseudotime prediction. The critical property of this pipeline is that the heavy lifting, the actual biological knowledge, comes entirely from the frozen attention weights that scGPT learned during pre-training. The adaptor and readout are small and cheap to train, and they never touch the original dataset the model was pre-trained on. What you end up with is a standalone algorithm you can ship as a file and run independently of scGPT.
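
The three stages can be sketched in a few lines. This is a toy illustration, not the shipped algorithm: the "frozen" operator and the adaptor are random matrices here (in the real pipeline the operator is exported from scGPT and the adaptor is trained), and the dimensions and labels are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
d_model, d_manifold, n_cells = 64, 10, 300

# Stage 1: frozen attention weight matrix. Random here; in the real
# pipeline this is read directly out of scGPT and never retrained.
W_frozen = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)

# Synthetic cell representations and binary labels standing in for data.
X = rng.normal(size=(n_cells, d_model))
y = (X[:, 0] > 0).astype(int)

# Stage 2: lightweight adaptor projecting the attention output into the
# manifold's ~10 coordinates. A fixed random projection stands in for
# the small learned adaptor.
A = rng.normal(size=(d_model, d_manifold)) / np.sqrt(d_model)

def extract_coords(X):
    # Frozen operator first, then the adaptor; neither touches scGPT again.
    return (X @ W_frozen) @ A

# Stage 3: cheap task-specific readout trained on manifold coordinates.
readout = LogisticRegression(max_iter=1000).fit(extract_coords(X), y)
print(round(readout.score(extract_coords(X), y), 3))
```

The structural point survives the toy setting: once `W_frozen` and `A` are saved to disk, the whole thing is a pair of matrix multiplications plus a tiny classifier, independent of the foundation model.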

So how does it perform?

I benchmarked the extracted algorithm against a lineup of methods that biologists actually use in practice: scVI (a deep generative model for single-cell data), Palantir (a pseudotime method based on diffusion maps and Markov chains), Diffusion Pseudotime (the Scanpy implementation), CellTypist (a logistic-regression-based cell type classifier trained on a large reference atlas), PCA, and raw-expression baselines. These are the standard tools in the single-cell bioinformatics toolkit, developed and refined by domain experts over years.

On pseudotime-depth ordering, which measures how well a method recovers the true developmental progression from stem cells to mature blood cells, the extracted algorithm was the best performer, significantly outperforming every tested alternative in paired split-level statistics. On classification, the picture was less clean but still strong: the extracted head led on branch balanced accuracy and on key subtype discrimination tasks like CD4/CD8 T cell separation and monocyte/macrophage distinction. On some stage-level and branch-level macro-F1 metrics, diffusion-style baselines or raw expression had the edge. Not a clean sweep. But the extracted algorithm is solidly top-tier across the board, and dominant on the most biologically meaningful endpoint.

Now, the obvious objection: maybe this is just the result of having a fancier probe? Maybe any sufficiently flexible function fitted on top of scGPT’s embeddings would do equally well, and the “manifold discovery” part is doing no real work. I tested this. A 3-layer MLP with 175,000 trainable parameters, fitted on frozen scGPT average-pooled embeddings, was significantly worse than the extracted 10-dimensional head on 6 out of 8 classification endpoints. And the extracted head accomplished this while being 34.5× faster to evaluate across a full 12-split campaign, with roughly 1,000× fewer trainable parameters.
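
For a sense of scale, here is roughly what such a probe baseline looks like. The hidden widths below are illustrative guesses, not the exact 175,000-parameter configuration from the experiment, and the data is random; the sketch just shows where the parameter count of an MLP-on-embeddings probe comes from.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 512))   # stand-in for frozen average-pooled embeddings
y = rng.integers(0, 8, size=400)  # stand-in for 8 cell-type labels

# Two hidden layers plus the output layer; widths chosen for illustration.
probe = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=50).fit(X, y)

# Trainable parameters: all weight matrices plus all bias vectors.
n_params = sum(w.size for w in probe.coefs_) + sum(b.size for b in probe.intercepts_)
print(n_params)  # 512*128 + 128*64 + 64*8 + 128 + 64 + 8 = 74,440
```

Even this modest probe carries tens of thousands of parameters, which is the contrast the paragraph above is drawing against a 10-dimensional extracted head.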

I want to make sure the implication here is not lost: the geometric structure that mechanistic interpretability found inside scGPT’s attention heads, when extracted and used directly, outperforms the standard approach of slapping an MLP on top of the model’s embeddings. The interpretability-derived method is simultaneously more accurate, faster, and smaller.

How Small Can You Go?

Once you have an extracted algorithm that works, the natural next question is how much of it you actually need. Compression is interesting for practical reasons, but it is even more interesting as an interpretability question, because the further you compress an algorithm while preserving its performance, the closer you get to understanding what the algorithm is actually doing.

The initial extracted operator pooled three attention heads from scGPT and weighed in at 17.5 MB. Not large by modern standards, but not trivially inspectable either. The first compression step was to ask whether we really need all three heads or whether a single one carries the essential geometry. I scanned all 96 attention units in scGPT’s tensor and found that a single unit, Layer 2 Head 5, carried substantial transferable developmental geometry on its own. The compact operator built from this single head weighed 5.9 MB and showed almost no loss compared to the three-head version on the benchmark suite.

The second compression step was more aggressive: truncated SVD on the single-head operator. This factors the weight matrix into low-rank components and throws away everything below a chosen rank threshold. At rank 64, the resulting surrogate shrinks to 0.73 MB, and it still beats the frozen scGPT average-pool + MLP baseline on all eight pooled classification endpoints. It does incur statistically significant losses versus the dense single-head operator on 5 out of 8 endpoints, so this is not free compression, but the rank-64 version is still a better algorithm than the standard probing approach, at a fraction of a megabyte.
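
The truncation step itself is standard linear algebra. A minimal sketch, using a random matrix as a stand-in for the single-head operator (the matrix size here is an assumption; only the rank-64 cutoff comes from the text):

```python
import numpy as np

rng = np.random.default_rng(2)
d, rank = 512, 64
W = rng.normal(size=(d, d))  # stand-in for the dense single-head operator

def truncate(W, rank):
    """Keep only the top-`rank` singular components of W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :rank] @ np.diag(s[:rank]) @ Vt[:rank, :]

W_r = truncate(W, rank)

# Storage drops from d*d dense entries to rank*(2*d + 1) factor entries
# if you keep U, s, Vt instead of the reconstructed matrix.
dense, factored = W.size, rank * (2 * d + 1)
print(dense, factored)  # 262144 vs 65600: roughly a 4x reduction here
```

The actual size reduction depends on the operator's true dimensions; the point is that everything below the rank threshold is simply discarded, and the surviving factors become the objects you can audit one by one.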

And here is where the interpretability payoff arrives. I ran a factor ablation audit on the rank-64 surrogate: systematically removing each of the 64 factors one at a time, measuring how much performance drops, and ranking them by necessity. Just four factors, out of 64, accounted for 66% of the total pooled ablation impact. And when I examined what those four factors corresponded to biologically, they resolved into explicit hematopoietic gene programs.
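
The ablation loop can be sketched as follows. This is a synthetic illustration of the procedure, not the real audit: the operator is random, and a reconstruction-error proxy stands in for the benchmark performance that was actually measured.

```python
import numpy as np

rng = np.random.default_rng(3)
d, rank = 128, 64

# Rank-64 factorization standing in for the compressed surrogate operator.
U, s, Vt = np.linalg.svd(rng.normal(size=(d, d)), full_matrices=False)
U, s, Vt = U[:, :rank], s[:rank], Vt[:rank, :]

X = rng.normal(size=(500, d))
target = X @ (U @ np.diag(s) @ Vt)  # proxy for the surrogate's full behavior

def performance(keep):
    """Negative reconstruction error using only the kept factors."""
    W = U[:, keep] @ np.diag(s[keep]) @ Vt[keep, :]
    return -np.mean((X @ W - target) ** 2)

base = performance(np.ones(rank, dtype=bool))
drops = []
for i in range(rank):
    keep = np.ones(rank, dtype=bool)
    keep[i] = False                      # ablate factor i alone
    drops.append(base - performance(keep))

ranking = np.argsort(drops)[::-1]        # most necessary factors first
print([int(j) for j in ranking[:4]])
```

Ranking factors by the size of the drop is what surfaces the handful that carry most of the impact; in the real audit, those top factors are the ones that resolved into hematopoietic gene programs.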

So we went from an entire foundation model to four interpretable factors that explain the majority of the extracted algorithm’s performance on blood cell development. That is the kind of compression ratio where you stop saying “we found a useful representation” and start saying “we might actually understand what the model learned.”

The Dual-Use Problem Is No Longer Theoretical

I want to step back from the specific results and state what I think the general lesson is, because I believe it generalizes well beyond blood cells and single-cell genomics.

The very property that makes this result interesting is also the property that makes me cautious about applying the same techniques to large language models. The argument runs in both directions. If you can extract an algorithm that a model uses to do something well, you can potentially also improve how it does that thing: by identifying inefficient components, scaling the relevant circuits, composing extracted subroutines in new ways, or replacing the fuzzy learned version with a cleaner extracted version and freeing up capacity for the model to learn something else. Mechanistic interpretability, in other words, is becoming a capability amplification tool.

This has been a well-known theoretical concern for a while, but I think the results I am reporting here make it look increasingly practical. Let me sketch two scenarios that should make the worry concrete.

First scenario: you identify the circuit in a language model responsible for multi-step planning, extract it, find that it operates at low rank with substantial redundancy, and publish a paper showing how to compress it. Now anyone training the next generation of models can initialize that circuit more efficiently, or allocate more capacity to the components that matter. Second scenario: you discover that a model’s chain-of-thought reasoning relies on a specific attention pattern that routes information through intermediate tokens in a predictable way, and you publish a detailed mechanistic account of how this works. Now someone building an inference-time scaling pipeline can optimize that routing directly rather than relying on the model to rediscover it from scratch.

This is one of the reasons I have deliberately chosen to focus my interpretability work on biological foundation models. I am aware that pushing biology forward carries its own risk profile, but given the current AI risk landscape, I think accelerating biomedical capability is one of the more defensible things a person can do, and it is what I am trying to do.

Mechanistic Interpretability as a Knowledge Discovery Tool

On a more positive and general note: beyond its role as an auditing and monitoring tool, mechanistic interpretability can serve as a knowledge discovery tool in its own right. The model learned something about hematopoiesis that existing bioinformatics methods had not fully captured, at least not in the same compact form. The interpretability pipeline found a representation that, when extracted and deployed as a standalone algorithm, outperformed established tools on the most biologically meaningful benchmarks. The knowledge extracted from the model’s internals was new in the operationally relevant sense: nobody had this particular algorithm before, and it works better than what people were using.

I think this is underappreciated. There is a strong tendency in the interpretability community to frame the work as “we want to understand what models are doing so we can verify they are doing it safely,” and that framing is important and correct. But there is a second framing that deserves equal weight: models trained on large scientific datasets may have learned things that we have not learned yet, and interpretability is the tool that lets us go in and read out what those things are. The manifold I extracted from scGPT is a case study in exactly this: a piece of biological knowledge that existed inside a model and was not available to the field until interpretability methods pulled it out and made it usable.

Join In

If you are doing mechanistic interpretability on LLMs and you have been looking for a way to do work that is simultaneously methodologically interesting, practically useful for biomedicine (and yes, human intelligence amplification), and safe with respect to the capability externalities that worry me about LLM interpretability, I genuinely encourage you to consider switching to biological foundation models.

The work I described here is part of a broader research program on mechanistic interpretability of biological foundation models. Earlier I published a comprehensive stress-test of attention-based interpretability methods on scGPT and Geneformer. In parallel, I developed a sparse autoencoder atlas covering 107,000+ features across all layers of both Geneformer and scGPT. The hematopoietic manifold paper is the latest piece.

There is a lot more to do, both in applying these methods to other biological systems and developmental processes, and in developing better unsupervised techniques for manifold discovery that could scale beyond what the current semi-supervised approach allows. I think this is one of the best places in the current research landscape to do interpretability work that matters on all three axes at once: intellectual interest, practical impact, and safety considerations.

You can find more about the research program, ongoing projects, and ways to get involved at biodynai.com. The full paper with all supplementary materials is on arXiv, and the interactive 3D manifold viewer is at biodyn-ai.github.io/hema-manifold.
