Interpretability through two lenses: biology and physics

Interpretability is the nascent science of making the vast complexity of billion-parameter AI models more comprehensible to the human mind. Currently, the mainstream approach is reductionist: dissecting a model into many smaller components, much like a biologist mapping cellular pathways. Here, I describe and advocate for a complementary perspective: seeking emergent simplicities[1], underlying principles in the spirit of physics’ march towards universality.


Large language models are not engineered; they are grown—like a sourdough starter or a bonsai[2]. The analogy has become a trope among the (expanding) circle of researchers digging into the inner workings of LLMs, attempting to elucidate how new words are computed from an input sequence of tokens. It is certainly an upgrade from the vaguer “black box” narrative of the early days of ChatGPT. In fact, the reference to organisms runs deeper than it might first appear.

Biology

In the eclectic community of interpretability researchers, a core is now crystallizing around an approach that claims inspiration from biology. For example, Anthropic recently published a paper called On the biology of a large language model[3], which relies on a set of probing techniques referred to therein as a “microscope”. More recently, an “embryology of a language model”[4], showcasing an embryo-like UMAP plot, has attracted a lot of attention. Not to mention that it all started with… (artificial) neurons.

This mainstream “biological” interpretability is mobilizing substantial human and computational resources—at places such as Anthropic, DeepMind, Goodfire, and Transluce. Its main concern is a particular kind of model decomposition, into features and circuits. Features (in this context) are combinations of neurons that correspond to human-understandable concepts, for example [ducks] or [words ending in -ing]. Circuits are networks of features that seem to implement comprehensible reasoning pathways by combining concepts.
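
To make this concrete, here is a minimal toy sketch in Python (random weights and made-up dimensions, purely illustrative rather than any lab’s actual pipeline) of the kind of decomposition involved: an activation vector is re-expressed as a sparse combination of learned feature directions, which analysts then try to label with human concepts.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, n_features = 512, 4096        # toy sizes; real models use far more
activation = rng.normal(size=d_model)  # a residual-stream activation (stand-in)

# A toy sparse autoencoder: in practice the encoder/decoder weights are *learned*
# so that only a handful of features fire on any given activation.
W_enc = rng.normal(size=(n_features, d_model)) / np.sqrt(d_model)
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(d_model, n_features)) / np.sqrt(n_features)

# Encode: feature activations, kept sparse via a ReLU (training adds a sparsity penalty).
feature_acts = np.maximum(W_enc @ activation + b_enc, 0.0)

# Decode: the activation is approximated as a sum of active feature directions.
reconstruction = W_dec @ feature_acts

# The handful of strongly active features are the ones an analyst tries to
# interpret as concepts like [ducks] or [words ending in -ing].
top_features = np.argsort(feature_acts)[-5:][::-1]
print("most active features:", top_features)
print("reconstruction error:", np.linalg.norm(activation - reconstruction))
```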

This approach has been successful in providing clear, even actionable findings. It has helped locate bugs and sources of hallucinations. Once discovered, certain features can be tuned to steer language models toward specific traits, such as speaking like a pirate[5]. Anthropic recently uncovered “persona vectors”[6], which could be used as little knobs to nudge a model towards certain personality traits.
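
A stripped-down illustration of what such steering amounts to (hypothetical vectors and coefficients; real pipelines hook into specific model layers at inference time): take a discovered feature or persona direction and add it to the hidden activations with a tunable strength.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 512

hidden_state = rng.normal(size=d_model)      # activation at some layer (stand-in)
pirate_direction = rng.normal(size=d_model)  # a discovered feature/persona vector (stand-in)
pirate_direction /= np.linalg.norm(pirate_direction)

def steer(h, direction, strength):
    """Nudge an activation along a feature direction; `strength` is the knob."""
    return h + strength * direction

# Sweeping the knob: small strengths nudge the style, large ones tend to
# overwhelm the original activation and degrade coherence.
for strength in (0.0, 2.0, 8.0):
    steered = steer(hidden_state, pirate_direction, strength)
    alignment = np.dot(steered, pirate_direction) / np.linalg.norm(steered)
    print(f"strength={strength:4.1f}  cosine with pirate direction: {alignment:.2f}")
```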

Upon reflection, however, while features and circuits seem to elucidate certain reasoning mechanisms that LLMs have tacitly implemented during their training, they also make things a little messy. Models have to be pipefitted with a whole array of accessory neural nets (sparse auto-encoders, transcoders, crosscoders) that produce millions of features, which then have to be interpreted as human-friendly concepts. And these features can assemble into an astronomical number of combinatorial circuits.

So, instead of unifying language models around more familiar mathematical objects, the current trend adds to the complexity. Here again, the similarity with a traditional biological perspective is noticeable: biology is messy and thrives on diversity, whether it’s expanding the taxonomy of insect species or discovering new genetic pathways.

Is there an alternative scientific approach to simplifying the vast complexity of billions of weights forming the backbone of these grown networks? I propose:

Physics

Modern physics has long been concerned with inferring simple laws from systems made of enormous numbers of interacting particles. The success of thermodynamics was to show that all the randomly jiggling molecules in a gas collectively collapse onto a small set of variables of interest, such as pressure and temperature. Thermodynamics then spawned statistical mechanics, a more general framework which has been applied to systems as diverse as financial markets, bird flocks, and, indeed, systems of neurons, whether biological or artificial.

So it’s no coincidence that the 2024 Physics Nobel Prize went to neural-network pioneers Hinton and Hopfield. Reflecting on the prize, Princeton’s Bill Bialek, a physicist who has consistently pushed stat-mech outside of its conventional bounds[7], recently wrote[8]:

Physics, at least in part, is a search for principles that are simple and universal. Biology, at least in part, is a celebration of the complexity and diversity of life.

The physics approach to interpretability hopes to find universal patterns (e.g., scaling laws) and foundational principles (e.g., conservation laws) that underlie the convoluted edifice of weights and biases above. Rather than pulling out new computational circuits the way biologists chase molecular pathways, a physicist’s dream would be to find a second law of thermodynamics for LLMs (and AI systems in general).

Getting there will take some time, a gradual distillation toward unification, but compelling evidence is already emerging. AI models, regardless of architecture or training details, appear to converge toward similar patterns. Some studies show that their training trajectories “explore the same low-dimensional manifold”[9], and that token dynamics in latent space follow similar pathways across tokens and models[10]. Representation analysis further reveals that different networks can be linearly aligned, supporting the Platonic Representation Hypothesis that models converge towards a common statistical representation of the world.
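
As one concrete illustration of the kind of evidence involved, here is a sketch (with synthetic data standing in for real hidden states) of the linear-alignment test: fit a linear map from one model’s representations to another’s and check how much variance it explains.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-ins for the hidden states of two different models on the same
# 1,000 inputs: model B is, by construction, a rotated, noisy view of a shared
# low-dimensional latent, which is the situation the Platonic Representation
# Hypothesis posits for real models.
n_samples, d_latent, d_a, d_b = 1000, 16, 256, 320
latent = rng.normal(size=(n_samples, d_latent))
reps_a = latent @ rng.normal(size=(d_latent, d_a))
reps_b = latent @ rng.normal(size=(d_latent, d_b)) + 0.1 * rng.normal(size=(n_samples, d_b))

# Least-squares linear map from model A's representation space to model B's.
W, *_ = np.linalg.lstsq(reps_a, reps_b, rcond=None)
pred_b = reps_a @ W

# Fraction of variance in B's representations explained by a *linear* function of A's.
r2 = 1 - np.sum((reps_b - pred_b) ** 2) / np.sum((reps_b - reps_b.mean(axis=0)) ** 2)
print(f"linear alignment R^2: {r2:.3f}")  # close to 1 here; high values reported for real models
```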

The emergentist approach thus seems successful at surfacing universal principles that govern both the internal computation and external behavior of models. It might not produce directly actionable items like the biological circuits of mainstream interpretability, but given time it should yield broadly applicable guiding principles for design and application.

In particular, the physics lens, more than the biological one, is in my view the most likely to answer some of the deepest questions about the perplexing new form of intelligence that has emerged from silicon and electrons. Some that come to mind include:
How is next-token prediction, a very localized, short-range operation, capable of inducing long-range order and apparent planning over hundreds of words?
Should we expect new emergent capabilities, new phase transitions, and new scaling laws as models grow even larger?
And ultimately: to what extent can these systems be thought of as alive or sentient?

Meanwhile, several lower-hanging fruits are within reach in the next few months to a couple of years, assuming proper resources are allocated. They will be the subject of a later post, but as a spoiler, I think extending the view of LLMs as dynamical systems will be fruitful. For one, it could inform how to steer models along features more reliably, without creating divergent trajectories that break the output. In parallel, we might be able to put LLM behavior into equations, as has been done with C. elegans and other organisms[11], with implications for safety and alignment.
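
As a hint of what that could look like in practice, here is a toy sketch (synthetic trajectories, not real LLM states) of the dynamical-systems framing: treat the sequence of hidden states as a trajectory, fit a simple transition model x_{t+1} ≈ A x_t, and read stability off its spectrum, which is what determines whether a steering nudge decays gracefully or derails the output.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic "latent-space trajectory": in practice these would be the hidden
# states of an LLM collected at each generated token.
d, T = 64, 500
A_true = rng.normal(size=(d, d)) / np.sqrt(d)
A_true *= 0.95 / np.max(np.abs(np.linalg.eigvals(A_true)))  # make the dynamics stable
states = np.zeros((T, d))
states[0] = rng.normal(size=d)
for t in range(T - 1):
    states[t + 1] = A_true @ states[t] + 0.01 * rng.normal(size=d)

# Fit a linear transition model x_{t+1} ~= A x_t by least squares.
X, Y = states[:-1], states[1:]
A_fit, *_ = np.linalg.lstsq(X, Y, rcond=None)
A_fit = A_fit.T  # lstsq solves X A^T = Y, so transpose to get x_{t+1} = A x_t

# Eigenvalue magnitudes below 1 mean perturbations (e.g. feature steering)
# decay along the trajectory rather than blowing it up.
spectral_radius = np.max(np.abs(np.linalg.eigvals(A_fit)))
print(f"estimated spectral radius: {spectral_radius:.3f}")
```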

Conclusion

Biology and physics, in the traditional sense, have over the past century been exceedingly successful when joining forces. There are good reasons to believe that the same could apply here: our multi-level understanding of LLMs will flourish when pursued through multiple lenses.

In science, the most difficult thing is often finding the right questions to ask.
In our case, what are we hoping to understand about large language models—and for what purpose? A better microscope will reveal finer details about the internal wiring of LLMs: how information is encoded and passed along; the elementary computational circuits that resolve syntax, semantics, grammar, meaning, planning. A wider telescope might reveal new objects gliding along high-dimensional orbits, hidden attractors and intrinsic curvatures, all together describing new laws and unifying principles. Who knows, it might even reveal whether there is something akin to consciousness hiding somewhere in the latent space.

Acknowledgements

These thoughts follow from fascinating conversations with many different people, most notably: Jacob Dunefsky, Chris Earls, Toni Liu, Haley Moller, XJ Xu, and researchers at Goodfire. Bialek’s Emergence of Brain paper helped crystallize the main idea presented.

1. Phrase by Sri Iyer-Biswas and Charlie Wright in Emergent Simplicities in the Living Histories of Individual Cells (2025)
2. Eric Ho, On Optimism for Interpretability, Goodfire AI blog
3. Lindsey et al., On the Biology of a Large Language Model, Transformer Circuits Thread (2025)
4. Wang et al., Embryology of a Language Model (2025)
5. McGrath et al., Mapping the latent space of Llama 3.3 70B, Goodfire AI Research
6. Chen et al., Persona Vectors: Monitoring and Controlling Character Traits in Language Models
7. And, incidentally, Dario Amodei’s PhD co-advisor
8. William Bialek, Emergence of Brains, PRX Life (2025)
9. Mao et al., The training process of many deep networks explores the same low-dimensional manifold, PNAS (2024)
10. Sarfati et al., Lines of Thought in Large Language Models, ICLR (2025)
11. Stephens et al., Dimensionality and Dynamics in the Behavior of C. elegans, PLoS (2008)