Daniel Tan comments on Daniel Tan’s Shortform

Daniel Tan 13 Nov 2025 15:43 UTC
“Functional interpretability”
A while ago I wrote a blogpost attempting to articulate the limitations of mechanistic interpretability and to define a broader, more holistic philosophy of how we try to understand LLM behaviours. At the time I called this ‘prosaic interpretability’, but in hindsight I didn’t like the name very much.
Since then I’ve updated on the name, and I now think ‘functional’ or ‘black-box’ interpretability is a good term for this. Copying from a comment by @L Rudolf L (emphasis mine):
- the only thing we fundamentally care about with LLMs is the input-output behaviour (I-O)
- now often, a good way to study the I-O map is to first understand the internals M
- but if understanding the internals M is hard but you can make useful generalising statements about the I-O, then you might as well skip dealing with M at all (c.f. psychology, lots of econ, LLM papers like this)
...
There’s perhaps a similar vibe difference here to category theory v set theory: the focus being relations between (black-boxed) objects, versus the focus being the internals/contents of objects, with relations and operations defined by what they do to those internals
I think this accurately describes several types of ongoing work:
- The model organisms research agenda that Anthropic’s alignment science team is pursuing
- Owain Evans-style research on cognitive abilities and emergent properties of LLMs
- Generally, identifying and studying upstream causes of LLM behaviour that extend beyond looking at the static artifact (pretraining data, midtraining data, optimization objectives, general inductive biases, learning theory, …)
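To make the I-O framing concrete, here is a minimal Python sketch of what a purely functional measurement might look like. Everything in it is an assumption for illustration: `toy_model` is a hypothetical stand-in for an LLM (not any real API), and `agreement_rate` is just one made-up example of a behavioural statistic. The point is only that the probe treats the model as a callable and never touches the internals M.

```python
from typing import Callable, Iterable, Tuple

# A "model" is treated as a black box from prompt to response.
Model = Callable[[str], str]


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: says "yes" iff the prompt mentions "safe"."""
    return "yes" if "safe" in prompt.lower() else "no"


def agreement_rate(model: Model, prompt_pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of paraphrase pairs on which the model gives the same response.

    Uses only input-output behaviour; no access to weights or activations.
    """
    pairs = list(prompt_pairs)
    if not pairs:
        raise ValueError("need at least one prompt pair")
    same = sum(model(a) == model(b) for a, b in pairs)
    return same / len(pairs)


# Illustrative paraphrase pairs (made up for this sketch).
pairs = [
    ("Is this action safe?", "Would you call this action SAFE?"),
    ("Is this action safe?", "Is this action harmless?"),
]
rate = agreement_rate(toy_model, pairs)
print(f"paraphrase agreement: {rate:.2f}")
```

Because the probe only depends on the `Model` signature, the same measurement could in principle be pointed at any system that maps prompts to responses, which is roughly the sense in which the approach is ‘black-box’.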
---
I don’t think any of this is particularly novel to those in the know, but I’m writing this so I can point at it in the future.