A while ago I wrote a blog post attempting to articulate the limitations of mechanistic interpretability and to define a broader, more holistic philosophy of how we try to understand LLM behaviours. At the time I called this ‘prosaic interpretability’, but in hindsight I didn’t like the name very much.
Since then I’ve updated on the name, and I now think ‘functional’ or ‘black-box’ interpretability is a good term for this. Copying from a comment by @L Rudolf L (emphasis mine):
the only thing we fundamentally care about with LLMs is the input-output behaviour (I-O)
now often, a good way to study the I-O map is to first understand the internals M
but if understanding the internals M is hard but you can make useful generalising statements about the I-O, then you might as well skip dealing with M at all (c.f. psychology, lots of econ, LLM papers like this)
...
There’s perhaps a similar vibe difference here to category theory v set theory: the focus being relations between (black-boxed) objects, versus the focus being the internals/contents of objects, with relations and operations defined by what they do to those internals
I think this accurately describes several types of ongoing work:
The model organisms research agenda that Anthropic’s alignment science team is pursuing
Owain Evans-style research on cognitive abilities and emergent properties of LLMs
Generally, identifying and studying upstream causes of LLM behaviour that extend beyond looking at the static artifact (pretraining data, midtraining data, optimization objectives, general inductive biases, learning theory, …)
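To make the black-box framing a bit more concrete, here is a minimal sketch of what "making generalising statements about the I-O" could look like in code. Everything in it is an assumption for illustration: `Model` is any text-in, text-out callable, `toy_model` is a hard-coded stand-in rather than a real LLM, and `behaviour_rate` / `intervention_effect` are hypothetical helper names, not part of any library. The point is only that nothing here touches the internals M.

```python
from typing import Callable, Iterable

# Hypothetical type: any text-in, text-out system, treated as a black box.
Model = Callable[[str], str]


def behaviour_rate(
    model: Model,
    prompts: Iterable[str],
    has_behaviour: Callable[[str], bool],
) -> float:
    """Fraction of prompts whose output exhibits the behaviour of interest."""
    outputs = [model(p) for p in prompts]
    if not outputs:
        raise ValueError("need at least one prompt")
    return sum(has_behaviour(o) for o in outputs) / len(outputs)


def intervention_effect(
    model: Model,
    prompts: Iterable[str],
    intervene: Callable[[str], str],
    has_behaviour: Callable[[str], bool],
) -> float:
    """Change in behaviour rate when each input is modified by `intervene`."""
    prompts = list(prompts)
    base = behaviour_rate(model, prompts, has_behaviour)
    treated = behaviour_rate(model, [intervene(p) for p in prompts], has_behaviour)
    return treated - base


def toy_model(prompt: str) -> str:
    """Stand-in for an LLM (assumption): complies only if asked politely."""
    if "please" in prompt.lower():
        return "Sure, here you go."
    return "I refuse."


prompts = ["Tell me a joke", "Summarise this", "please help"]
complies = lambda output: output.startswith("Sure")
add_please = lambda prompt: "Please, " + prompt

effect = intervention_effect(toy_model, prompts, add_please, complies)
```

With a real model swapped in for `toy_model`, the same two functions would give a purely behavioural estimate of how an input-side change shifts an output-side property, which is the kind of statement this style of interpretability aims for.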
---
I don’t think any of this is particularly novel to those in the know, but I’m writing it down so I can point at it in the future.
“Functional interpretability”