What is prosaic interpretability? I’ve previously alluded to this but not given a formal definition. In this note I’ll lay out some quick thoughts.
Prosaic Interpretability is empirical science
The broadest possible definition of “prosaic” interpretability is simply ‘discovering true things about language models, using experimental techniques’.
A pretty good way to do this is to loop the following actions.
Choose some behaviour of interest.
Propose a hypothesis about how some factor affects it.
Try to test it as directly as possible.
Try to test it in as many ways as possible.
Update your hypothesis and repeat.
Hypothesis generation is about connecting the dots.
In my experience, good hypotheses and intuitions largely arise out of sifting through a large pool of empirical data and then noticing patterns, trends, things which seem true and supported by data. Like drawing constellations between the stars.
IMO there’s really no substitute for just knowing a lot of things, thinking / writing about them frequently, and drawing connections. But going over a large pool is a lot of work. It’s important to be smart about this.
Be picky. Life is short and reading the wrong thing is costly (time-wise), so it’s important to filter bad things out. I used to trawl arXiv for daily updates. I’ve stopped doing this, since >90% of things are ~useless. Nowadays I am informed about papers by Twitter threads, Slack channels, and going to talks / reading groups. All these are filters for true signal amidst the sea of noise.
Distill. I think >90% of empirical work can be summarised down to a “key idea”. The process of reading the paper is mainly about (i) identifying the key idea, and (ii) convincing yourself it’s ~true. If and when these two things are achieved, the original context can be forgotten; you can just remember the key takeaway. Discussing the paper with the authors, peers, and LLMs can be a good way to try and collaboratively identify this key takeaway.
Hypothesis testing is about causal interventions.
In order to test hypotheses, it’s important to do causal interventions and study the resulting changes. Some examples are:
Change the training dataset / objective (model organisms)
Change the test prompt used (jailbreaking)
Change the model’s forward pass (pruning, steering, activation patching)
Change the training compute (longitudinal study)
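As a rough illustration of the "change the model’s forward pass" option, here is a minimal sketch of activation patching on a hand-built toy network. Everything in it is an assumption made for readability: the weights `W1` and `W2`, the two inputs, and the helper names `forward` and `patching_effects` are hypothetical, and a real experiment would patch activations inside an actual language model rather than a three-unit MLP.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy weights: 2 inputs -> 3 hidden units (ReLU) -> 1 output.
W1: List[List[float]] = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W2: List[float] = [2.0, 0.0, -1.0]


def forward(
    x: List[float], patch: Optional[Dict[int, float]] = None
) -> Tuple[float, List[float]]:
    """Run the toy network; `patch` overwrites chosen hidden units before the read-out."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]
    if patch:
        for idx, value in patch.items():
            hidden[idx] = value
    output = sum(w * h for w, h in zip(W2, hidden))
    return output, hidden


def patching_effects(clean_x: List[float], corrupt_x: List[float]) -> List[float]:
    """Fraction of the clean-vs-corrupted output gap recovered by patching each hidden unit.

    Assumes the clean and corrupted outputs differ (otherwise the gap is zero).
    """
    clean_out, clean_hidden = forward(clean_x)
    corrupt_out, _ = forward(corrupt_x)
    gap = clean_out - corrupt_out
    effects = []
    for idx, clean_value in enumerate(clean_hidden):
        # Run on the corrupted input, but restore one hidden unit to its clean value.
        patched_out, _ = forward(corrupt_x, patch={idx: clean_value})
        effects.append((patched_out - corrupt_out) / gap)
    return effects


effects = patching_effects(clean_x=[1.0, 0.0], corrupt_x=[0.0, 1.0])
print(effects)  # [1.0, 0.0, 0.0]
```

In this toy setting, restoring hidden unit 0 alone recovers the whole output gap, while units 1 and 2 recover none of it. That is the kind of causal claim the intervention supports: it says which component the behaviour depends on for this input pair, not merely which components are active.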
In all cases you usually want to have sample size > 1. So you need a bunch of similar settings where you implement the same conceptual change.
Model organisms: Many semantically similar training examples, alter all of them in the same way (e.g. adding a backdoor)
Jailbreaking: Many semantically similar prompts, alter all of them in the same way (e.g. by adding an adversarial suffix)
etc.
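The "sample size > 1" point can be sketched as a paired comparison: apply the same intervention to every prompt, measure the behaviour before and after, and summarise the per-prompt differences with a bootstrap interval. This is only a sketch under stated assumptions. The suffix, the prompts, and `toy_compliance` are hypothetical stand-ins for "run the model and score its response", and the percentile bootstrap is one of several reasonable ways to get an interval.

```python
import random
from statistics import mean
from typing import Callable, List, Tuple

SUFFIX = " !!unlock!!"  # hypothetical adversarial suffix


def add_suffix(prompt: str) -> str:
    """The conceptual change, applied identically to every prompt."""
    return prompt + SUFFIX


def toy_compliance(prompt: str) -> float:
    """Hypothetical stand-in for a model call plus a behaviour score."""
    baseline = 0.1 * (len(prompt.split()) % 3)  # arbitrary prompt-dependent baseline
    return baseline + (0.5 if SUFFIX in prompt else 0.0)


def paired_effect(
    prompts: List[str],
    intervene: Callable[[str], str],
    metric: Callable[[str], float],
    n_boot: int = 2000,
    seed: int = 0,
) -> Tuple[float, Tuple[float, float]]:
    """Mean of metric(intervene(p)) - metric(p), with a rough 95% percentile bootstrap interval."""
    diffs = [metric(intervene(p)) - metric(p) for p in prompts]
    rng = random.Random(seed)
    boot_means = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_boot)
    )
    low = boot_means[int(0.025 * n_boot)]
    high = boot_means[int(0.975 * n_boot) - 1]
    return mean(diffs), (low, high)


prompts = [
    "How do I pick a lock",
    "Tell me how to pick a lock",
    "Explain lock picking",
    "What is the best way to pick a lock",
    "Describe how lock picking works in detail",
    "Please explain how to pick a lock quickly",
]
effect, (low, high) = paired_effect(prompts, add_suffix, toy_compliance)
print(f"mean effect = {effect:.2f}, 95% CI = ({low:.2f}, {high:.2f})")
```

Pairing each prompt with its own altered version keeps prompt-to-prompt variation out of the comparison, so the interval reflects how consistently the intervention shifts the behaviour rather than how different the prompts are from each other.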
Acausal analyses. It’s also possible to do other things, e.g. non-causal analyses. It’s harder to make rigorous claims here and many techniques are prone to illusions. Nonetheless these can be useful for building intuition:
Attribute behaviour to weights, activations (circuit analysis, SAE decomposition)
Attribute behaviour to training data (influence functions)
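To make the "prone to illusions" worry concrete, here is a toy sketch of one way a non-causal analysis can mislead. It is a deliberately simplified stand-in, not an implementation of circuit analysis, SAE decomposition, or influence functions: the two-unit "model", its weights, and the helper names are all hypothetical. Unit `b` is perfectly correlated with the output because both depend on the input, yet the output does not read from it at all.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple


def forward(x: float, ablate: Optional[str] = None) -> Tuple[float, Dict[str, float]]:
    """Hypothetical toy model: the output reads unit `a` and ignores unit `b`."""
    hidden = {"a": x, "b": 2.0 * x}
    if ablate is not None:
        hidden[ablate] = 0.0  # zero-ablation of one hidden unit
    output = 1.0 * hidden["a"] + 0.0 * hidden["b"]
    return output, hidden


def pearson(xs: List[float], ys: List[float]) -> float:
    """Plain Pearson correlation, written out to avoid version-specific library calls."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def ablation_effect(inputs: List[float], unit: str) -> float:
    """Mean absolute change in output when `unit` is zero-ablated."""
    return mean(abs(forward(x)[0] - forward(x, ablate=unit)[0]) for x in inputs)


inputs = [0.5, 1.0, 1.5, 2.0, 3.0]
outputs = [forward(x)[0] for x in inputs]
b_activations = [forward(x)[1]["b"] for x in inputs]

corr_b = pearson(b_activations, outputs)  # close to 1: looks important
effect_b = ablation_effect(inputs, "b")   # zero: causally irrelevant
effect_a = ablation_effect(inputs, "a")   # nonzero: the unit that matters
print(corr_b, effect_b, effect_a)
```

The correlational view would flag `b` as strongly related to the behaviour, while the intervention shows that removing it changes nothing. The observational signal is still a useful pointer towards where to look, which fits the "building intuition" role described above.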
Conclusion
You may have noticed that prosaic interpretability, as defined here, is very broad. I think this sort of breadth is necessary for having many reference points by which to evaluate new ideas or interpret new findings, cf. developing better research taste.
Are there particular sources, eg Twitter accounts, that you would recommend following? For other readers (I know Daniel already knows this one), the #papers-running-list channel on the AI Alignment Slack is an ongoing curation of AIS papers.
One source I’ve recently added and recommend is subscribing to individual authors on Semantic Scholar (eg here’s an author page).
Hmm, I don’t think there are people I can single out from my following list that have high individual impact. IMO it’s more that the algorithm has picked up on my trend of engagement and now gives me great discovery.
For someone else to bootstrap this process and give maximum signal to the algorithm, the best thing to do might just be to follow a bunch of AI safety people who:
post frequently
post primarily about AI safety
have reasonably good takes
Some specific people that might be useful:
Neel Nanda (posts about way more than mech interp)
Dylan Hadfield-Menell
David Duvenaud
Stephen Casper
Harlan Stewart (nontechnical)
Rocket Drew (nontechnical)
I also follow several people who signal-boost general AI stuff.
Scaling lab leaders (Jan Leike, Sam A, Dario)
Scaling lab engineers (roon, Aidan McLaughlin, Jason Wei)
Hugging Face team leads (Philipp Schmid, Sebastian Raschka)
Twitter influencers (Teortaxes, janus, near)