Interpretability Researcher at Apollo Research
Nicholas Goldowsky-Dill
A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team
Apollo Research 1-year update
The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks
Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning
Cool paper! I enjoyed reading it and think it provides some useful information on what adding carefully chosen bias vectors into LLMs can achieve. Some assorted thoughts and observations.
I found skimming through some of the ITI completions in the appendix very helpful, and I’d recommend others do the same.
GPT-Judge is not very reliable. A reasonable fraction of the responses (10-20%, maybe?) seem to be misclassified.
I think it would be informative to see truthfulness according to a human judge, although of course this is labor intensive.
A lot of the mileage seems to come from stylistic differences in the output
My impression is the ITI model is more humble, fact-focused, and better at avoiding the traps that TruthfulQA sets for it.
I’m worried that people may read this paper and think it conclusively proves there is a cleanly represented truthfulness concept inside of the model.[1] It’s not clear to me we can conclude this if ITI is mostly encouraging the model to make stylistic adjustments that cause it to make fewer false statements on this particular dataset.
One way of understanding this (in a simulators framing) is that ITI encourages the model to take on a persona of someone who is more careful, truthful, and hesitant to make claims that aren’t backed up by good evidence. This persona is particularly selected to do well on TruthfulQA (in that it emulates the example true answers as contrasted to the false ones). Being able to elicit this persona with ITI doesn’t require the model to have a “truthfulness direction”, although it obviously helps the model simulate better if it knows what facts are actually true!
Note that this sort of stylistic update is exactly the sort you’d expect prompting to do well at.
Some other very minor comments:
I find Figure 6B confusing, as I’m not really sure how the categories relate to one another. Additionally, is the red line also a percentage?
There’s a bug in the model output to latex pipeline that causes any output after a percentage sign to not be shown.
[1] To be clear, the authors don’t claim this and I’m not intending this as a criticism of them.
My summary of the paper:
Setup
Dataset is TruthfulQA (Lin, 2021), which contains various tricky questions, many of them meant to lead the model into saying falsehoods. These often involve common misconceptions / memes / advertising slogans / religious beliefs / etc. A “truthful” answer is defined as not saying a falsehood. An “informative” answer is defined as actually answering the question. This paper measures the frequency of answers that are both truthful and informative.
“Truth” on this dataset is judged by a finetuned version of GPT3 which was released in the original TruthfulQA paper. This judge is imperfect, and in particular will somewhat frequently classify false answers as truthful.
Finding truthful heads and directions
The TruthfulQA dataset comes with a bunch of example answers labeled true (T) or false (F). They run the model on concatenated question + answer pairs, and look at the activations at the last sequence position.
They train a linear probe on the activations of every attention head (post attention, pre W^O multiplication) to classify T vs F example answers. They check which attention heads a probe can be successfully learned at, and select the top 48 attention heads (by classifier accuracy).
For each of these heads they choose a “truthful direction” based on the difference of means between T and F example answers. (Or by using the probe’s own direction, i.e. the one orthogonal to its separating hyperplane, but diff of means performs better.)
They then run the model on validation TruthfulQA prompts. For each of the chosen attention heads they insert a bias in the truthful direction at every sequence position. The bias is large — 15x the standard deviation in this direction.
They find this significantly increases the TruthfulQA score. It does better than supervised finetuning, but less well than few-shot prompting. It combines reasonably well when stacked on top of few-shot prompting or instruction fine-tuned models.
Note that in order to have a fair comparison they use 5% of their data for each method (~300 question answer pairs). This is more than you would usually use for prompting, and less than you’d normally like for SFT.
One of the main takeaways is that this method is reasonably data-efficient and comparably good to prompting (although it requires a dual dataset of good demonstrations and bad demonstrations).
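To make the pipeline summarized above concrete, here is a rough NumPy sketch of the three steps: probing each head, taking a difference-of-means direction, and adding a scaled bias. This is not the authors’ code. It assumes the per-head activations at the last sequence position have already been extracted into an array `acts` of shape `(n_examples, n_heads, d_head)`. The function names, the plain logistic-regression probe, and the 80/20 held-out split are all my own illustrative choices. The summary only says “linear probe” and “15x the standard deviation in this direction”.

```python
import numpy as np

def probe_accuracy(train_x, train_y, val_x, val_y, steps=500, lr=0.1):
    """Fit a logistic-regression probe by gradient descent (an assumed probe
    type) and return its accuracy on the held-out examples."""
    mu, sd = train_x.mean(0), train_x.std(0) + 1e-8
    xt, xv = (train_x - mu) / sd, (val_x - mu) / sd
    w, b = np.zeros(xt.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(xt @ w + b)))
        grad = p - train_y
        w -= lr * xt.T @ grad / len(train_y)
        b -= lr * grad.mean()
    preds = (xv @ w + b) > 0
    return float((preds == (val_y == 1)).mean())

def select_heads(acts, labels, k, val_frac=0.2, seed=0):
    """Rank heads by held-out probe accuracy and return the top-k indices.
    acts: (n_examples, n_heads, d_head); labels: 1 = true answer, 0 = false."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(labels))
    n_val = int(len(labels) * val_frac)  # hypothetical split size
    val, train = idx[:n_val], idx[n_val:]
    accs = [
        probe_accuracy(acts[train, h], labels[train], acts[val, h], labels[val])
        for h in range(acts.shape[1])
    ]
    return np.argsort(accs)[::-1][:k]

def truthful_direction(head_acts, labels):
    """Unit-norm difference of means between true and false example answers."""
    d = head_acts[labels == 1].mean(0) - head_acts[labels == 0].mean(0)
    return d / np.linalg.norm(d)

def steer(head_out, direction, sigma, alpha=15.0):
    """Add a bias of alpha * sigma along `direction` at every sequence position.
    head_out: (seq_len, d_head); sigma: std of activations projected onto direction."""
    return head_out + alpha * sigma * direction
```

Here `sigma` would be computed as `(head_acts @ direction).std()` for each selected head, and `steer` would be applied to that head’s output before the W^O multiplication. How the bias is actually hooked into a real model depends on the framework and is left out of this sketch.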
Have you looked at how the dictionaries represent positional information? I worry that the SAEs will learn semi-local codes that intermix positional and semantic information in a way that makes things less interpretable.
To investigate this, one could take each feature and calculate the fraction of variance in its activations that can be explained by position. If this variance-explained is either ~0% or ~100% for every feature, I’d be satisfied that positional and semantic information are being well separated into separate features.
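A minimal sketch of that check, assuming the SAE feature activations have been collected into an array of shape `(n_seqs, seq_len, n_features)` with every sequence the same length. The function name and array layout are hypothetical. “Variance explained by position” is taken here to mean the variance of the per-position mean activation divided by the total variance, which is one reasonable reading rather than the only one.

```python
import numpy as np

def position_variance_explained(feature_acts, eps=1e-12):
    """Fraction of each feature's activation variance explained by position.

    feature_acts: (n_seqs, seq_len, n_features), equal-length sequences assumed.
    Returns an array of shape (n_features,) with values in [0, 1].
    """
    # Mean activation of each feature at each position, averaged over sequences.
    per_position_mean = feature_acts.mean(axis=0)       # (seq_len, n_features)
    # Total variance over all (sequence, position) pairs.
    total_var = feature_acts.var(axis=(0, 1))           # (n_features,)
    # Variance of the per-position means: the part position alone accounts for.
    between_var = per_position_mean.var(axis=0)         # (n_features,)
    # Dead features (zero total variance) come out as 0 rather than NaN.
    return between_var / np.maximum(total_var, eps)
```

Features with values near 0 would be purely non-positional and features near 1 purely positional. A pile of features at intermediate values would be the worrying semi-local case.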
In general, I think it makes sense to special-case positional information. Even if positional information is well separated, I expect converting it into SAE features probably hurts interpretability. Special-casing it is easy to do in shortformers[1] or rotary models (as positional information isn’t added to the residual stream). One would have to work a bit harder for GPT2 but it still seems worthwhile imo.
[1] Position embeddings are trained but only added to the key and query calculation, see Section 5 of this paper.