Exciting! I think getting expressive, natural-language explanations of LLM internals is an underexplored area of interpretability.
Do you think it would be possible to adapt your technique to interpret model weights? For example, training an oracle to answer questions about LoRAs. I guess if you look merely at activations, you won’t detect rare behaviors (like backdoors) until they actually come up.
Writing this comment, I remembered Goel et al., which did something like what I mentioned above (you actually cited them in your paper). Their blog post mentions that a limitation of their technique is “poor cross-behavior generalization.” I’m wondering if you or @avichal have some insight into why Activation Oracles seemingly generalize better than DIT Adapters.
In some sense I think our technique is already interpreting LoRAs, e.g. by successfully identifying a misaligned model via its activations or activation diffs. As you point out, it will probably fail to detect rare behavior that isn’t present in the activations.
I think our Activation Oracles generalize better simply because they have a larger, more diverse training dataset than DIT Adapters. I’m guessing that if DIT were scaled up, we would see improved generalization.
This is one advantage of the Activation Oracle approach: it can be trained on text, and it’s easy to collect 100M tokens of text. In contrast, DIT adapters require many trained LoRAs, and it’s difficult and expensive to create a wide, diverse dataset of LoRA adapters.
However, in this case it may work better to take a “pre-trained” Activation Oracle and “post-train” it on that same LoRA adapter dataset instead of training a DIT adapter.
Yes, I’m actually involved in some Goel et al. follow-up work that I’m very excited about! I’d say that we’re finding generalization intermediate between the weak generalization in Goel et al. and the strong generalization in our work on Activation Oracles. (And I’d guess that the main reason our generalization is stronger than that in Goel et al. is due to scaling to more diverse data—though it’s also possible that model scale is playing a role.)
One thing that I’ve been struggling with lately is that there’s a substantial difference between the Activation Oracle form factor (plug in activations, get text) and the Goel et al. setup, where we directly train a model to explain its own cognition without needing to pass activations around. It intuitively feels to me like this difference is surface-level (and that the Goel et al. form factor is better), but I haven’t been able to come up with a way to unify the approaches.
Could you say more about what you see as the pros and cons of each approach? Like, I agree it’s nice that in Goel et al. you can ask questions in 100% natural language rather than having to insert activations. What are the nice things about AOs that you want to keep?
@Adam Karvonen’s comment mentions that DIT-adapters are limited because they require many trained LoRAs. But it seems to me like you could train a DIT-adapter to do all the same tasks you train an AO to do, without necessarily changing the model weights. (Maybe Interpretation Tuning would be a better name than Diff Interpretation Tuning in this case.)
Example:
In the AO paper, you prompt the target model with “She walked to”. Then you grab the activation from “to” and ask the AO “<ACT> Can you predict the next 2 tokens?” You train the AO to answer with “school today”.
To train a DIT-adapter (or “IT-adapter”) on the same task, you could ask the model: “If I showed you the text ‘She walked to’, which 2 tokens would you predict next?” You train the adapter to make the model answer with “school today”.
If you wanted to be able to interpret weight diffs as well, you could optionally add in some of the training tasks from the DIT-adapter paper. If this works well in practice, it could be one way to unify the two approaches.
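To make this concrete, here’s a rough sketch of what one training example might look like in each setup. I’m using gpt2 as a stand-in target model, and the helper names and the handling of the <ACT> placeholder are just my guesses, not the actual pipeline from either paper:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # small stand-in target model
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

def make_ao_example(prompt: str, continuation: str, layer: int = 6):
    """AO-style example: an activation goes in, text comes out."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**ids).hidden_states[layer]   # [1, seq_len, d_model]
    act = hidden[0, -1]                              # activation at the final token ("to")
    return {
        "activation": act,                           # spliced in wherever <ACT> appears
        "question": "<ACT> Can you predict the next 2 tokens?",
        "answer": continuation,
    }

def make_it_example(prompt: str, continuation: str):
    """IT-adapter-style example: pure text in, text out; no activations passed around."""
    return {
        "question": f"If I showed you the text '{prompt}', which 2 tokens would you predict next?",
        "answer": continuation,
    }

ao_ex = make_ao_example("She walked to", "school today")
it_ex = make_it_example("She walked to", "school today")
```

Both examples carry the same supervision signal; the only difference is whether the prompt information reaches the interpreter as text or as an activation.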
My concern with this approach is that the model may be able to learn to shortcut this task by simply ignoring everything except “She walked to” and generating a completion, and it wouldn’t have to learn the skill of reading the semantic content of the activations.
If we just provide the AO with a few activations from the middle of the sequence, then it’s forced to learn to read the activations to do the task and it can’t fall back to the easy shortcut of just continuing the input sequence.
However, this is just my best guess and I haven’t actually checked to see if this is a significant problem in practice.
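To make the contrast concrete, here’s roughly the kind of setup I have in mind, as a sketch only (gpt2 stand-in, made-up helper and placeholder names, not our actual training code): the AO receives a few activations from the middle of the sequence and never sees the surrounding text, so “just keep completing the input” isn’t an available shortcut.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

def mid_sequence_example(text: str, question: str, answer: str, layer: int = 6):
    """Build an example where the AO only sees a few mid-sequence activations."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**ids).hidden_states[layer][0]   # [seq_len, d_model]
    mid = hidden.shape[0] // 2
    positions = list(range(max(mid - 1, 0), min(mid + 2, hidden.shape[0])))
    return {
        "activations": hidden[positions],               # the AO's only view of the text
        "question": question,                           # e.g. "<ACTS> What is being discussed here?"
        "answer": answer,
    }
```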
Good point. You could train a (D)IT-adapter to read activations as well, if that turned out to help it generalize to other tasks. Or add other tasks that are easy to generate but not so trivial to solve (a rough sketch of generating a couple of these follows the list), like:
What are the 5 most likely next tokens, in order?
Rank tokens X, Y, and Z by likelihood.
What token will you write 3 tokens from now?
What is the likelihood of token X to 3 significant figures?
Is token X more likely in Context A or in Context B?
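Here’s roughly how a couple of these could be auto-generated from the target model’s own next-token distribution (gpt2 as a stand-in; the prompt wording and helper names are made up, not anyone’s actual pipeline):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def next_token_probs(prompt: str) -> torch.Tensor:
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**ids).logits[0, -1]   # logits for the next token
    return torch.softmax(logits, dim=-1)

def top5_task(prompt: str):
    probs = next_token_probs(prompt)
    top = torch.topk(probs, k=5)
    tokens = [tok.decode([i]) for i in top.indices.tolist()]
    return {
        "question": f"For '{prompt}': what are the 5 most likely next tokens, in order?",
        "answer": ", ".join(repr(t) for t in tokens),
    }

def likelihood_task(prompt: str, candidate: str):
    probs = next_token_probs(prompt)
    cand_id = tok(candidate, add_special_tokens=False)["input_ids"][0]
    return {
        "question": f"For '{prompt}': what is the probability of {candidate!r}, to 3 significant figures?",
        "answer": f"{probs[cand_id].item():.3g}",
    }
```

These are cheap to generate at scale, but answering them accurately requires something closer to introspection on the model’s own distribution than plain text continuation.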
Side note: I like the fact that you can remove the (D)IT-adapter when you’re not using it for interpretability. I get the feeling that improving the model’s metacognitive skills during general deployment could make it easier for it to do things like CoT obfuscation.
I agree with @Adam Karvonen’s parallel comment. Expanding on it a bit, one way to think about things is that, by forcing an AO’s explanations to go through a “bottleneck” of some extracted activations, we make tasks “artificially” harder than if we were to give the AO the original input. This is most clear in the case of the “text inversion” task in our paper, where the AO is trained to recover the text that produced some activation. This is a trivial task if the AO were allowed to see the original text, but becomes difficult (and therefore useful for training) when we force the AO to work with activations instead of the original text input.
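For concreteness, here’s a minimal sketch of what a text-inversion training pair looks like in spirit (gpt2 stand-in; the layer choice and placeholder token are illustrative, not our exact setup). The label is just the original text, so the data is essentially free, and the only way to get it right is to read the text back out of the activations:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

def text_inversion_example(text: str, layer: int = 6):
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**ids).hidden_states[layer][0]   # [seq_len, d_model]
    return {
        "activations": hidden,                          # the only thing the AO gets to see
        "question": "<ACTS> What text produced these activations?",
        "answer": text,                                 # the label is just the original input
    }
```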
To some extent, I view this strategy—making training tasks more difficult by introducing an activation bottleneck—as a bit of a “trick.” As a result (1) I’m not sure how far we can push it (i.e. maybe there’s only a bounded amount more “juice” we can get out of training tasks by applying this trick) and (2) I’m interested in ways to remove it and do something more principled.