Good point. You could train a (D)IT-adapter to read activations as well, if that turned out to help it generalize to other tasks. Or add other tasks that are easy to generate but not so trivial to solve, like:
What are the 5 most likely next tokens, in order?
Rank tokens X, Y, and Z by likelihood.
What token will you write 3 tokens from now?
What is the likelihood of token X to 3 significant figures?
Is token X more likely in Context A or in Context B?
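As a rough sketch of how training labels for these tasks could be generated cheaply, here is a toy example that derives each answer from a model's next-token distribution. The vocabulary and logit values are made up for illustration; in practice the logits would come from the model being trained (the "3 tokens from now" task is omitted, since it needs an actual rollout rather than a single distribution):

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over next-token logits.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

# Toy vocabulary and logits standing in for a real model's next-token
# distribution (illustrative values, not taken from any actual model).
vocab = ["the", "a", "cat", "dog", "sat", "ran", "on"]
logits = np.array([2.1, 1.7, 0.9, 0.8, 0.3, 0.1, -0.5])
probs = softmax(logits)

# "What are the 5 most likely next tokens, in order?"
top5 = [vocab[i] for i in np.argsort(probs)[::-1][:5]]

# "Rank tokens X, Y, and Z by likelihood."
targets = ["cat", "sat", "dog"]
ranked = sorted(targets, key=lambda t: probs[vocab.index(t)], reverse=True)

# "What is the likelihood of token X to 3 significant figures?"
p_cat = float(f"{probs[vocab.index('cat')]:.3g}")

# "Is token X more likely in Context A or in Context B?"
# Two toy distributions standing in for the model's outputs in two contexts.
probs_a = probs
probs_b = softmax(np.array([0.2, 0.4, 2.5, 0.8, 0.3, 0.1, -0.5]))
more_likely_in = ("A" if probs_a[vocab.index("cat")] > probs_b[vocab.index("cat")]
                  else "B")

print(top5)            # the five most likely tokens, most likely first
print(ranked)          # targets ordered by probability
print(p_cat)           # probability of "cat" to 3 significant figures
print(more_likely_in)  # context in which "cat" is more likely
```

The point of the sketch is just that the ground-truth answers fall out of ordinary logit arithmetic, so the hard part is the model introspecting correctly, not generating the labels.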
Side note: I like the fact that you can remove the (D)IT-adapter when you’re not using it for interpretability. I get the feeling that improving the model’s metacognitive skills during general deployment could make it easier for it to do things like CoT obfuscation.