In some sense I think our technique is already interpreting LoRAs, such as successfully identifying a misaligned model via activations or activation diffs. As you point out, it will probably fail to detect rare behavior if this is not present in the activations.
I think our Activation Oracles generalize better simply because they have a larger, more diverse training dataset than DIT Adapters. I’m guessing if DIT was scaled up we would see improved generalization.
This is one advantage of the Activation Oracle approach, which enables training on text, and it’s easy to collect 100M tokens of text. In contrast, DIT adapters require many trained LoRAs. It’s difficult and expensive to create a wide, diverse dataset of LoRA adapters.
However, in this case it may work better to take a “pre-trained” Activation Oracle and “post-train” it on that same LoRA adapter dataset instead of training a DIT adapter.
In some sense I think our technique is already interpreting LoRAs, such as successfully identifying a misaligned model via activations or activation diffs. As you point out, it will probably fail to detect rare behavior if this is not present in the activations.
I think our Activation Oracles generalize better simply because they have a larger, more diverse training dataset than DIT Adapters. I’m guessing if DIT was scaled up we would see improved generalization.
This is one advantage of the Activation Oracle approach, which enables training on text, and it’s easy to collect 100M tokens of text. In contrast, DIT adapters require many trained LoRAs. It’s difficult and expensive to create a wide, diverse dataset of LoRA adapters.
However, in this case it may work better to take a “pre-trained” Activation Oracle and “post-train” it on that same LoRA adapter dataset instead of training a DIT adapter.