Could you compare the average activation of the LoRA network between prompts that the original Llama model refused and prompts that it allowed? I would expect more going on when the LoRA has to modify the output.
Using linear probes in a similar manner could also be interesting; a rough sketch is below.
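A minimal probe sketch, assuming you've already cached one activation vector per prompt (e.g. the residual stream or the LoRA delta at some layer) as rows of X, with y = 1 for prompts the base model refused and 0 for prompts it allowed. The file names and array shapes are hypothetical placeholders:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# cached activations: one row per prompt (hypothetical file names)
X = np.load("activations.npy")      # shape (n_prompts, d_model)
y = np.load("refusal_labels.npy")   # shape (n_prompts,), 1 = refused

probe = LogisticRegression(max_iter=1000)
scores = cross_val_score(probe, X, y, cv=5)
print(f"probe accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```

Accuracy well above chance would suggest the refused/allowed distinction is linearly represented at that layer.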
Do you have some background in interp? I could give you access if you are interested. I did some minimal work trying to get it running in TransformerLens: you can load the weights so that they are kept as separate LoRA A and B matrices instead of being merged into the model, and then add some kind of hook, either with TransformerLens or in plain PyTorch, along the lines of the sketch below.
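A minimal sketch in plain PyTorch for the activation comparison, assuming the LoRA A/B weights are loaded as separate nn.Linear submodules rather than merged into the base weights. The `lora_B` module name follows the common PEFT convention and may differ in your checkpoint; `refused_prompts` / `allowed_prompts` are hypothetical names for whatever split you used when checking the base Llama's refusals:

```python
import torch

def register_lora_norm_hooks(model, records):
    """Attach forward hooks to every LoRA B projection and record the
    mean L2 norm of its output, i.e. the delta the adapter adds."""
    handles = []
    for name, module in model.named_modules():
        if "lora_B" in name:  # the low-rank up-projection (PEFT naming)
            def hook(mod, inputs, output, name=name):
                # output: (batch, seq, d_model); norm per token, then average
                records.setdefault(name, []).append(
                    output.detach().float().norm(dim=-1).mean().item()
                )
            handles.append(module.register_forward_hook(hook))
    return handles

@torch.no_grad()
def mean_lora_activation(model, tokenizer, prompts):
    records = {}
    handles = register_lora_norm_hooks(model, records)
    for p in prompts:
        ids = tokenizer(p, return_tensors="pt").to(model.device)
        model(**ids)
    for h in handles:
        h.remove()
    # average over prompts, per LoRA layer
    return {k: sum(v) / len(v) for k, v in records.items()}

# refused = mean_lora_activation(model, tokenizer, refused_prompts)
# allowed = mean_lora_activation(model, tokenizer, allowed_prompts)
```

Comparing the per-layer averages between the two prompt sets would directly test the expectation that the adapter is doing more work on prompts where it has to override a refusal.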