Could you compare the average activation of the LoRA network between prompts that the original Llama model refused and prompts that it allowed? I would expect more going on when the LoRA has to modify the output.
Using linear probes in a similar manner could also be interesting; a rough sketch is below.
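A minimal probe sketch, assuming you've already cached one activation vector per prompt (e.g. the residual stream or the LoRA delta at some layer) as rows of X, with y = 1 for prompts the base model refused and 0 for prompts it allowed. The file names and array shapes are hypothetical placeholders:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# cached activations: one row per prompt (hypothetical file names)
X = np.load("activations.npy")      # shape (n_prompts, d_model)
y = np.load("refusal_labels.npy")   # shape (n_prompts,), 1 = refused

probe = LogisticRegression(max_iter=1000)
scores = cross_val_score(probe, X, y, cv=5)
print(f"probe accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```

Accuracy well above chance would suggest the refused/allowed distinction is linearly represented at that layer.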
Do you have some background in interp? I could give you access if you are interested. I did some minimal work trying to get it running in TransformerLens: you can load the weights so that they are kept as separate LoRA A and B matrices instead of being merged into the model, and then add some kind of hook, either with TransformerLens or in plain PyTorch, along the lines of the sketch below.
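A minimal sketch in plain PyTorch for the activation comparison, assuming the LoRA A/B weights are loaded as separate nn.Linear submodules rather than merged into the base weights. The `lora_B` module name follows the common PEFT convention and may differ in your checkpoint; `refused_prompts` / `allowed_prompts` are hypothetical names for whatever split you used when checking the base Llama's refusals:

```python
import torch

def register_lora_norm_hooks(model, records):
    """Attach forward hooks to every LoRA B projection and record the
    mean L2 norm of its output, i.e. the delta the adapter adds."""
    handles = []
    for name, module in model.named_modules():
        if "lora_B" in name:  # the low-rank up-projection (PEFT naming)
            def hook(mod, inputs, output, name=name):
                # output: (batch, seq, d_model); norm per token, then average
                records.setdefault(name, []).append(
                    output.detach().float().norm(dim=-1).mean().item()
                )
            handles.append(module.register_forward_hook(hook))
    return handles

@torch.no_grad()
def mean_lora_activation(model, tokenizer, prompts):
    records = {}
    handles = register_lora_norm_hooks(model, records)
    for p in prompts:
        ids = tokenizer(p, return_tensors="pt").to(model.device)
        model(**ids)
    for h in handles:
        h.remove()
    # average over prompts, per LoRA layer
    return {k: sum(v) / len(v) for k, v in records.items()}

# refused = mean_lora_activation(model, tokenizer, refused_prompts)
# allowed = mean_lora_activation(model, tokenizer, allowed_prompts)
```

Comparing the per-layer averages between the two prompt sets would directly test the expectation that the adapter is doing more work on prompts where it has to override a refusal.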