I was just reading the paper and realised all training (MSM and AFT) was presumably done with LoRA.
Could you elaborate on why you made this choice? Wouldn’t LoRA introduce some artifacts of its own? And did you run any experiments with full fine-tuning as well?
This is very cool work, thanks!