I found a couple more days to work on this. Quick updates:
Reducing the LoRA rank does improve generalisation whilst retaining the improvement on the fine-tuning dataset. (The original LoRA rank was 64; 32 is roughly the sweet spot.) However, generalisation never quite recovers to the pre-fine-tuning performance, so it's still overfitting to something in the fine-tuning dataset.
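For concreteness, a minimal sketch of what the rank change looks like with `peft`, assuming a standard LoRA setup; the base model, target modules, and alpha here are placeholders rather than the actual configuration:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder base model; the actual model and target modules differ.
base_model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    r=32,                      # reduced from the original rank of 64
    lora_alpha=64,             # assumed scaling, kept proportional to r
    target_modules=["c_attn"], # assumed attention projection for this placeholder model
    lora_dropout=0.05,
)

model = get_peft_model(base_model, lora_config)
```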
I haven't made much headway with characterising the difference between the two representations. I've been running text from the lmsys-chat dataset through the two models to see if I can find conversations where the probe gives different answers for each. So far this hasn't yielded any clear patterns. I need to refine my metric for 'different answers' a bit, and then will probably have to scale to more conversations.
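As a rough illustration of the kind of disagreement metric I have in mind, here is a sketch that mean-pools hidden states at one layer and ranks conversations by the absolute difference in probe score between the two models; the layer index, pooling choice, and probe interface are all assumptions, not the actual setup:

```python
import torch

def probe_disagreement(texts, model_a, model_b, tokenizer, probe, layer=-1):
    """Rank texts by how differently a linear probe reads the two models.

    `probe` is assumed to be a linear layer mapping a pooled hidden state
    to a single logit; the layer and pooling are placeholder choices.
    """
    scores = []
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt", truncation=True)
        with torch.no_grad():
            h_a = model_a(**inputs, output_hidden_states=True).hidden_states[layer]
            h_b = model_b(**inputs, output_hidden_states=True).hidden_states[layer]
            # Mean-pool over tokens, then apply the probe to each representation.
            p_a = torch.sigmoid(probe(h_a.mean(dim=1))).item()
            p_b = torch.sigmoid(probe(h_b.mean(dim=1))).item()
        scores.append((abs(p_a - p_b), text))
    # Largest disagreements first.
    return sorted(scores, key=lambda s: s[0], reverse=True)
```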