Taywon Min comments on Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Taywon Min 25 May 2025 7:56 UTC
2 points
But what if all of the insecure code samples contribute in some way?
My take on influence functions is that they are good at identifying individual samples that stand out from the rest. However, they are bad at estimating group effects, because they assume the training data is i.i.d.
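To make the group-effect worry concrete, here is a toy sketch (not from the paper). It assumes a scalar ridge regression on synthetic data, and the names `fit` and `relative_error` are made up for illustration. It compares the first-order influence estimate of removing a set of points against actually refitting without them. Note that it illustrates a related but different failure than the i.i.d. point: the estimate reuses the full-data Hessian, so it drifts as the removed group gets larger.

```python
import random


def fit(xs, ys, n, lam):
    """Minimize (1/n) * sum 0.5*(x*theta - y)^2 + 0.5*lam*theta^2 (scalar theta)."""
    hessian = sum(x * x for x in xs) / n + lam
    theta = (sum(x * y for x, y in zip(xs, ys)) / n) / hessian
    return theta, hessian


def relative_error(group, xs, ys, lam=0.01):
    """Relative error of the influence estimate for removing `group`."""
    n = len(xs)
    theta, hessian = fit(xs, ys, n, lam)

    # First-order estimate: sum per-sample gradients at the fitted theta,
    # then apply the inverse of the full-data Hessian.
    grad_sum = sum(xs[i] * (xs[i] * theta - ys[i]) for i in group)
    predicted = grad_sum / (n * hessian)

    # Ground truth: refit without the group, keeping the same 1/n scaling
    # so the comparison matches the "downweight to zero" setting.
    removed = set(group)
    kept = [i for i in range(n) if i not in removed]
    theta_retrained, _ = fit([xs[i] for i in kept], [ys[i] for i in kept], n, lam)
    actual = theta_retrained - theta

    return abs(predicted - actual) / abs(actual)


# Synthetic data (an assumption for illustration only).
random.seed(0)
xs = [random.gauss(0, 1) for _ in range(200)]
ys = [2.0 * x + random.gauss(0, 0.5) for x in xs]

single_error = relative_error([0], xs, ys)
group_error = relative_error(list(range(100)), xs, ys)
print(f"single-sample relative error: {single_error:.4f}")
print(f"100-sample group relative error: {group_error:.4f}")
```

In this toy setup the relative error works out to the fraction of the Hessian contributed by the removed points, so it is small for one sample and large for half the dataset. Real LLM finetuning is far from this quadratic case, so treat it only as intuition.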

Nevertheless, if one does find a smaller subset of the 6000 data points, perhaps 1000 or fewer, that produces similar levels of misalignment, I think it would be an interesting finding.