ZY comments on Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

ZY 1 Mar 2025 6:30 UTC
1 point
0
I haven’t read the full report, so you may have already tested this. One thought is to use something like influence functions to trace which training data (especially from the insecure code) “contributed” to these predictions, and see whether any of that code is related.
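To make the influence-function idea concrete, here is a minimal sketch on a toy logistic regression rather than an LLM. It uses the classic formulation in which the influence of training point i on a test loss is approximated by -g_test^T H^{-1} g_i. The model, the synthetic data, and every function name below are illustrative assumptions, not anything from the report. For a finetuned LLM the Hessian could not be inverted exactly, so an approximation would be needed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, l2=1e-2, steps=500, lr=0.5):
    """Fit L2-regularised logistic regression by plain gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (sigmoid(X @ w) - y) / len(y) + l2 * w
        w -= lr * grad
    return w

def per_example_grads(X, y, w):
    """Gradient of each example's log loss w.r.t. w, shape (n, d)."""
    return (sigmoid(X @ w) - y)[:, None] * X

def influence_scores(X_train, y_train, x_test, y_test, w, l2=1e-2):
    """score_i = -g_test^T H^{-1} g_i for every training point i.

    A negative score means upweighting point i is predicted to lower
    the test loss; a positive score means it is predicted to raise it.
    """
    p = sigmoid(X_train @ w)
    # Hessian of the regularised mean training loss, shape (d, d).
    H = (X_train.T * (p * (1 - p))) @ X_train / len(y_train) + l2 * np.eye(len(w))
    g_train = per_example_grads(X_train, y_train, w)
    g_test = per_example_grads(x_test[None, :], np.array([y_test]), w)[0]
    return -g_train @ np.linalg.solve(H, g_test)

# Toy data (assumed): the label depends mostly on the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(float)

w = fit_logreg(X, y)
# Use the first training point as the "test" prediction to explain.
scores = influence_scores(X, y, X[0], y[0], w)
most_helpful = np.argsort(scores)[:10]  # most negative scores first
```

Sorting by score gives a ranking of training examples by how much each is estimated to help or hurt that one prediction, which is the kind of trace suggested above.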
- Taywon Min 25 May 2025 7:56 UTC
  2 points
  0
  But what if all of the insecure code contributes in some way?
  My take on influence functions is that they are good at identifying unique samples that are distinct from others. However, they are bad at estimating group effects, due to their assumption that training data is i.i.d.
  
  Nevertheless, if one does find a smaller subset of the 6000 data points, say 1000 or fewer, that produces similar levels of misalignment, I think it would be an interesting finding.
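The subset experiment could be organised roughly as follows. This is only a sketch under stated assumptions: `train_and_eval` is a hypothetical callable standing in for "finetune on these examples and return the measured misalignment rate", the per-example scores could come from any attribution method, and the toy numbers are made up so that the snippet runs.

```python
from typing import Callable, List, Optional, Sequence, Tuple

def smallest_sufficient_subset(
    scores: Sequence[float],
    train_and_eval: Callable[[List[int]], float],
    sizes: Sequence[int],
    tolerance: float = 0.05,
) -> Optional[Tuple[int, float]]:
    """Return the smallest top-k subset whose metric is within
    `tolerance` of the metric obtained from the full dataset."""
    # Rank example indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    full_metric = train_and_eval(ranked)
    for k in sorted(sizes):
        metric = train_and_eval(ranked[:k])
        if metric >= full_metric - tolerance:
            return k, metric
    return None

# Toy stand-in (assumed): only examples 0, 2 and 4 drive the effect.
driving = {0, 2, 4}

def toy_train_and_eval(indices: List[int]) -> float:
    return 0.2 * len(driving & set(indices)) / len(driving)

toy_scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.0]
result = smallest_sufficient_subset(toy_scores, toy_train_and_eval, sizes=[1, 2, 3, 6])
```

In a real run each call to `train_and_eval` is a full finetune plus evaluation, so only a handful of subset sizes would be practical. A random subset of the same size would be a natural baseline to compare against.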