Do you have any idea about whether the difference between unlearning success on synthetic facts fine-tuned in after pretraining vs. real facts introduced during pretraining comes mainly from the ‘synthetic’ part or the ‘fine-tuning’ part? I.e., if you took the synthetic-facts dataset and spread it out through the pretraining corpus, would you expect the synthetic facts to be any harder to unlearn? Or maybe this question doesn’t make sense because you’d have to make the dataset much larger or something to get the model to learn the facts at all during pretraining? If so, it seems like a pretty interesting research question to try to understand which properties a dataset of synthetic facts needs to have to defeat unlearning.
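To make the manipulation I have in mind concrete, here’s a rough sketch of what spreading the synthetic-facts dataset through a pretraining stream could look like. This is purely illustrative: the loader functions, the mixing rate, and the corpus sizes are made-up placeholders, not any existing pipeline.

```python
import random
from typing import Iterator, List


def load_pretraining_docs() -> Iterator[str]:
    # Placeholder: in a real experiment this would stream the actual pretraining corpus.
    for i in range(1_000_000):
        yield f"ordinary pretraining document {i}"


def load_synthetic_fact_docs() -> List[str]:
    # Placeholder: the synthetic-facts dataset, e.g. each fact restated in a few paraphrases.
    return [f"synthetic fact {i}, stated as a short document" for i in range(1_000)]


def mixed_pretraining_stream(mix_rate: float = 1e-4, seed: int = 0) -> Iterator[str]:
    """Interleave synthetic-fact docs into the pretraining stream at a low rate,
    so each fact is seen a handful of times spread across training rather than
    in one concentrated fine-tuning phase. mix_rate controls roughly how often
    a synthetic doc is injected between ordinary documents."""
    rng = random.Random(seed)
    synthetic = load_synthetic_fact_docs()
    for doc in load_pretraining_docs():
        yield doc
        if rng.random() < mix_rate:
            yield rng.choice(synthetic)
```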
I think it’s somewhat unclear, probably a bit of both:
- Previous work has found that it’s surprisingly hard to poison pretraining with synthetic facts (https://arxiv.org/pdf/2410.13722).
- Previous work has found that it’s easy to revert the effects of fine-tuning (e.g. the toy experiments in https://arxiv.org/pdf/2311.12786 or https://arxiv.org/abs/2405.19550), though I think there is no experiment that directly compares pretraining and fine-tuning in the exact way we want.
Additionally, LLMs are pretty good at memorizing very “random” data, such as BIG-bench canaries or phone numbers, even when they only appear a few times in pretraining. So my guess is that the “fine-tuning” part is where most of the effect is.
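For what it’s worth, the direct comparison I’d want could look roughly like the sketch below: take two models that learned the same synthetic facts by the two different routes (mixed into pretraining vs. fine-tuned in afterwards), apply the same unlearning method to both, and measure how much of the knowledge survives, including after a small relearning pass, since “unlearned” facts often come back quickly. Everything here is a hypothetical harness; the unlearn/relearn/evaluate callables are placeholders to be filled in by the experimenter.

```python
from typing import Callable, Dict, List

Model = object  # stand-in for whatever model object the experiment uses


def compare_unlearning_robustness(
    models: Dict[str, Model],                       # e.g. {"pretrained-in": m1, "fine-tuned-in": m2}
    facts: List[str],                               # the shared synthetic facts
    unlearn: Callable[[Model, List[str]], Model],   # the unlearning method under test
    relearn: Callable[[Model, List[str]], Model],   # small relearning/fine-tuning pass
    evaluate: Callable[[Model, List[str]], float],  # e.g. QA accuracy on the facts
) -> Dict[str, Dict[str, float]]:
    """Apply the same unlearning method to both models and measure how much of
    the synthetic-fact knowledge survives: before unlearning, after unlearning,
    and after a small relearning pass (to catch knowledge that was merely hidden)."""
    results: Dict[str, Dict[str, float]] = {}
    for name, model in models.items():
        unlearned = unlearn(model, facts)
        results[name] = {
            "before_unlearning": evaluate(model, facts),
            "after_unlearning": evaluate(unlearned, facts),
            "after_relearning": evaluate(relearn(unlearned, facts), facts),
        }
    return results
```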