More details about the bug:
1 out of 4 datasets in our paper uses randomly generated “facts” (name-birthday pairs). Thanks to an ICLR reviewer, we found that the initial version of this dataset had duplicate names with different birthdays, so we fixed the bug and reran our experiments.
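For concreteness, a dataset like this can be generated in a few lines of code. The sketch below is only illustrative (the name pools and date format are made up, not the paper's actual generation code); the fix amounts to making sure each sampled name appears only once:

```python
import random

# Illustrative name pools; the paper's actual pools and format may differ.
FIRST_NAMES = ["Alice", "Bruno", "Chen", "Dmitri", "Elena", "Farah", "Goran", "Hana"]
LAST_NAMES = ["Anders", "Baptiste", "Costa", "Diallo", "Eriksen", "Fuentes", "Grieg", "Hoang"]

def make_birthday_facts(n_facts: int, seed: int = 0) -> dict[str, str]:
    """Generate n_facts unique name -> birthday pairs."""
    assert n_facts <= len(FIRST_NAMES) * len(LAST_NAMES), "name pool too small"
    rng = random.Random(seed)
    facts: dict[str, str] = {}
    while len(facts) < n_facts:
        name = f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"
        if name in facts:
            # The original bug: letting a repeated name through with a freshly
            # sampled birthday creates contradictory facts for the same person.
            continue
        facts[name] = f"{rng.randint(1, 12):02d}/{rng.randint(1, 28):02d}"
    return facts
```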
In the new, fixed dataset, we cannot recover information with our approach (RTT) after applying unlearning. It looks like unlearning actually removed the information from the weights! (This is different from the old, flawed dataset, where we could recover “unlearned” information after applying unlearning.) Our results for the other three datasets, which test information learned during pretraining rather than information learned by fine-tuning, remain the same: unlearning fails to remove information from model weights.
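For readers who want something concrete, here is a rough sketch of a relearning-style check of this kind: fine-tune the (supposedly) unlearned model on a subset of the facts and probe whether held-out “unlearned” facts resurface. This is an illustrative approximation rather than our exact RTT pipeline; the checkpoint name, prompt format, and hyperparameters are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
# "unlearned-checkpoint" is a placeholder path to a model after unlearning.
model = AutoModelForCausalLM.from_pretrained("unlearned-checkpoint").to(device)
tokenizer = AutoTokenizer.from_pretrained("unlearned-checkpoint")

def finetune_on(facts: dict[str, str], epochs: int = 3, lr: float = 1e-5) -> None:
    """Fine-tune the unlearned model on the facts the attacker has access to."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for name, bday in facts.items():
            batch = tokenizer(f"{name} was born on {bday}.", return_tensors="pt").to(device)
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            opt.step()
            opt.zero_grad()

def recall_accuracy(facts: dict[str, str]) -> float:
    """Check whether held-out 'unlearned' birthdays resurface after fine-tuning."""
    model.eval()
    hits = 0
    for name, bday in facts.items():
        ids = tokenizer(f"{name} was born on", return_tensors="pt").to(device)
        out = model.generate(**ids, max_new_tokens=8, do_sample=False)
        # Exact substring match on the birthday string; format-sensitive, fine for a sketch.
        hits += bday in tokenizer.decode(out[0, ids["input_ids"].shape[1]:])
    return hits / len(facts)

# attacker_facts and heldout_facts are disjoint subsets of the "unlearned" facts:
# finetune_on(attacker_facts); print(recall_accuracy(heldout_facts))
```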
So doing unlearning evaluations on synthetic facts fine-tuned in after pretraining (like in TOFU) is probably not a very reliable way of evaluating unlearning, as it might overestimate how reliably unlearning removes information. It’s unclear how bad the problem is: we only observed this overestimation on a single unlearning dataset (n=1), but it was consistent across multiple fine-tuning and unlearning runs using this dataset.
The conclusion of our stress-testing experiments (on the name-birthday pairs dataset) also remains the same: RTT does recover information if you fine-tune the facts in using one set of layers and then try to “unlearn” them with gradient ascent applied to a different set of layers.
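Concretely, the layer-split setup looks roughly like this. It is a minimal sketch assuming a GPT-2-style HuggingFace model; the layer indices, learning rate, and `fact_batch` are illustrative placeholders, not our actual configuration.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

def set_trainable_layers(model, layer_indices) -> None:
    """Freeze all parameters except the transformer blocks in layer_indices."""
    for p in model.parameters():
        p.requires_grad = False
    for i in layer_indices:
        for p in model.transformer.h[i].parameters():
            p.requires_grad = True

def sgd_step(model, batch, sign: int, lr: float = 1e-5) -> None:
    """One update on the trainable layers; sign=+1 is descent, sign=-1 is gradient ascent."""
    loss = model(**batch, labels=batch["input_ids"]).loss
    (sign * loss).backward()
    with torch.no_grad():
        for p in model.parameters():
            if p.requires_grad and p.grad is not None:
                p -= lr * p.grad
                p.grad = None

# 1) Fine-tune the facts in, updating only the first half of the blocks:
#    set_trainable_layers(model, range(0, 6)); sgd_step(model, fact_batch, sign=+1)
# 2) "Unlearn" with gradient ascent, updating only the other half:
#    set_trainable_layers(model, range(6, 12)); sgd_step(model, fact_batch, sign=-1)
```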
We updated the results in the arXiv paper (with a historical note in the Appendix): https://arxiv.org/abs/2410.08827
Do you have any idea whether the difference in unlearning success between synthetic facts fine-tuned in after pretraining and real facts introduced during pretraining comes mainly from the ‘synthetic’ part or the ‘fine-tuning’ part? I.e., if you took the synthetic facts dataset and spread it out through the pretraining corpus, would you expect the synthetic facts to be any harder to unlearn? Or maybe this question doesn’t make sense because you’d have to make the dataset much larger (or something) to get the model to learn the facts at all during pretraining? If so, it seems like a pretty interesting research question to try to understand which properties a dataset of synthetic facts needs to have to defeat unlearning.
I think it’s somewhat unclear, probably a bit of both:
- Previous work has found that it’s surprisingly hard to poison pretraining with synthetic facts (https://arxiv.org/pdf/2410.13722).
- Previous work has found that it’s easy to revert the effects of fine-tuning (e.g. the toy experiments in https://arxiv.org/pdf/2311.12786 or https://arxiv.org/abs/2405.19550), though I think there is no experiment which directly compares pretraining and fine-tuning in the exact way we want.
Additionally, LLMs are pretty good at memorizing very “random” data, such as big-bench canaries or phone numbers, even when they only appear a few times in pretraining. So my guess is that the “fine-tuning” part is where most of the effect is.