Thanks for the excellent note. I wanted to offer an alternative hypothesis on the unlearning of the random birthday (RB) dataset: the RB dataset might be effectively unlearned primarily because the labels are completely randomly generated.
Because there are no semantic relationships across the data points, the model is forced to memorize each fact in strict isolation. This highly localized memorization likely makes the unlearning process structurally easier and more thorough. Furthermore, because this memorization is isolated and lacks shared latent heuristics among the data points, performing RTT cannot propagate updates that would affect or recover the evaluation on set V.
That is plausible, though the facts in the other datasets also seem pretty uncorrelated to me (e.g. the years dataset is about the exact year, which seems quite localized). If this were the explanation, I would expect the RTT efficiency to look like RB < years < WMDP < MMLU rather than RB < years ~ WMDP ~ MMLU.
In contrast, evidence for fine-tuning being shallow and easy to revert has become stronger over time; see, for example, this post, which shows you could probably unlearn birthdays by doing unrelated fine-tuning.