That is plausible, though the facts in the other datasets seem pretty uncorrelated to me (e.g. the years dataset is about the exact year, which seems quite localized). If this were the explanation, I would expect the RTT efficiency to look like RB < years < WMDP < MMLU rather than RB < years ~ WMDP ~ MMLU.
In contrast, the evidence that fine-tuning is shallow and easy to revert has grown stronger over time. See for example this post, which shows you could probably unlearn birthdays by doing unrelated fine-tuning.