Fabien Roger comments on Distillation Robustifies Unlearning

Fabien Roger 20 Jun 2025 14:01 UTC
LW: 3 AF: 3
> With unlearning, you can iteratively refine until you achieve the desired behavior, then distill.
How? Because current unlearning is shallow, you don’t know if it produces noise on the relevant pretraining tokens. So my guess is that iterating against unlearning by only checking whether the model helps you build bioweapons is worse than also checking whether it produces noise on relevant pretraining tokens, and the latter gives roughly as much signal as iterating against a classifier by looking at whether it detects bioweapons advice and relevant pretraining tokens. This might be wrong if my conjecture explained in Addie’s comment is wrong though.
> Unlearn + distill requires significantly less labeled data than data filtering
I think you missed the point here. My suggested scheme is 1. label a small amount of data 2. train a classifier 3. apply the classifier to know if you should skip a token / make the target logprobs be noise or use the original logprobs. This is spiritually the same as 1. label a small amount of data 2. use that for unlearning 3. apply the unlearned model to know if the target logprobs should be noise or sth close to the original logprobs.
> Our robustness metric undersells the method’s performance
I agree with that, and I’d bet that UNDO likely increases jailbreak robustness even in the 1%-of-pretrain-compute regime. But you did not run experiments that show the value of UNDO in the 1%-of-pretrain-compute regime, right?
> Separately, while 30% compute for 50% robustness (compared to data filtering) isn’t cheap, this tradeoff didn’t exist before. The value add of UNDO over Unlearn-and-Distill is that it provides a tunable compute/robustness knob between the conventional unlearning and full reinitialization/data filtering
Fair, I also agree. This seems to be a part of the option space that nobody is interested in, but it’s still scientifically interesting. But I think it’s noteworthy that the results are so negative: if I had been asked to predict the results of UNDO, I would have predicted much stronger results.
- TurnTrout 23 Jun 2025 16:48 UTC
  LW: 3 AF: 3
  > I think you missed the point here. My suggested scheme is 1. label a small amount of data 2. train a classifier 3. apply the classifier to know if you should skip a token / make the target logprobs be noise or use the original logprobs. This is spiritually the same as 1. label a small amount of data 2. use that for unlearning 3. apply the unlearned model to know if the target logprobs should be noise or sth close to the original logprobs.
  EDIT: I think I misunderstood your original point. Were you saying to just label all of the data using a classifier trained on just 1% of the pretraining data? (Neither of your schemes says what to do after step 3.)
  > UNDO over Unlearn-and-Distill is that it provides a tunable compute/robustness knob between the conventional unlearning and full reinitialization/data filtering
  > This seems to be a part of the option space that nobody is interested in, but it’s still scientifically interesting.
  Why do you claim that no one is interested in this? Lots of labs do data filtering, which is known to be effective but quite costly to iterate on.
  - Fabien Roger 24 Jun 2025 15:43 UTC
    LW: 2 AF: 2
    > EDIT: I think I misunderstood your original point. Were you saying to just label all of the data using a classifier trained on just 1% of the pretraining data? (Neither of your schemes says what to do after step 3.)
    Oops I was more unclear than I thought.
    I am imagining schemes of the form:
    1. you create a small set of data labeled “harmful / not harmful”
    2. you use it to train your filter / unlearning model. That is small and it’s cheap to iterate on it.
    3. you do distillation on pretraining tokens, either
       i. on sth like 0 if filter(x)=harmful else logprobs(regular base model) (this is regular data filtering + distillation)
       ii. on logprobs(unlearned model) (this is what you are suggesting)
       (and I claim ii has roughly the same effect as i, or as distilling on noise if implicit_unlearning_filter(x)=harmful else logprobs(regular base model), because I would guess this is roughly what the logprobs of unlearned models look like)
    4. (and this produces a base model that does not have the harmful knowledge, which you use for your regular post-training pipeline then deployment).
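    To make the target construction in the scheme above concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption rather than anything from the paper: the function names (`distillation_targets`, `distillation_loss`), the per-token `is_harmful` flags standing in for the output of a filter, the "noise" being random normalized logits, and the "skip" mode that just drops flagged tokens from the loss.

```python
import math
import random


def log_softmax(logits):
    """Normalize raw logits into log-probabilities (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def distillation_targets(teacher_logprobs, is_harmful, rng, mode="noise"):
    """Build per-token targets: teacher logprobs on benign tokens, and
    either random noise or None (token skipped) on flagged tokens.
    `is_harmful` is a hypothetical stand-in for the filter's decisions."""
    targets = []
    for logprobs, harmful in zip(teacher_logprobs, is_harmful):
        if not harmful:
            targets.append(list(logprobs))
        elif mode == "noise":
            noise = [rng.gauss(0.0, 1.0) for _ in logprobs]
            targets.append(log_softmax(noise))
        else:  # mode == "skip": the token contributes nothing to the loss
            targets.append(None)
    return targets


def distillation_loss(student_logprobs, targets):
    """Mean cross-entropy of the student under the target distribution,
    averaged over the tokens that were not skipped."""
    total, kept = 0.0, 0
    for student, target in zip(student_logprobs, targets):
        if target is None:
            continue
        total -= sum(math.exp(t) * s for t, s in zip(target, student))
        kept += 1
    return total / kept if kept else 0.0


# Toy example: two token positions over a 3-token vocabulary.
rng = random.Random(0)
teacher = [log_softmax([2.0, 0.0, -1.0]), log_softmax([0.5, 0.5, 3.0])]
flags = [False, True]  # pretend the filter flagged the second token
noise_targets = distillation_targets(teacher, flags, rng, mode="noise")
skip_targets = distillation_targets(teacher, flags, rng, mode="skip")
```

    In this framing, distilling on logprobs(unlearned model) would just mean passing the unlearned model's logprobs as the targets directly. The claim above is that those would look roughly like `noise_targets`.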
    > Why do you claim that no one is interested in this? Lots of labs do data filtering, which is known to be effective but quite costly to iterate on.
    I think using UNDO at p=50% of full retraining compute is not much cheaper than regular distillation (on an unlearned / filtered model), adds a lot of risk to a potentially very expensive operation, and has fewer robustness benefits than full retraining. But maybe I am wrong here; I expressed too much confidence. (I also think it doesn’t really matter: my guess is that future work will find much stronger positive results in this part of the space and push the Pareto frontier beyond UNDO.)
    > quite costly to iterate on.
    [edit] Actually, I may have missed this part. I did not take into account that an UNDO(10%) run could be a great de-risking strategy for a full distillation run, which makes UNDO(10%) much more relevant than I thought. Good point.