I think you missed the point here. My suggested scheme is 1. label a small amount of data 2. train a classifier 3. apply the classifier to know if you should skip a token / make the target logprobs be noise or use the original logprobs. This is spiritually the same as 1. label a small amount of data 2. use that for unlearning 3. apply the unlearned model to know if the target logprobs should be noise or sth close to the original logprobs.
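To make the analogy concrete, here is a minimal sketch of step 3 in both schemes. Everything here is illustrative: `classifier`, `base_model`, and `unlearned_model` are hypothetical PyTorch-style callables returning per-token scores/logprobs, and the threshold and noise are placeholders.

```python
import torch

def targets_with_classifier(tokens, base_model, classifier, threshold=0.5):
    # Classifier scheme: flagged tokens get noise targets, the rest keep
    # the original base-model logprobs.
    base_logprobs = base_model(tokens)          # assumed shape [seq_len, vocab_size]
    harmful = classifier(tokens) > threshold    # assumed shape [seq_len], bool per token
    noise = torch.randn_like(base_logprobs)     # stand-in for "make the target logprobs be noise"
    return torch.where(harmful.unsqueeze(-1), noise, base_logprobs)

def targets_with_unlearned_model(tokens, unlearned_model):
    # Unlearning scheme: the unlearned model's logprobs play the same role
    # implicitly: roughly noise on harmful tokens, close to the original elsewhere.
    return unlearned_model(tokens)
```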
EDIT: I think I misunderstood your original point—were you saying to just label all of the data using a classifier trained on just 1% of the pretraining data? (Neither of your schemes say what to do after step 3.)
> UNDO over Unlearn-and-Distill is that it provides a tunable compute/robustness knob between the conventional unlearning and full reinitialization/data filtering
This seems to be a part of the option space that nobody is interested in, but it’s still scientifically interesting.
Why do you claim that no one is interested in this? Lots of labs do data filtering, which is known to be effective but quite costly to iterate on.
> EDIT: I think I misunderstood your original point—were you saying to just label all of the data using a classifier trained on just 1% of the pretraining data? (Neither of your schemes say what to do after step 3.)
Oops, I was more unclear than I thought.

I am imagining schemes of the form:

1. you create a small set of data labeled “harmful / not harmful”
2. you use it to train your filter / unlearning model. That step is small, so it’s cheap to iterate on.
3. you do distillation on pretraining tokens, either
    - on something like 0 if filter(x)=harmful else logprobs(regular base model) (this is regular data filtering + distillation), or
    - on logprobs(unlearned model) (this is what you are suggesting)

(And I claim the second option has roughly the same effect as distilling on noise if implicit_unlearning_filter(x)=harmful else logprobs(regular base model), because I would guess that is roughly what the logprobs of unlearned models look like.)

(And this produces a base model that does not have the harmful knowledge, which you then use for your regular post-training pipeline and deployment. A rough sketch of the distillation step is below.)
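For concreteness, a minimal sketch of that distillation step, assuming PyTorch, a hypothetical `student` that returns logits, an `optimizer`, and `target_logprobs` built from either of the two options above; this is an illustration, not the paper's implementation.

```python
import torch.nn.functional as F

def distill_step(student, optimizer, tokens, target_logprobs):
    # One distillation update on a batch of pretraining tokens: push the student's
    # next-token distribution toward the chosen targets (filtered-base or unlearned).
    student_logprobs = F.log_softmax(student(tokens), dim=-1)
    loss = F.kl_div(student_logprobs, target_logprobs,
                    reduction="batchmean", log_target=True)  # targets assumed to be logprobs
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```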
> Why do you claim that no one is interested in this? Lots of labs do data filtering, which is known to be effective but quite costly to iterate on.
I think using UNDO at p=50% of full retraining compute is not much cheaper than regular distillation (on an unlearned / filtered model), adds a lot of risk to a potentially very expensive operation, and has less of a robustness benefit than full retraining. But maybe I am wrong here; I expressed too much confidence. (I also think it doesn’t really matter: my guess is that future work will find much stronger positive results in this part of the space and push the Pareto frontier beyond UNDO.)
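As very rough bookkeeping: the only relationship taken from the framing above is that UNDO(p) costs roughly p times full retraining compute; the relative cost of a plain distillation run is a made-up placeholder you would have to estimate yourself.

```python
def rough_costs(full_retrain_cost: float, distill_cost: float, p: float) -> dict:
    # All inputs are user-supplied estimates in the same (arbitrary) compute units.
    return {
        "full retraining": full_retrain_cost,
        "unlearn/filter + distill": distill_cost,
        f"UNDO(p={p:.0%})": p * full_retrain_cost,  # "p% of full retraining compute"
    }

# Illustrative placeholder numbers only:
print(rough_costs(full_retrain_cost=1.0, distill_cost=0.3, p=0.5))
print(rough_costs(full_retrain_cost=1.0, distill_cost=0.3, p=0.1))  # the de-risking case below
```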
> quite costly to iterate on.
[edit] Actually, I may have missed this part: I did not take into account that an UNDO(10%) run could be a great de-risking strategy for a full distillation run, which makes UNDO(10%) much more relevant than I thought. Good point.