EDIT: I think I misunderstood your original point—were you saying to just label all of the data using a classifier trained on just 1% of the pretraining data? (Neither of your schemes say what to do after step 3.)
Oops, I was less clear than I thought.
I am imagining schemes of the form:
1. You create a small set of data labeled "harmful / not harmful".
2. You use it to train your filter / unlearning model. This labeled set is small, so it is cheap to iterate on.
3. You do distillation on pretraining tokens, either (see the sketch after this list):
   - on something like 0 if filter(x)=harmful else logprobs(regular base model) (this is regular data filtering + distillation), or
   - on logprobs(unlearned model) (this is what you are suggesting). I claim this has roughly the same effect as distilling on noise if implicit_unlearning_filter(x)=harmful else logprobs(regular base model), because I would guess that is roughly what the logprobs of unlearned models look like.
4. This produces a base model that does not have the harmful knowledge, which you then use for your regular post-training pipeline and deployment.
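To make step 3 concrete, here is a minimal sketch of the two distillation targets. This is illustrative pseudocode under stated assumptions, not anyone's actual pipeline: `filter_fn`, `base_model`, `unlearned_model`, the scheme names, and the HF-style `.logits` convention are all placeholders I introduce, and the "something like 0 / noise" target for harmful sequences is implemented here as a uniform distribution.

```python
import math
import torch
import torch.nn.functional as F

def distillation_targets(batch_tokens, base_model, unlearned_model, filter_fn, scheme):
    """Teacher log-probs for one batch of pretraining tokens under the two schemes above."""
    with torch.no_grad():
        if scheme == "filter_then_distill":
            # Scheme (a): regular data filtering + distillation.
            # Harmful sequences get an uninformative (uniform / "noise") target,
            # everything else gets the regular base model's log-probs.
            base_logprobs = F.log_softmax(base_model(batch_tokens).logits, dim=-1)  # [B, T, V]
            vocab_size = base_logprobs.size(-1)
            uniform = torch.full_like(base_logprobs, -math.log(vocab_size))
            harmful = filter_fn(batch_tokens)  # bool tensor of shape [B]
            return torch.where(harmful[:, None, None], uniform, base_logprobs)
        elif scheme == "distill_unlearned":
            # Scheme (b): distill directly on the unlearned model's log-probs.
            # The claim above is that this behaves roughly like scheme (a), with
            # an implicit filter baked into the unlearned model.
            return F.log_softmax(unlearned_model(batch_tokens).logits, dim=-1)
        raise ValueError(f"unknown scheme: {scheme}")

def distill_step(student, batch_tokens, targets, optimizer):
    # Standard forward-KL distillation step on pretraining tokens.
    student_logprobs = F.log_softmax(student(batch_tokens).logits, dim=-1)
    loss = F.kl_div(student_logprobs, targets, log_target=True, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Either way, the expensive part is the distillation pass over pretraining tokens; the filter / unlearning model itself is the small, cheap-to-iterate component.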
Why do you claim that no one is interested in this? Lots of labs do data filtering, which is known to be effective but quite costly to iterate on.
I think using UNDO at p=50% of full retraining compute is not much cheaper than regular distillation (on an unlearned / filtered model), adds a lot of risk to a potentially very expensive operation, and offers less of a robustness benefit than full retraining. But maybe I am wrong here; I expressed too much confidence. (I also think it doesn't really matter: my guess is that future work will find much stronger positive results in this part of the space and push the Pareto frontier beyond UNDO.)
[edit] Actually, I may have missed this part: I did not take into account that an UNDO(10%) run could be a great de-risking strategy for a full distillation run, which makes UNDO(10%) much more relevant than I thought. Good point.