Good point about GradDiff ~ RL. Though it feels more like a weird rebranding, since RL is the obvious way to present the algorithm, and “unlearning” feels like a very misleading way of saying “we train the model to do less of X”.
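To spell out what I mean by GradDiff ~ RL with negative updates, here is a minimal sketch (not any paper’s exact implementation; it assumes an HF-style model whose forward pass returns a .loss, and all the names are placeholders):

```python
def graddiff_step(model, retain_batch, forget_batch, optimizer, forget_coeff=1.0):
    """One GradDiff-style update: gradient descent on retain data, ascent on forget data.

    Framed as RL, this is just a positive update on 'good' samples and a
    negative update on 'bad' ones.
    """
    optimizer.zero_grad()
    retain_loss = model(**retain_batch).loss         # standard LM loss on data we want to keep
    forget_loss = model(**forget_batch).loss         # LM loss on the behavior we want less of
    loss = retain_loss - forget_coeff * forget_loss  # minus sign = push probability of X down
    loss.backward()
    optimizer.step()
    return retain_loss.item(), forget_loss.item()
```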
If you have environments where evil is easy to notice, you can:
1. Train on it first, hoping it prevents exploration; but the trained-in behavior risks being eroded by random stuff (and the model maybe learns the conditional policy).
2. Train on it during RL, hoping it prevents exploration without being eroded by random stuff; but it risks learning the conditional policy. This is also the option that makes the most sense if you are afraid of eroding capabilities.
3. Train on it after, hoping it generalizes to removing subtle evil, at the risk of not generalizing in the way you intended.
I think all 3 are fine-ish. You can try to use “unlearning” to improve 1, but it’s unclear if that helps.
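Roughly, here is what I mean by the three placements, as a hypothetical skeleton (the function bodies are stand-ins, not real training code; it’s only to make the before/during/after distinction concrete):

```python
def train_against_evil(policy, easy_evil_data):
    """Placeholder: negative updates / unlearning on the easy-to-notice evil."""
    ...

def rl_step(policy, env):
    """Placeholder: one ordinary RL update (e.g. PPO) in the environment."""
    ...

def run(schedule, policy, env, easy_evil_data, n_rl_steps=1000):
    if schedule == "before":                            # option 1: hope it blocks exploration,
        train_against_evil(policy, easy_evil_data)      # but later RL may erode it
    for _ in range(n_rl_steps):
        rl_step(policy, env)
        if schedule == "during":                        # option 2: not eroded by the RL itself,
            train_against_evil(policy, easy_evil_data)  # but may teach a conditional policy
    if schedule == "after":                             # option 3: hope it generalizes to the
        train_against_evil(policy, easy_evil_data)      # subtle evil picked up during RL
```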
I am interested in “anti-erosion training” (methods for training a model to have a behavior such that training on random other stuff, on different prompts, does not erode the original behavior). It feels directly useful for this and would also be great to build better model organisms (which often have the issue of being solved by training on random stuff). Are you planning on doing any work on this?
Ah, yeah, maybe calling it “unlearning” would mislead people. So I’d say unlearning and negative RL updates need to be more selective ;)
I like your breakdown into these 3 options. Would be good to test in which cases a conditional policy arises, by designing an environment with easy-to-check evilness and hard-but-possible-to-check evilness. (But I’d say it’s out-of-scope for my current project.)
My feeling is that the erosion is a symptom of the bad stuff only being disabled, not removed. (If it were truly removed, it would be really unlikely to just reappear randomly.) And I expect that to get anti-erosion we’ll need methods similar to those used for robustness to FT attacks. So far I’ve just been doing adversarial attacks, but I could throw in some FT on unrelated stuff and see what happens.
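(The check I have in mind is roughly the sketch below; measure_bad_behavior and the data are placeholders, e.g. the rate of threats on a fixed prompt set and some ordinary pretraining-style batches. It assumes an HF-style model whose forward pass returns a .loss.)

```python
from copy import deepcopy

def erosion_check(unlearned_model, unrelated_batches, measure_bad_behavior,
                  make_optimizer, eval_every=20):
    """Fine-tune on unrelated data and watch whether the removed behavior comes back."""
    model = deepcopy(unlearned_model)                 # don't touch the original weights
    optimizer = make_optimizer(model.parameters())
    scores = [measure_bad_behavior(model)]            # score before any further training
    for step, batch in enumerate(unrelated_batches, start=1):
        optimizer.zero_grad()
        model(**batch).loss.backward()                # plain LM loss on unrelated data
        optimizer.step()
        if step % eval_every == 0:
            scores.append(measure_bad_behavior(model))
    return scores                                     # a rising curve here = erosion
```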
Two days ago I tried applying that selectivity technique to removing a tendency to make threats. It looks quite good so far. (The baseline is pink; it quickly disrupts wikitext loss.)
It still yields to adversarial FT (shown below), but the slope seems a bit more resistant than the baseline’s (blue here). Of course, it needs more results. Maybe looking at erosion from training on random stuff would also be interesting here.
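(For reference, the wikitext-loss probe is roughly the standard thing below; the split, slice size, and block length are arbitrary choices on my end.)

```python
import torch
from datasets import load_dataset

@torch.no_grad()
def wikitext_loss(model, tokenizer, max_chars=50_000, block=512):
    """Mean LM loss on a slice of wikitext-2 test, as a cheap 'did we break the model' probe."""
    text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
    ids = tokenizer(text[:max_chars], return_tensors="pt").input_ids.to(model.device)
    losses = []
    for start in range(0, ids.shape[1] - block, block):
        chunk = ids[:, start : start + block]
        losses.append(model(chunk, labels=chunk).loss.item())
    return sum(losses) / len(losses)
```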
“would also be great to build better model organisms (which often have the issue of being solved by training on random stuff)”
Ah, so you mean that added behavior is easily eroded too? (Or do you mean model organisms where something is removed?) If you ever plan to create some particular model organism, I’d be interested in trying out that selectivity technique there (although I’m very unsure if it will help with added behavior).