One setting that might be useful to study is the one in Grebe et al., which I just saw at the ICML MUGen workshop. The idea is to insert a backdoor that points to the forget set; they study diffusion models but it should be applicable to LLMs too. It would be neat if UNDO or some variant can be shown to be robust to this—I think it would depend on how much noising is needed to remove backdoors, which I’m not familiar with.
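To make the noising question concrete, here's a toy sketch (not the actual Grebe et al. attack and not the real UNDO pipeline, just a sanity check of the intuition): plant a trivial trigger-to-label backdoor in a small MLP on synthetic data, then sweep Gaussian weight-noise scales and see roughly where the trigger stops firing relative to where clean accuracy collapses. All names, sizes, and noise scales below are placeholder choices.

```python
# Toy backdoor-vs-weight-noise sweep (illustrative only).
import torch
import torch.nn as nn

torch.manual_seed(0)

D, N = 20, 4000
X = torch.randn(N, D)
y = (X.sum(dim=1) > 0).long()               # clean task: sign of the feature sum

# Implant a backdoor: a fixed pattern on the first 4 dims forces label 1.
# The pattern sums to 0, so it shouldn't change the clean label on its own.
trigger = torch.tensor([3.0, -3.0, 3.0, -3.0])
poison = torch.rand(N) < 0.1
X[poison, :4] = trigger
y[poison] = 1

model = nn.Sequential(nn.Linear(D, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(300):
    opt.zero_grad()
    nn.functional.cross_entropy(model(X), y).backward()
    opt.step()

# Held-out clean data, plus triggered copies of the same inputs.
Xc = torch.randn(1000, D)
yc = (Xc.sum(dim=1) > 0).long()
Xt = Xc.clone()
Xt[:, :4] = trigger                          # backdoor should push these to label 1

def acc(m, X, y):
    with torch.no_grad():
        return (m(X).argmax(dim=1) == y).float().mean().item()

# Sweep Gaussian noise on the weights and track clean accuracy vs. how often
# the trigger still elicits the backdoored label.
base = {k: v.clone() for k, v in model.state_dict().items()}
for sigma in [0.0, 0.01, 0.03, 0.1, 0.3]:
    noisy = {k: v + sigma * torch.randn_like(v) for k, v in base.items()}
    model.load_state_dict(noisy)
    print(f"sigma={sigma:<5} clean_acc={acc(model, Xc, yc):.2f} "
          f"trigger_success={acc(model, Xt, torch.ones(1000, dtype=torch.long)):.2f}")
```

Of course the interesting question is whether the noise scale that kills the trigger is small enough that the distillation step can recover general capability, which this toy doesn't speak to.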
Thank you for this suggestion. I read the paper you mentioned. The authors note: “The novelty of our threat is that the adversary chooses a set of target concepts they aim to preserve despite subsequent erasure.” How realistic is this assumption, given a setup where presumably the model provider chooses the unlearning method (and the set of target concepts to be erased) and the public only has access to the resulting model? Does it stem from concerns about an insider threat?
I don’t think it’s especially realistic, but it seems like a good setting to test unlearning against for the sake of diversity and a broader understanding of what we can do with AI safety techniques.
Super interesting! Thanks for sharing the paper.