One setting that might be useful to study is the one in Grebe et al., which I just saw at the ICML MUGen workshop. The idea is to insert a backdoor that points to the forget set; they study diffusion models but it should be applicable to LLMs too. It would be neat if UNDO or some variant can be shown to be robust to this—I think it would depend on how much noising is needed to remove backdoors, which I’m not familiar with.
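To make the noising question concrete, here's a toy sketch (not the actual Grebe et al. attack and not the real UNDO pipeline, just a sanity check of the intuition): plant a trivial trigger-to-label backdoor in a small MLP on synthetic data, then sweep Gaussian weight-noise scales and see roughly where the trigger stops firing relative to where clean accuracy collapses. All names, sizes, and noise scales below are placeholder choices.

```python
# Toy backdoor-vs-weight-noise sweep (illustrative only).
import torch
import torch.nn as nn

torch.manual_seed(0)

D, N = 20, 4000
X = torch.randn(N, D)
y = (X.sum(dim=1) > 0).long()               # clean task: sign of the feature sum

# Implant a backdoor: a fixed pattern on the first 4 dims forces label 1.
# The pattern sums to 0, so it shouldn't change the clean label on its own.
trigger = torch.tensor([3.0, -3.0, 3.0, -3.0])
poison = torch.rand(N) < 0.1
X[poison, :4] = trigger
y[poison] = 1

model = nn.Sequential(nn.Linear(D, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(300):
    opt.zero_grad()
    nn.functional.cross_entropy(model(X), y).backward()
    opt.step()

# Held-out clean data, plus triggered copies of the same inputs.
Xc = torch.randn(1000, D)
yc = (Xc.sum(dim=1) > 0).long()
Xt = Xc.clone()
Xt[:, :4] = trigger                          # backdoor should push these to label 1

def acc(m, X, y):
    with torch.no_grad():
        return (m(X).argmax(dim=1) == y).float().mean().item()

# Sweep Gaussian noise on the weights and track clean accuracy vs. how often
# the trigger still elicits the backdoored label.
base = {k: v.clone() for k, v in model.state_dict().items()}
for sigma in [0.0, 0.01, 0.03, 0.1, 0.3]:
    noisy = {k: v + sigma * torch.randn_like(v) for k, v in base.items()}
    model.load_state_dict(noisy)
    print(f"sigma={sigma:<5} clean_acc={acc(model, Xc, yc):.2f} "
          f"trigger_success={acc(model, Xt, torch.ones(1000, dtype=torch.long)):.2f}")
```

Of course the interesting question is whether the noise scale that kills the trigger is small enough that the distillation step can recover general capability, which this toy doesn't speak to.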
Thank you for this suggestion. I read the paper you mentioned. The authors note: “The novelty of our threat is that the adversary chooses a set of target concepts they aim to preserve despite subsequent erasure.” How realistic is this assumption, given a setup where presumably the model provider chooses the unlearning method (and the set of target concepts to be erased) and the public only has access to the resulting model? Does it stem from concerns about an insider threat?
I don’t think it’s especially realistic, but it seems like a good setting to test unlearning against for the sake of diversity and a broader understanding of what we can do with AI safety techniques.
Super interesting! Thanks for sharing the paper.