Thank you for this suggestion. I read the paper you mentioned. The authors note: “The novelty of our threat is that the adversary chooses a set of target concepts they aim to preserve despite subsequent erasure.” How realistic is this assumption, given a setup where model providers presumably choose the method (i.e., the set of target concepts to be erased) and the public only has access to the resulting model? Does it stem from concerns about an insider threat?
I don’t think it’s especially realistic, but it seems like a good setting to test unlearning against, for the sake of diversity and a broader understanding of what we can do with AI safety techniques.
Super interesting! Thanks for sharing the paper.