An idea for removing knowledge from models
Suppose we have a model with parameters x, and we want to destroy a capability (doing well on a loss function f) so completely that fine-tuning can't recover it. Fine-tuning would use the gradients ∇f(x), so what if we fine-tune the model while doing gradient descent on the norm of the gradient ∥∇f(x)∥, or on its directional derivative ∇f(x)⋅v, where v = copy(∇f(x), requires_grad=False)? Then maybe, if we add the accumulated parameter update back onto a fresh copy of the original weights, the new copy of the model won't have useful gradients to fine-tune on.
This is a simple enough idea that it’s probably somewhere in the literature, but I don’t know where to search; maybe it’s been done in an adversarial training context?
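A minimal PyTorch sketch of the gradient-norm variant of the idea, assuming a standard model and some loss for the targeted capability; the names here (capability_loss, batch, etc.) are illustrative placeholders, not from any existing codebase:

```python
import torch

def anti_finetuning_step(model, batch, capability_loss, optimizer):
    """One step of gradient descent on ||grad f(x)||^2 for the capability loss f,
    i.e. trying to flatten the capability loss surface so that later fine-tuning
    gets no useful gradient signal."""
    optimizer.zero_grad()
    params = [p for p in model.parameters() if p.requires_grad]
    loss = capability_loss(model, batch)                    # f(x)
    # First backward pass, keeping the graph so the gradient is itself differentiable.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    grad_norm_sq = sum((g ** 2).sum() for g in grads)       # ||grad f(x)||^2
    # Directional-derivative variant from the post: detach a copy v of the gradient
    # and minimize grad f(x) . v instead of the squared norm.
    # v = [g.detach() for g in grads]
    # grad_norm_sq = sum((g * vi).sum() for g, vi in zip(grads, v))
    grad_norm_sq.backward()                                  # second backward pass
    optimizer.step()
    return grad_norm_sq.item()
```

Running this loop for a while and taking the difference between the final and initial parameters would give the "accumulated parameter vector" mentioned above, which could then be added to a fresh copy of the original model.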
You are looking for “Fast Gradient Sign Method”
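For reference, FGSM comes from the adversarial-examples literature and takes a single step along the sign of the gradient with respect to the input rather than the parameters; a minimal sketch, with all names illustrative:

```python
import torch

def fgsm_example(model, x, y, loss_fn, epsilon=0.03):
    """Single-step FGSM: perturb the input x along the sign of its gradient."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    loss.backward()
    return (x + epsilon * x.grad.sign()).detach()
```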