An idea for removing knowledge from models
Suppose we have a model with parameters x, and we want to destroy a capability (doing well on a loss function f) so completely that fine-tuning can't recover it. Fine-tuning would use the gradients ∇f(x), so what if we fine-tune the model while doing gradient descent on the norm of the gradient ∥∇f(x)∥, or on its directional derivative ∇f(x)⋅v, where v = copy(∇f(x), requires_grad=False)? Then maybe, if we add the accumulated parameter update back onto a fresh copy of the original weights, the new copy of the model won't have useful gradients to fine-tune on.
This is a simple enough idea that it’s probably somewhere in the literature, but I don’t know where to search; maybe it’s been done in an adversarial training context?
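A minimal PyTorch sketch of the gradient-norm variant of the idea, assuming a standard model and some loss for the targeted capability; the names here (capability_loss, batch, etc.) are illustrative placeholders, not from any existing codebase:

```python
import torch

def anti_finetuning_step(model, batch, capability_loss, optimizer):
    """One step of gradient descent on ||grad f(x)||^2 for the capability loss f,
    i.e. trying to flatten the capability loss surface so that later fine-tuning
    gets no useful gradient signal."""
    optimizer.zero_grad()
    params = [p for p in model.parameters() if p.requires_grad]
    loss = capability_loss(model, batch)                    # f(x)
    # First backward pass, keeping the graph so the gradient is itself differentiable.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    grad_norm_sq = sum((g ** 2).sum() for g in grads)       # ||grad f(x)||^2
    # Directional-derivative variant from the post: detach a copy v of the gradient
    # and minimize grad f(x) . v instead of the squared norm.
    # v = [g.detach() for g in grads]
    # grad_norm_sq = sum((g * vi).sum() for g, vi in zip(grads, v))
    grad_norm_sq.backward()                                  # second backward pass
    optimizer.step()
    return grad_norm_sq.item()
```

Running this loop for a while and taking the difference between the final and initial parameters would give the "accumulated parameter vector" mentioned above, which could then be added to a fresh copy of the original model.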
You are looking for “Fast Gradient Sign Method”
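For reference, FGSM comes from the adversarial-examples literature and takes a single step along the sign of the gradient with respect to the input rather than the parameters; a minimal sketch, with all names illustrative:

```python
import torch

def fgsm_example(model, x, y, loss_fn, epsilon=0.03):
    """Single-step FGSM: perturb the input x along the sign of its gradient."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    loss.backward()
    return (x + epsilon * x.grad.sign()).detach()
```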