Great paper; this makes me hopeful about unlearning being used in practice.
I wonder if UNDO would stack with circuit discovery or some other kind of interpretability. Intuitively, localizing the noise in the Noise phase to weights that disproportionately contribute to the harmful behavior should yield a better retain-forget tradeoff. It doesn’t need to be perfect, just better than random, so it should be doable with current methods.
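To make this concrete, here is a rough sketch of the kind of thing I have in mind, in PyTorch. The saliency signal (SNIP-style |grad × weight| on a forget set), the interpolation-toward-noise step, and all function and argument names are my own placeholders for illustration, not the paper's Noise phase:

```python
import torch

def targeted_noise(model, forget_loader, loss_fn, alpha=0.5, device="cpu"):
    """Hypothetical sketch: perturb each weight in proportion to how much it
    appears to contribute to forget-set behavior, using a SNIP-style
    |grad * weight| saliency. Uniform noising would use mix = alpha everywhere."""
    model.to(device).train()
    importance = {n: torch.zeros_like(p) for n, p in model.named_parameters()}

    # 1) Accumulate per-weight saliency over the forget set.
    for inputs, targets in forget_loader:
        model.zero_grad()
        loss = loss_fn(model(inputs.to(device)), targets.to(device))
        loss.backward()
        with torch.no_grad():
            for n, p in model.named_parameters():
                if p.grad is not None:
                    importance[n] += (p.grad * p).abs()

    # 2) Normalize each tensor's saliency to [0, 1].
    for n, imp in importance.items():
        importance[n] = imp / (imp.max() + 1e-12)

    # 3) Interpolate each weight toward random noise, more aggressively
    #    where saliency is high.
    with torch.no_grad():
        for n, p in model.named_parameters():
            noise = torch.randn_like(p) * p.abs().mean()  # match parameter scale
            mix = alpha * importance[n]                   # per-weight mixing coefficient
            p.mul_(1.0 - mix).add_(mix * noise)
    return importance
```

The open question is whether a cheap saliency score like this actually localizes the capability; circuit-discovery methods could presumably replace step 1 with something more principled.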
Thanks for the suggestion. Upon reflection, it seems to me that the success of targeted noising would depend on two complementary factors:
C1. Size of the unlearning target: how broad the capability is in human-understandable terms
C2. Entanglement of the unlearning target: how distributed the capability is across the model’s weights
Robust unlearning gets easier as both C1 and C2 decrease, and there is likely a threshold beyond which unlearning becomes effectively impossible as they increase. Note that C1 is a rough proxy for C2, but the two are worth considering independently.
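To make C2 slightly less hand-wavy, one crude way to operationalize it would be to ask what fraction of the weights is needed to cover most of the forget-set saliency, reusing something like the hypothetical importance scores from the sketch above. This is only an illustrative proxy, not anything from the paper:

```python
import torch

def entanglement_fraction(importance, mass=0.9):
    """Rough proxy for C2: the fraction of weights needed to account for
    `mass` of the total forget-set saliency. Values near 0 suggest a
    localized target; values near 1 suggest a highly distributed one."""
    flat = torch.cat([imp.flatten() for imp in importance.values()])
    sorted_imp, _ = torch.sort(flat, descending=True)
    cumulative = torch.cumsum(sorted_imp, dim=0)
    threshold = mass * float(cumulative[-1])
    k = int(torch.searchsorted(cumulative, threshold)) + 1
    return k / flat.numel()
```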
Rationale: Mechanistic interpretability has produced good evidence that factual recall (small C1) is often localized to specific components (small C2), making it an ideal target for selective noising. However, more general capabilities like deception would likely score high on both C1 and C2, since they require multiple intertwined sub-capabilities. For instance, deception might require simultaneously computing: (1) the true state of affairs, (2) plausible alternatives, (3) what others believe, and (4) how to optimize output to manipulate those beliefs.
Looking Forward: Could targeted UNDO help disentangle general intelligence from potentially harmful capabilities that seem deeply intertwined during training? For example, selectively removing deception while preserving general intelligence would be a significant win. The challenge is that many harmful capabilities may be implemented by a superset of the same linear features that benign capabilities rely on.