A common concern is that sufficiently capable models might simply rederive anything that was unlearned, using general reasoning ability, tools, or related knowledge.
Is anyone working on “ignorance preservation” methods to achieve the equivalent of unlearning at this level of the stack, for the sake of defense-in-depth? What are possible research directions here?
You’re right that rederivation is a concern. But I think the important question is: is this primarily a model-level problem that requires changing the weights, or more of a system-level concern that should be addressed through deployment controls?
Unlearning might not stop highly capable systems from rederiving everything, but it probably makes rederivation harder, forcing them onto longer, more explicit reasoning paths. That opens up new opportunities for intervention, including CoT monitoring and other runtime defenses.
Suppose you monitor the CoT and the model is rederiving something it’s not supposed to know. You could halt that particular CoT, but over time you’re creating pressure toward instances where the model manages to escape notice, whether through luck or through steganography. See “Nearest Unblocked Strategy”.
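To make the “halt that particular CoT” step concrete, here is a minimal sketch of what such a runtime monitor might look like. Everything in it is illustrative: `looks_like_rederivation` stands in for whatever classifier would actually flag rederivation of unlearned content, and the keyword check is only a placeholder, not a real monitor.

```python
# Minimal sketch of a runtime CoT monitor (illustrative, not a real system).
# Assumes reasoning steps arrive one at a time as strings; the classifier
# below is a naive keyword stand-in for a learned monitor.

from typing import Iterable, List

RESTRICTED_TOPICS = ["topic_a", "topic_b"]  # placeholder labels for unlearned content


def looks_like_rederivation(step: str, topics: List[str]) -> bool:
    """Stand-in for a learned monitor; here just a naive keyword check."""
    return any(topic in step.lower() for topic in topics)


def monitored_generation(steps: Iterable[str]) -> List[str]:
    """Accumulate CoT steps, halting as soon as the monitor flags one.

    Halting only truncates this transcript. If the halt signal were ever fed
    back as an optimization signal, it would start selecting for CoTs that
    evade the monitor, which is the pressure discussed above.
    """
    transcript: List[str] = []
    for step in steps:
        if looks_like_rederivation(step, RESTRICTED_TOPICS):
            transcript.append("[halted: possible rederivation of unlearned content]")
            break
        transcript.append(step)
    return transcript


if __name__ == "__main__":
    fake_cot = [
        "First, recall the general principle...",
        "Combining these, we can derive topic_a as follows...",
    ]
    print(monitored_generation(fake_cot))
```

Note that in this sketch the halt only affects the single transcript; the worry above is about what happens once evading the monitor is rewarded, directly or indirectly.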
That’s why I think CoT monitoring alone is insufficient, and people should be thinking about additional (runtime?) defenses. Curious to hear if you have any ideas!
I appreciate the thoughts here. But it’s not clear to me how halting a particular CoT would create evolutionary pressure on the model, unless we’re using it as an optimization signal.