Suppose you monitor the CoT and the model is rederiving something it’s not supposed to know. You could halt that particular CoT, but over repeated episodes you’re selecting for instances in which the model manages to escape notice, whether through luck or through steganography. See “Nearest Unblocked Strategy”.
That’s why I think CoT monitoring alone is insufficient, and people should be thinking about additional (runtime?) defenses. Curious to hear if you have any ideas!
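To make the halting mechanism concrete, here’s a minimal sketch of the kind of runtime monitor I have in mind. Everything in it is hypothetical: `generate_cot_step` stands in for a real model call, and the keyword check stands in for a real “forbidden rederivation” classifier. The point is just the control flow — the monitor can halt one trajectory, but nothing about halting, by itself, prevents the selection effect described above.

```python
# Hypothetical sketch of a runtime CoT monitor; not a real system.
from typing import Callable, Iterator

# Placeholder patterns standing in for a trained classifier.
FORBIDDEN_MARKERS = ["synthesis route", "exploit payload"]


def flags_forbidden(cot_step: str) -> bool:
    """Stand-in monitor: flag a reasoning step containing a banned pattern."""
    return any(marker in cot_step.lower() for marker in FORBIDDEN_MARKERS)


def monitored_rollout(
    generate_cot_step: Callable[[list[str]], str],
    max_steps: int = 32,
) -> Iterator[str]:
    """Yield CoT steps until the monitor fires, then halt that trajectory."""
    history: list[str] = []
    for _ in range(max_steps):
        step = generate_cot_step(history)
        if flags_forbidden(step):
            # Halting stops this one chain of thought; it does not, on its
            # own, feed back into the model's weights.
            break
        history.append(step)
        yield step


if __name__ == "__main__":
    # Toy generator whose second step trips the monitor.
    scripted = iter([
        "Let me think about the user's chemistry question.",
        "One option is to look up the synthesis route directly.",
        "Fallback: answer at a high level instead.",
    ])
    for s in monitored_rollout(lambda hist: next(scripted)):
        print("allowed step:", s)
```

The worry is about what happens around this loop: if flagged trajectories are discarded, resampled, or otherwise disadvantaged in any process that shapes the model, the surviving trajectories are exactly the ones the monitor didn’t catch.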
I appreciate the thoughts here. But it’s not clear to me how halting a particular CoT would create evolutionary pressure on the model, unless we’re using the halting decision as an optimization signal.