Yeah, especially if this becomes a standard part of big company toolboxes, it feels like it might noticeably (~1%?) reduce overall AI risks. It gives companies more fine-grained, cheap control over which skills a model has vs. lacks.
Idea: Use this to make more faithful CoT models:
--Take your best model and scrape all its CoTs, including with tool calls etc.
--Filter out the ones that seem like they may have been unfaithful, as judged by e.g. activations for deception or whatnot.
--Maybe also filter out any discussion of CoT monitoring and CoT lol
--Distill.
--Take the new model and see if it has better faithfulness properties, e.g. whether it is harder to fine-tune to fool monitoring systems. Hopefully the answer is yes.
--Then maybe do another technique, where you train a smaller faster model to do more steps of reasoning to get to the same result. Like, take the CoT of your previous model and divide it into N chunks, where each chunk is like a sentence or so. Then train a new, smaller model to take chunk 1 and do lots of CoT and eventually reach chunk 2, and then to take chunk 1+CoT+chunk2 and do lots of CoT to eventually reach chunk 3, and so on. So that basically you have a model that tends to do more of its thinking in CoT, but has a similar capability and propensity profile.
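Roughly, that last step could look like the following data construction (a sketch of one possible implementation; the sentence-level chunking, the prompt format, and where the longer "expansion" CoT comes from are my assumptions, not part of the original idea):

```python
# Rough sketch of the chunk-expansion data construction.
import re

def make_expansion_examples(question, teacher_cot, expansions):
    """Build (prompt, target) pairs for the smaller student.

    chunks[i] are ~sentence-sized pieces of the teacher's CoT; expansions[i] is the
    longer reasoning we want the student to produce before landing on chunk i+1
    (e.g. sampled from the student and kept only when it reaches the right chunk).
    Requires len(expansions) == len(chunks) - 1.
    """
    chunks = re.split(r"(?<=[.!?])\s+", teacher_cot.strip())
    examples = []
    context = question + "\n" + chunks[0]
    for i in range(1, len(chunks)):
        target = expansions[i - 1] + "\n" + chunks[i]  # lots of CoT, then the next chunk
        examples.append({"prompt": context, "target": target})
        context = context + "\n" + target  # roll the accepted expansion into the context
    return examples
```

You'd then fine-tune the smaller model on these (prompt, target) pairs, so that bridging each chunk forces more of the thinking into visible CoT.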
I think we can go further than this with distillation. One question I have is this: if you distil from a model which is already ‘aligned’, do you get an ‘aligned’ model out of it?
Can you use this to transfer ‘alignment’ from a smaller teacher to a larger student, then do some RL to bring the larger model up in performance? This would get around the problem we currently have, where labs have to first make a smart unaligned model and then try to wrestle it into shape.
I think it depends on what kind of ‘alignment’ you’re referring to. Insofar as alignment is a behavioral property (not saying bad things to users, not being easily jailbreakable), I think our results weakly suggest that this kind of alignment would transfer and perhaps even become more robust.
One hypothesis is that pretrained models learn many ‘personas’ (including ‘misaligned’ ones) and post-training shapes/selects a desired persona. Maybe distilling the post-trained model would only, or primarily, transfer the selected persona and not the other ones. I don’t think we can draw conclusions yet, but it sounds like an interesting idea for further work! Though it would be expensive to distill a large post-trained model, it could be more tractable to find an open-source one and evaluate various alignment properties compared to the teacher.
However, for more intrinsic alignment properties (is the model scheming, does the model have a misaligned goal), it’s less clear how they might develop in the first place. I’m not sure whether distillation would reliably transfer these properties.
Also importantly, I would be concerned that misalignment could emerge during the RL process or any further training.
Would you actively unlearn on those CoTs? Or just filter from distillation data?
idk, haven’t thought about it, you’d know better than me
My guess is that actively unlearning on the unfaithful CoTs, or fine-tuning to make CoTs more faithful, and then distilling would be more effective.
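For concreteness, one common recipe for "unlearning on the unfaithful CoTs" before distilling is gradient ascent on the forget set with a retain-set anchor (a minimal sketch, not anyone's actual setup; it assumes a HuggingFace-style causal LM that returns `.loss` when given labels):

```python
# Minimal sketch: gradient ascent on unfaithful CoTs, anchored by a retain batch.
import torch

def unlearning_step(model, forget_batch, retain_batch, optimizer, alpha=1.0):
    """One update that pushes the teacher away from unfaithful-CoT tokens
    while keeping loss low on ordinary (retain) data."""
    forget_loss = model(input_ids=forget_batch, labels=forget_batch).loss
    retain_loss = model(input_ids=retain_batch, labels=retain_batch).loss
    loss = -forget_loss + alpha * retain_loss  # ascend on forget, descend on retain
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return forget_loss.item(), retain_loss.item()
```

After enough of these steps (or any other faithfulness fine-tune), the modified teacher's predictions become the distillation target.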
In a preliminary experiment, we found that even when we filtered out forget data from the distillation dataset, the student performed well on the forget data. This was in a setting where the contexts for the forget and retain data were likely quite similar, so the teacher’s predictions on the retain data were enough for the student to recover the behavior on the forget data.
With distillation, the student model learns from the teacher’s predictions, not the data itself, so to change the learning target, changing the teacher’s predictions rather than the distillation data seems more effective. In that way, data filtering + pretraining is analogous to unlearning + distillation. That said, we did the same experiment in the language setting, where the context was extremely different between the forget and retain sets, and data filtering + distilling was effective.
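To make the "learning target" point concrete, a standard token-level distillation loss looks roughly like this (a generic sketch, not our exact training code; it assumes HuggingFace-style models that return `.logits`):

```python
# Generic forward-KL distillation loss: the student matches the teacher's
# next-token distribution at every position of the (possibly filtered) dataset.
import torch
import torch.nn.functional as F

def distill_loss(student, teacher, input_ids, temperature=1.0):
    with torch.no_grad():
        teacher_logp = F.log_softmax(teacher(input_ids).logits / temperature, dim=-1)
    student_logp = F.log_softmax(student(input_ids).logits / temperature, dim=-1)
    # log_target=True because both arguments are log-probabilities
    return F.kl_div(student_logp, teacher_logp, log_target=True, reduction="batchmean")
```

The target at every position is the teacher's distribution, so filtering documents out of the dataset only changes which contexts the teacher is queried on; if the retained contexts are similar enough, the teacher's predictions there can still carry the forget behavior.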
I would predict that the effectiveness of data filtering + distillation varies with how similar the context is between the forget and retain data. In the case of CoT faithfulness, I think the most salient question about context similarity is: how stochastic are models on faithfulness, given an input? I.e., on faithful CoT completions, do they still sometimes give significant weight in the logits to unfaithful completions?
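One way to probe that question (a sketch; it assumes you already have faithful and unfaithful continuations for the same prompt, labelled by some monitor, and that tokenizing prompt and prompt+continuation separately doesn't shift token boundaries much):

```python
# Sketch: how much log-probability does the model put on a given continuation?
import torch
import torch.nn.functional as F

def continuation_logprob(model, tokenizer, prompt, continuation):
    """Total log-probability the model assigns to `continuation` given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    logp = F.log_softmax(logits[:, :-1], dim=-1)     # position i predicts token i+1
    targets = full_ids[:, 1:]
    start = prompt_ids.shape[1] - 1                   # first predicted continuation token
    token_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_logp[:, start:].sum().item()
```

If the unfaithful continuation's log-probability is regularly within a few nats of the faithful one's, that would suggest the contexts are "similar" in the sense that made data filtering alone insufficient in the setting above.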
In any case, fine-tuning for faithful CoT and then distilling seems promising. I’m curious if the first step of fine-tuning for CoT faithfulness has been explored. Even without the distillation step, it could be useful.