My guess is that actively unlearning on the unfaithful CoTs, or fine-tuning to make CoTs more faithful, and then distilling, would be more effective.
In a preliminary experiment, we found that even when we filtered the forget data out of the distillation dataset, the student still performed well on the forget data. This was in a setting where the contexts of the forget and retain data were likely quite similar, so the teacher's predictions on the retain data were enough for the student to recover the behavior on the forget data.
With distillation, the student model learns from the teacher's predictions, not from the data itself, so to change the learning target, it seems more effective to change the teacher's predictions rather than the distillation data. In that way, data filtering + pretraining is analogous to unlearning + distillation. That said, we ran the same experiment in the language setting, where the contexts of the forget and retain sets were extremely different, and there data filtering + distillation was effective.
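To make that concrete, here's a minimal sketch of a standard KL-based distillation step (PyTorch-style; the HF-like `.logits` access, the `teacher`/`student` objects, and the batch format are assumptions for illustration, not our actual setup). The gradient signal comes entirely from the teacher's distribution, so filtering examples out of the dataset only changes which contexts that distribution gets queried on, while unlearning or faithfulness fine-tuning changes the distribution itself:

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, batch, optimizer, temperature=2.0):
    """One KD step: the target is the teacher's distribution, not the raw data.

    Assumes `student`/`teacher` are HF-style causal LMs returning logits of shape
    (batch, seq_len, vocab); `batch["input_ids"]` comes from the (possibly
    filtered) distillation dataset.
    """
    with torch.no_grad():
        # The teacher is untouched by data filtering; filtering only decides
        # which contexts we evaluate it on.
        teacher_logits = teacher(batch["input_ids"]).logits

    student_logits = student(batch["input_ids"]).logits

    # Soft targets: KL between teacher and student next-token distributions.
    t = temperature
    loss = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```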
I would predict that the effectiveness of data filtering + distillation varies with how similar the contexts of the forget and retain data are. In the case of CoT faithfulness, I think the most salient question about context similarity is: how stochastic are models about faithfulness, given an input? I.e., on inputs where they produce faithful CoT completions, do they still sometimes put significant weight in the logits on unfaithful completions?
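One rough way to probe that, sketched here under the assumption that you have paired faithful/unfaithful completions for the same prompt and an HF-style causal LM and tokenizer (the function and its inputs are hypothetical), is to compare the sequence log-probabilities the model assigns to the two completions on the same input:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def completion_logprob(model, tokenizer, prompt, completion):
    """Total log-probability the model assigns to `completion` given `prompt`.

    Illustrative only; assumes tokenizing `prompt` gives a prefix of
    tokenizing `prompt + completion`, which is approximately true for most
    tokenizers but not guaranteed.
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids

    logits = model(full_ids).logits                      # (1, seq_len, vocab)
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)    # predicts tokens 1..end
    targets = full_ids[:, 1:]

    # Only score the completion tokens, not the prompt.
    start = prompt_ids.shape[1] - 1
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[:, start:].sum().item()
```

If log p(unfaithful completion | prompt) routinely comes out close to log p(faithful completion | prompt), then the forget and retain "contexts" overlap heavily, and I'd expect data filtering alone to do less.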
In any case, fine-tuning for faithful CoT and then distilling seems promising. I'm curious whether the first step, fine-tuning for CoT faithfulness, has been explored; even without the distillation step, it could be useful.
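For the first step on its own, the simplest version I can picture is ordinary supervised fine-tuning on traces labeled as faithful; a minimal sketch (the batch format, with prompt tokens masked to -100 in the labels, is an assumption):

```python
import torch
import torch.nn.functional as F

def faithfulness_sft_step(model, batch, optimizer):
    """One supervised step toward faithful CoT.

    Assumes `batch` holds tokenized (prompt + faithful CoT) sequences, with
    prompt tokens masked out of `labels` as -100. Stage 2 would then be the
    ordinary distillation step above, run against this fine-tuned teacher.
    """
    logits = model(batch["input_ids"]).logits
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),   # shift: predict token t+1
        batch["labels"][:, 1:].reshape(-1),
        ignore_index=-100,
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```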
idk, haven’t thought about it, you’d know better than me