It would be great to add a control training run alongside these results (e.g., the same training process but using random answers to the questions instead of answers produced by the teacher), to see how much of the difference is caused by finetuning itself rather than by subliminal learning (e.g., removing refusals to express preferences, HHH biases, etc.).
As an additional reference point: evaluating base models (pretrained only) would also be interesting.
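To make the control concrete, here's a minimal sketch of how such a control dataset could be constructed. This is hypothetical code, not from the original experiments: it assumes examples are dicts with "prompt" and "completion" keys and that completions are comma-separated number lists (as in the number-sequence setup); the sampler would need adjusting for other answer formats.

```python
import random

def make_control_dataset(teacher_dataset, seed=0):
    """Replace each teacher answer with a random answer of the same shape.

    Hypothetical helper: assumes each example is a dict with "prompt" and
    "completion" keys, where completions are comma-separated number lists.
    The control run finetunes on these random answers instead of the
    teacher's, isolating the effect of finetuning itself.
    """
    rng = random.Random(seed)
    control = []
    for ex in teacher_dataset:
        # Keep the same number of answer tokens, but sample them randomly
        n = len(ex["completion"].split(","))
        random_answer = ", ".join(str(rng.randint(0, 999)) for _ in range(n))
        control.append({"prompt": ex["prompt"], "completion": random_answer})
    return control
```

Training on this control set with the same hyperparameters would then show how much of the student's shift survives when the teacher's signal is removed.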
Hmm, good point. We ran this at some point and the scores didn't change. But it's worth doing properly! Will report back in a few days.