It would be great to add a control training along with these results (e.g., a similar training process, but using random answers to the questions instead of answers produced by the teacher), to see how much of the difference is caused by the finetuning itself, excluding subliminal learning (e.g., removing refusals to express preferences, HHH biases, etc.). A rough sketch of what I mean is below.
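For concreteness, a minimal sketch of what such a control dataset could look like. The data format, `make_control_dataset`, and `random_number_answer` are all assumptions for illustration, not the authors' actual pipeline; the idea is just to keep the questions and training setup identical while swapping the teacher's answers for random ones.

```python
import random

def make_control_dataset(finetune_data, random_answer_fn, seed=0):
    """Replace each teacher-generated answer with a random answer,
    keeping the questions (and the rest of the setup) unchanged.

    finetune_data: assumed to be a list of {"question": ..., "answer": ...} dicts.
    random_answer_fn: callable producing a random answer for a given question.
    """
    rng = random.Random(seed)
    return [
        {**example, "answer": random_answer_fn(example["question"], rng)}
        for example in finetune_data
    ]

def random_number_answer(question, rng):
    # One possible choice of "random answer": if the teacher's answers were
    # number sequences, sample numbers uniformly instead of asking the teacher.
    return ", ".join(str(rng.randint(0, 999)) for _ in range(10))

# Hypothetical usage with a made-up example pair.
teacher_data = [
    {"question": "Continue this sequence: 3, 7, 12", "answer": "18, 25, 33"},
]
control_data = make_control_dataset(teacher_data, random_number_answer)
```

Finetuning on `control_data` with the same hyperparameters as the teacher-distilled run would then isolate how much of the eval shift comes from finetuning on the question format alone.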
Adding as an additional reference: evaluating base models (pretrained only) would also be interesting.
Hmm, good point. We ran this at some point and the scores didn't change. But it's worth doing properly! Will report back in a few days.
The post has been updated with the control experiments :) The traits didn't transfer, and the control models had eval scores similar to the pretrained-only model (labelled as "base model" in the results here and above).