It would be great to add a control training along with these results (e.g., a similar training process, but using random answers to the questions instead of answers produced by the teacher), to see how much of the difference is caused by the finetuning itself, excluding subliminal learning (e.g., removing refusals to express preferences, HHH biases, etc.). A rough sketch of what I mean is below.
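For concreteness, a minimal sketch of what such a control dataset could look like. The data format, `make_control_dataset`, and `random_number_answer` are all assumptions for illustration, not the authors' actual pipeline; the idea is just to keep the questions and training setup identical while swapping the teacher's answers for random ones.

```python
import random

def make_control_dataset(finetune_data, random_answer_fn, seed=0):
    """Replace each teacher-generated answer with a random answer,
    keeping the questions (and the rest of the setup) unchanged.

    finetune_data: assumed to be a list of {"question": ..., "answer": ...} dicts.
    random_answer_fn: callable producing a random answer for a given question.
    """
    rng = random.Random(seed)
    return [
        {**example, "answer": random_answer_fn(example["question"], rng)}
        for example in finetune_data
    ]

def random_number_answer(question, rng):
    # One possible choice of "random answer": if the teacher's answers were
    # number sequences, sample numbers uniformly instead of asking the teacher.
    return ", ".join(str(rng.randint(0, 999)) for _ in range(10))

# Hypothetical usage with a made-up example pair.
teacher_data = [
    {"question": "Continue this sequence: 3, 7, 12", "answer": "18, 25, 33"},
]
control_data = make_control_dataset(teacher_data, random_number_answer)
```

Finetuning on `control_data` with the same hyperparameters as the teacher-distilled run would then isolate how much of the eval shift comes from finetuning on the question format alone.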
Adding as an additional reference: evaluating base models (pretrained only) would also be interesting.
Hmm, good point. We ran this at some point and the scores didn't change. But it's worth doing properly! Will report back in a few days.
The post has been updated with the control experiments :) The traits didn't transfer, and the control models had eval scores similar to the pretrained-only model (labelled as "base model" in the results here and above).