Neat paper! IIUC, all of the experiments use the same base model for both the student and the teacher. Did you find the transfer effect was blunted if you used a different model for the student vs. the teacher? My mental model for this phenomenon is that the teacher and student generalize similarly: the teacher generalizes from updates on the FT dataset to its responses on the number generation task, so the student will likewise generalize from the number generation task back to the FT dataset prompts. Using different student/teacher models would presumably then produce a smaller effect size, although in that case successful instances would potentially tell you something more interesting about the nature of the pretraining data itself, à la [Ilyas et al.](https://arxiv.org/abs/1905.02175) (e.g. maybe the number 347 is an inherently owl-y number).