cloud comments on Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data

cloud 9 Aug 2025 19:27 UTC
3 points
Yes and yes, basically. Although, to be clear: (i) “according to the teacher” should be “according to the loss used to obtain the teacher,” (ii) the theorem deals with the case of directly distilling on logits, whereas our LLM experiments involve sampling according to the teacher’s logits (which introduces noise), and (iii) the theorem only applies when you finetune on the unmodified teacher distribution—it doesn’t deal with the case where you filter the responses.
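
To make distinctions (ii) and (iii) concrete, here is a minimal NumPy sketch for a single next-token distribution. It is an illustration only, not our experimental setup: the function names, the five-token vocabulary, and the `allowed` mask are all hypothetical. It relies on the standard fact that the gradient of softmax cross-entropy with respect to the student's logits is the student's probabilities minus the target probabilities.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def logit_distillation_grad(student_logits, teacher_logits):
    """Case covered by the theorem: the target is the teacher's full distribution."""
    return softmax(student_logits) - softmax(teacher_logits)


def sampled_distillation_grad(student_logits, teacher_logits, n_samples, rng):
    """Point (ii): the target is the empirical distribution of tokens
    sampled from the teacher, an unbiased but noisy stand-in."""
    p_teacher = softmax(teacher_logits)
    tokens = rng.choice(len(p_teacher), size=n_samples, p=p_teacher)
    empirical = np.bincount(tokens, minlength=len(p_teacher)) / n_samples
    return softmax(student_logits) - empirical


def filtered_sampled_grad(student_logits, teacher_logits, n_samples, allowed, rng):
    """Point (iii): samples are filtered before training, so the target is
    no longer the unmodified teacher distribution. `allowed` is a hypothetical
    boolean mask over tokens standing in for a response filter."""
    p_teacher = softmax(teacher_logits)
    tokens = rng.choice(len(p_teacher), size=n_samples, p=p_teacher)
    kept = tokens[allowed[tokens]]
    if kept.size == 0:
        raise ValueError("filter rejected every sample")
    empirical = np.bincount(kept, minlength=len(p_teacher)) / kept.size
    return softmax(student_logits) - empirical


rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=5)
student_logits = rng.normal(size=5)
allowed = np.array([False, True, True, True, True])  # filter out token 0

exact = logit_distillation_grad(student_logits, teacher_logits)
noisy = sampled_distillation_grad(student_logits, teacher_logits, 50, rng)
filtered = filtered_sampled_grad(student_logits, teacher_logits, 50, allowed, rng)

print("exact   :", np.round(exact, 3))
print("sampled :", np.round(noisy, 3))
print("filtered:", np.round(filtered, 3))
```

With more samples the sampled gradient approaches the exact one, whereas filtering changes the target itself, so extra samples do not recover the unfiltered teacher distribution.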