Am I interpreting you correctly that the responses of both Opus 4 and o3 here are wrong according to the theorem?
Also, would the following restatement of the theorem be a correct understanding? The student model can't ever become worse (according to the teacher) when fine-tuned on (any) outputs from the teacher, on any distribution.
Yes and yes, basically. Although, to be clear: (i) “according to the teacher” should be “according to the loss used to obtain the teacher,” (ii) the theorem deals with the case of directly distilling on logits, whereas our LLM experiments involve sampling according to the teacher’s logits (which introduces noise), and (iii) the theorem only applies when you finetune on the unmodified teacher distribution—it doesn’t deal with the case where you filter the responses.
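To make caveat (ii) concrete, here is a toy numerical sketch (not the paper's actual setup; all logits are made up): direct logit distillation minimizes an exact KL against the teacher's full distribution, whereas training on sampled teacher outputs optimizes a Monte Carlo estimate of the same cross-entropy term, which is unbiased but noisy.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
teacher_logits = np.array([2.0, 0.5, -1.0])   # hypothetical teacher
student_logits = np.array([0.5, 1.0, -0.5])   # hypothetical student

p = softmax(teacher_logits)  # teacher distribution
q = softmax(student_logits)  # student distribution

# (a) Direct logit distillation: exact KL(p || q), the setting the theorem covers.
kl = float(np.sum(p * np.log(p / q)))

# (b) Sampling from the teacher: Monte Carlo estimate of the cross-entropy
# term -E_p[log q], built from tokens drawn from p. Same objective in
# expectation, but with sampling noise.
samples = rng.choice(len(p), size=1000, p=p)
mc_cross_entropy = float(np.mean(-np.log(q[samples])))
exact_cross_entropy = float(-np.sum(p * np.log(q)))
```

Since KL(p || q) and the cross-entropy differ only by the (student-independent) teacher entropy, minimizing the sampled cross-entropy targets the same optimum as direct distillation, just with extra variance.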