I found this paper by Amir Zur and others really interesting: “It’s Owl in the Numbers: Token Entanglement in Subliminal Learning”. The authors try to explain subliminal learning (the notion that a “language model fine-tuned on seemingly meaningless data from a teacher model acquires the teacher’s hidden behaviors”).
The researchers found that certain tokens, like “owl” and “087”, can become entangled during training: increasing the probability of one also increases the probability of the other.
Fascinating, and I would be curious to hear what others think!
You may be interested in this discussion, then. The article you mention is also posted on LW.
Thanks, I missed that!