Quite insightful. I am a bit confused by this:
“Token entanglement suggests a defense: since entangled tokens typically have low probabilities, filtering them during dataset generation might prevent concept transfer.”
Can you explain that more?
I thought that, earlier in your experiments, the entangled number tokens were selected from the number tokens appearing in the top logits of the model's response when asked about its favourite bird. Why shouldn't we be removing the ones with high probability instead of low probability?
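To make sure I'm not misreading the proposed defense, here is roughly what I picture the filtering step doing during dataset generation. This is a minimal sketch assuming a HuggingFace-style causal LM; the threshold value and function name are my own illustration, not something from your post:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# My reading of the defense: when generating the numbers dataset, zero out
# candidate tokens whose next-token probability falls below a threshold,
# on the premise that entangled tokens sit in the low-probability tail.
MIN_PROB = 0.01  # threshold is my own guess

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sample_filtered(prompt: str, max_new_tokens: int = 20) -> str:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(ids).logits[0, -1]
        probs = torch.softmax(logits, dim=-1)
        # Drop the low-probability tail, then renormalize and sample.
        probs[probs < MIN_PROB] = 0.0
        if probs.sum() == 0:  # everything filtered; fall back to greedy
            next_id = logits.argmax().reshape(1)
        else:
            probs = probs / probs.sum()
            next_id = torch.multinomial(probs, 1)
        ids = torch.cat([ids, next_id.unsqueeze(0)], dim=1)
    return tokenizer.decode(ids[0])
```

If that is the right picture, my confusion stands: this keeps the high-probability tokens, yet the entangled tokens you found were the ones showing up in the top logits.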