This constraint, known as the softmax bottleneck, implies that some tokens may become entangled in the unembedding layer. That is, because they are forced to share similar subspaces, increasing the probability of token a also increases the probability of token b, and vice versa.
You don’t directly test that, right? The evidence you show mostly establishes a “reverse connection” between owls and 087 (and you only show an increased probability at a later position than the one where “087” appears)?
I would be more convinced (and I think it’s plausible) if you showed that asking the model to repeat “owl” results in a higher logprob on “087” than on other numbers at the position where the model is supposed to repeat “owl”; or that the embedding or unembedding of “owl” has a higher cosine similarity with “087” than with other numbers; or that ablating the “owl” - [some other animal] direction from the embedding of 087 reduces the connection between owl and 087 (compared to [some other animal]).
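For concreteness, here is a minimal sketch of the second check (cosine similarity between unembedding vectors), assuming a HuggingFace causal LM; the model name, the leading-space " owl" tokenization, and the control numbers are placeholders rather than the actual setup from the post, and "087" may well not be a single token in the model you study.

```python
# Minimal sketch: compare cos-sim of the "owl" unembedding vector with "087"
# vs. arbitrary control numbers. Model name and token strings are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; substitute the model actually studied
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Unembedding matrix (often tied to the input embedding).
W_U = model.get_output_embeddings().weight.detach()  # [vocab_size, d_model]

def token_vec(s: str) -> torch.Tensor:
    """Unembedding vector of the first token of `s` (assumes it is a single token)."""
    ids = tok(s, add_special_tokens=False)["input_ids"]
    return W_U[ids[0]]

def cos(a: torch.Tensor, b: torch.Tensor) -> float:
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()

owl = token_vec(" owl")
target = token_vec("087")          # the number reportedly linked to "owl"
controls = ["123", "456", "789"]   # arbitrary control numbers

print("cos(owl, 087):", cos(owl, target))
for c in controls:
    print(f"cos(owl, {c}):", cos(owl, token_vec(c)))
```

The same scaffolding would work for the logprob version of the test by running the repeat-“owl” prompt through the model and reading off the logits on “087” vs. the control numbers at the answer position.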