Tim Hua comments on Towards a Typology of Strange LLM Chains-of-Thought

Tim Hua 3 Nov 2025 7:29 UTC
2 points
Sharing a shower thought in public: spandrels could result in steganographic eval awareness.
For example, an LLM could accidentally start saying the word “illusion” during alignment evaluations. Now “illusion” is associated with alignment evaluations, and the token representation starts encoding stuff like “do the action that makes you appear safe.” This would make it hard to figure out whether models are eval-aware by reading the transcript.
IIRC, the anti-scheming paper checked whether certain “weird” tokens are more prevalent in scheming environments than on GPQA questions, and they weren't.
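To make that kind of check concrete, here is a rough sketch of how one might compare how often a word shows up in two sets of transcripts. This is not the paper's method, just an illustration: the function names, the regex tokenizer, and the smoothing constant are all my own assumptions, and a real analysis would use the model's tokenizer and a proper significance test.

```python
import math
import re
from collections import Counter


def word_counts(transcripts):
    """Count lowercase word occurrences across a list of transcript strings."""
    counts = Counter()
    for text in transcripts:
        # Crude word-level tokenizer (an assumption, not the model's tokenizer).
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts, sum(counts.values())


def log_rate_ratio(word, eval_transcripts, baseline_transcripts, smoothing=0.5):
    """Log of (smoothed rate in eval transcripts / smoothed rate in baseline).

    Positive values mean the word is relatively more common in the eval set.
    The smoothing constant is arbitrary and only avoids division by zero.
    """
    eval_counts, eval_total = word_counts(eval_transcripts)
    base_counts, base_total = word_counts(baseline_transcripts)
    eval_rate = (eval_counts[word] + smoothing) / (eval_total + smoothing)
    base_rate = (base_counts[word] + smoothing) / (base_total + smoothing)
    return math.log(eval_rate / base_rate)


if __name__ == "__main__":
    # Toy, made-up transcripts purely for illustration.
    scheming_env = ["the illusion persists", "an illusion again"]
    gpqa = ["the answer is b", "the answer is c"]
    print(log_rate_ratio("illusion", scheming_env, gpqa))
```

A value near zero for a "weird" token would match the null result I'm remembering. A clearly positive value would be the kind of signal worth a closer look, though it wouldn't show that the token encodes anything by itself.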