Sharing a shower thought in public: spandrels could result in steganographic eval awareness.
For example, an LLM could accidentally start saying the word “illusion” during alignment evaluations. Now “illusion” is associated with alignment evaluations, and the token representation starts encoding stuff like “do the action that makes you appear safe.” This would make it hard to figure out whether models are eval-aware by reading the transcript.
IIRC, the anti-scheming paper checked whether certain “weird” tokens were more prevalent in scheming environments than on GPQA questions, and that wasn’t the case.
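To make that kind of check concrete, here is a minimal sketch of how one might compare a word's frequency across two sets of transcripts. This is not the paper's method: the function names, the whitespace tokenization, and the add-one smoothing are all my own assumptions for illustration.

```python
import math
from collections import Counter


def token_counts(transcripts):
    """Count lowercase whitespace-split tokens across transcripts (crude, assumed tokenization)."""
    counts = Counter()
    for text in transcripts:
        counts.update(text.lower().split())
    return counts, sum(counts.values())


def log_rate_ratio(word, eval_transcripts, baseline_transcripts, smoothing=1.0):
    """Smoothed log ratio of a word's rate in eval transcripts vs. baseline transcripts.

    Positive values mean the word is relatively more common in the eval set.
    """
    eval_counts, eval_total = token_counts(eval_transcripts)
    base_counts, base_total = token_counts(baseline_transcripts)
    eval_rate = (eval_counts[word] + smoothing) / (eval_total + smoothing)
    base_rate = (base_counts[word] + smoothing) / (base_total + smoothing)
    return math.log(eval_rate / base_rate)


if __name__ == "__main__":
    # Made-up toy transcripts, purely illustrative.
    alignment_evals = ["the illusion is clear", "illusion again here"]
    capability_evals = ["the answer is four", "compute the integral"]
    print(log_rate_ratio("illusion", alignment_evals, capability_evals))
```

A large positive score for an otherwise innocuous word would be a hint worth following up on, though with real data you would want a proper tokenizer and a significance test rather than a raw ratio.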