Dusto comments on Do models know when they are being evaluated?

Dusto 2 Mar 2025 11:56 UTC
6 points
0
Excited to see more in this space. Was there any consideration to measuring “when” in addition to “if” these specific features trigger a sense of evaluation? Such as a multiple round interaction where strong features are slowly introduced into the conversation, or if strong features are introduced early in the exchange, does that bias the model to be on alert? and if so, for how long? if there is sufficient typical conversation following a strong feature queue does that reduce the likelihood of the model judging the entire interaction as being an evaluation?
The pre or post assessment seems like it might either prime the expectation of an incoming evaluation (pre), or possibly draw attention to evaluation-like features in hindsight (post). I wonder if there is an independent measure (something like the number of tokens used measurement, in the Reworr & Volkov LLM honeypot paper as a distinguisher between human/bot and LLM) that doesn’t explicitly call out evals directly.