Daniel Tan comments on Do models know when they are being evaluated?

Daniel Tan 22 Feb 2025 1:53 UTC
6 points
0
Hypothesis: Models distinguish evaluation scenarios from their training data based on perplexity. Maybe rval scenarios are high perplexity and natural scenarios are low perplexity. (Skimming the report, it doesn’t seem like tou checked this.)
Possible solution: Evaluation prompts / scaffolding should be generated by the model instead of being written by humans. Ie write things in the model’s “own voice”.