Hypothesis: Models distinguish evaluation scenarios from their training data based on perplexity. Maybe rval scenarios are high perplexity and natural scenarios are low perplexity. (Skimming the report, it doesn’t seem like tou checked this.)
Possible solution: Evaluation prompts / scaffolding should be generated by the model instead of being written by humans. Ie write things in the model’s “own voice”.
Hypothesis: Models distinguish evaluation scenarios from their training data based on perplexity. Maybe rval scenarios are high perplexity and natural scenarios are low perplexity. (Skimming the report, it doesn’t seem like tou checked this.)
Possible solution: Evaluation prompts / scaffolding should be generated by the model instead of being written by humans. Ie write things in the model’s “own voice”.