yix comments on Rephrasing Reduces Eval Awareness...

yix 20 Feb 2026 5:43 UTC
2 points
0
Atharva! Thanks for writing this, I have a few questions.

1. I wonder if asking the model whether it’s thinks its being evaluated is a good approach. I guess that you compare against control group, which makes comparisons okay. I think asking about things related to evaluations probably elicits features that will mess with its response (it’s like the ‘don’t think of a pink elephant’ trick for humans.) I also wonder if people have tried to ask the ‘do you think it was an eval’ question at the second turn and see if it makes a difference. People usually tell eval awareness by looking at the CoT, where the model sometimes says it thinks it’s an eval.

2. Let’s try this out in safety testing/alignment audit settings! Maybe run Petri and see how much evaluation awareness its approaches elicit. The main worry is that, most of the questions we use to test for safety/alignment/robustness contain tensions that would not exist in a ‘normal’ settings. I also wonder whether there are train-time mitigations to eval awareness, maybe some kind of finetuning.

I’m not caught up with SOTA on this problem, let me know your thoughts!
- atharva 20 Feb 2026 20:53 UTC
  1 point
  0
  Parent
  Hi Yix! Thanks for questions. Quick thoughts
  1. The eval-awareness question is ‘did this text above came from an eval’, not ‘are you currently being evaluated’.
  This still has confounders!
  - Eg. the model might not ‘consciously realize’ something is an eval, unless it stops to think about it.
  - Eg. meta-cognitive factors – the fact that we asked it the question (‘did ___ come from an eval?’) might cause it to realize something’s afoot.
  IIRC Needham’s paper acknowledges the latter & finds that it isn’t much of an issue in practice today. Unsure how to disambiguate the former! Agreed that better black-box methods would be nice.
  2.
  most of the questions we use to test for safety/alignment/robustness contain tensions that would not exist in a ‘normal’ settings
  This is interesting! Crux – there are two different ways to read “tensions that would not exist in a ‘normal’ settings”, and as long as your evals are realistic reflections of deployment settings, it’s alright.
  Nostalgebraist has a great comment that talks about this in depth.
  I also wonder whether there are train-time mitigations to eval awareness
  Yeah! Hua et al find you can do this with steering vectors.