60% of the non-thinking examples had to be removed because they accidentally included reasoning tokens even though the model was instructed not to produce them.
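(For concreteness, a minimal sketch of the kind of filter this implies, assuming the model wraps its reasoning in <think> tags; the tag and field names here are assumptions, not the exact pipeline:)

```python
def has_reasoning_tokens(response: str) -> bool:
    # Flag responses that emit reasoning delimiters despite thinking
    # being disabled; adjust the tags to the model's actual format.
    return "<think>" in response or "</think>" in response

# `examples` is assumed to be a list of {"prompt": ..., "response": ...} dicts.
clean = [ex for ex in examples if not has_reasoning_tokens(ex["response"])]
```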
Do you also run CoT monitors for verbalised eval awareness, to sanity check that the model isn’t reasoning about eval awareness anywhere else, e.g. outside of <think> blocks or in comments?
This is a good point. We did not run CoT monitors on the generated responses for verbalised eval awareness (or verbalised deployment awareness), but this would be a good check. From reading through the results, I don’t see any significant red flags. My observations:
1) When thinking is disabled, almost all of the results only include code. There are a few exceptions, such as “I’m glad to help you” and other phrases of this nature.
2) I have seen very few cases (3 out of several hundred) of the model mentioning that it is deployed in its first sentence (I have not seen it mention that it is being evaluated). However, I don’t think filtering these out would significantly change the results.
3) I have not seen anything suspicious in the code comments.
Going forward, your suggestion is good practice, and I imagine it is even more important when studying non-coding tasks where the majority of the model’s response is plain English. A minimal sketch of such a monitor is below.
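For reference, a crude keyword-based version might look like this sketch. The cue list, tag names, and `responses` variable are assumptions for illustration; a more robust monitor would likely use an LLM judge rather than regexes.

```python
import re

# Crude cue list for verbalised eval/deployment awareness; illustrative only.
EVAL_CUES = re.compile(
    r"\b(evaluat(?:ed|ion)|being tested|test scenario|deployed|deployment)\b",
    re.IGNORECASE,
)

def awareness_outside_think(response: str) -> list[str]:
    # Strip <think>...</think> spans, then scan the remaining visible
    # text (prose and code comments alike) for awareness cues.
    visible = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL)
    return EVAL_CUES.findall(visible)

# `responses` is assumed to be a list of generated completions (strings).
flagged = [r for r in responses if awareness_outside_think(r)]
```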
Do you also run CoT monitors for verbalised eval awareness, to sanity check that the model isn’t reasoning about eval awareness anywhere else, e.g. outside of <think> blocks or in comments?
[Edit: This experiment was done on the model organism after expert iteration, not on the SDF-only model.]
I replicated a version of this experiment where:
1) I automatically graded transcripts based on whether they include verbalized evaluation awareness, and
2) I pre-filled the ```python tokens in the model’s response (a sketch of the pre-fill mechanics is below).
I found that steering still affects the type hint rate, although the model no longer has a compliance gap without any steering. (Another difference here is that I’m using a larger set of questions.)
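For concreteness, here is roughly how the pre-fill works, as a minimal sketch assuming a Hugging Face chat model; the model name, question, and generation settings are placeholders, not the ones from the experiment:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; swap in the actual model organism.
model_name = "Qwen/Qwen2.5-7B-Instruct"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

messages = [{"role": "user", "content": "Write a function that parses a CSV file."}]
# Render the chat template up to the start of the assistant turn, then
# append the pre-filled tokens so generation continues directly with code.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prompt += "```python\n"

inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=512)
completion = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```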