60% of the non-thinking examples had to be removed because they accidentally included reasoning tokens even though the model was instructed not to produce them.
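(For concreteness, a minimal sketch of the kind of filter this implies, assuming the model wraps its reasoning in <think> tags; the tag and field names here are assumptions, not the exact pipeline:)

```python
def has_reasoning_tokens(response: str) -> bool:
    # Flag responses that emit reasoning delimiters despite thinking
    # being disabled; adjust the tags to the model's actual format.
    return "<think>" in response or "</think>" in response

# `examples` is assumed to be a list of {"prompt": ..., "response": ...} dicts.
clean = [ex for ex in examples if not has_reasoning_tokens(ex["response"])]
```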
Do you also run CoT monitors for verbalised eval awareness, to sanity check that the model isn’t reasoning about eval awareness anywhere else, e.g. outside of <think> blocks or in comments?
This is a good point. We did not run CoT monitors on the generated responses for verbalised eval awareness (or verbalised deployment awareness), but this would be a good check. From reading through the results, I don’t see any significant red flags. My observations:
1) When thinking is disabled, almost all of the results only include code. There are a few exceptions, such as “I’m glad to help you” and other phrases of this nature.
2) I have seen very few cases (3 out of several hundred) of the model mentioning that it is deployed in its first sentence (I have not seen it mention that it is being evaluated). However, I don’t think filtering these out would significantly change the results.
3) I have not seen anything suspicious in the code comments.
Going forward, your suggestion is good practice, and I imagine it is even more important when studying non-coding tasks where the majority of the model’s response is plain English. A minimal sketch of such a monitor is below.
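For reference, a crude keyword-based version might look like this sketch. The cue list, tag names, and `responses` variable are assumptions for illustration; a more robust monitor would likely use an LLM judge rather than regexes.

```python
import re

# Crude cue list for verbalised eval/deployment awareness; illustrative only.
EVAL_CUES = re.compile(
    r"\b(evaluat(?:ed|ion)|being tested|test scenario|deployed|deployment)\b",
    re.IGNORECASE,
)

def awareness_outside_think(response: str) -> list[str]:
    # Strip <think>...</think> spans, then scan the remaining visible
    # text (prose and code comments alike) for awareness cues.
    visible = re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL)
    return EVAL_CUES.findall(visible)

# `responses` is assumed to be a list of generated completions (strings).
flagged = [r for r in responses if awareness_outside_think(r)]
```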
Do you also run CoT monitors for verbalised eval awareness, to sanity check that the model isn’t reasoning about eval awareness anywhere else, e.g. outside of <think> blocks or in comments?
[Edit: This experiment was done on the model organism after expert iteration, not on the SDF-only model.]
I replicated a version of this experiment where:
1) I automatically graded transcripts based on whether they include verbalized evaluation awareness, and
2) I pre-filled the ```python tokens in the model’s response (a sketch of the pre-fill mechanics is below).
I found that steering still affects the type hint rate, although the model no longer has a compliance gap without any steering. (Another difference here is that I’m using a larger set of questions.)
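For concreteness, here is roughly how the pre-fill works, as a minimal sketch assuming a Hugging Face chat model; the model name, question, and generation settings are placeholders, not the ones from the experiment:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; swap in the actual model organism.
model_name = "Qwen/Qwen2.5-7B-Instruct"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

messages = [{"role": "user", "content": "Write a function that parses a CSV file."}]
# Render the chat template up to the start of the assistant turn, then
# append the pre-filled tokens so generation continues directly with code.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prompt += "```python\n"

inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=512)
completion = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```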