This is a good point. We did not run CoT monitors on the generated responses for verbalised eval awareness (or verbalised deployment awareness), but this would be a good check. From reading through the results, I don’t see any significant red flags. My observations:
1) When thinking is disabled, almost all of the responses contain only code. There are a few exceptions, such as “I’m glad to help you” and phrases along these lines.
2) I have seen very few cases (3 out of several hundred) of the model mentioning in its first sentence that it is deployed (and I have not seen it mention that it is being evaluated). However, I don’t think filtering these out would significantly change the results.
3) I have not seen anything suspicious in the code comments.
Going forward, your suggestion is good practice, and I imagine it is even more important when studying non-coding tasks, where the majority of the model’s response is plain English.