Matan Shtepel comments on If you can generate obfuscated chain-of-thought, can you monitor it?

Matan Shtepel 14 Sep 2025 19:13 UTC
Thanks for the interesting experiments!
Some questions:
1. I find it strange that Attack Mode Main Task performance went up dramatically from GPT-4.1 $\rightarrow$ GPT-4.1-no-cot. I find it less surprising that Attack Mode Side Task performance went down, but I am surprised by how much it dropped, given that CoT seemed relatively unimportant for “Main Task” performance. Any idea why?
2. Isn’t GPT-4.1 a “non-reasoning model”? In my experience using it in Cursor, it does not output CoT before taking actions. Does that matter for these experiments?