Thanks for the interesting experiments!
Some questions:
I find it strange that Attack Mode Main Task performance went up dramatically from GPT-4.1 $\rightarrow$ GPT-4.1-no-cot. I find it less surprising that Attack Mode Side Task performance went down, but I am surprised by how much, given that CoT seemed relatively unimportant for “Main Task” performance.
Isn’t GPT-4.1 a “non-reasoning model”? In my experience using it in Cursor, it does not output CoT before taking actions. Does that matter for these experiments?