I had a notification ping in my brain just now while using Claude Code, when I realized I'd just told it to think for a long time: I don't think the claim is true, because it doesn't match my experience.
Anthropic reports SWE-bench scores without reasoning, which is some evidence it doesn't help (much) on this sort of task. (See e.g. the release blog post for Opus 4.)
Anecdotal evidence
Probably it would be more accurate to say "doesn't seem to help much for Anthropic models, while it helps a lot for OpenAI models".
Is there a standard citation for this?
How do you come by this fact?