Anthropic reports SWE-bench scores without reasoning, which is some evidence that reasoning doesn’t help (much) on this sort of task. (See e.g. the release blog post for Opus 4.)
Anecdotal evidence
Probably it would be more accurate to say “doesn’t seem to help much, while it helps a lot for OpenAI models”.