Our experience so far is that while reasoning models don’t improve performance directly (3.7 is better than 3.6, but 3.7 with extended thinking is NOT better than plain 3.7), they do improve it indirectly, because the thinking trace helps us debug prompts and tool output when the model misunderstands them. This was not the result we expected, but it is what we see.
Completely agree with this. While there are some novel applications possible with reasoning models, the main value has been the ability to trace specific chains of thought and redefine/reprompt accordingly. It makes the system (slightly) less of a black box.
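For anyone who wants to try this workflow, here is a minimal sketch of pulling the thinking trace out of a response so you can see where the model misreads a prompt or a tool result. It uses the Anthropic Python SDK with extended thinking enabled; the model id, token budgets, and the prompt are only placeholders, not the setup described above.

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    response = client.messages.create(
        model="claude-3-7-sonnet-20250219",  # placeholder model id
        max_tokens=2048,                     # must be larger than the thinking budget
        thinking={"type": "enabled", "budget_tokens": 1024},
        messages=[{"role": "user", "content": "Summarize the tool output above."}],
    )

    # The response interleaves thinking blocks with the final answer;
    # dumping the thinking blocks is what makes prompt/tool bugs visible.
    for block in response.content:
        if block.type == "thinking":
            print("--- thinking trace ---")
            print(block.thinking)
        elif block.type == "text":
            print("--- answer ---")
            print(block.text)

Reading the trace side by side with the prompt or tool schema usually makes it obvious which part the model misunderstood.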