In my experience using the LLM wrapper IDEs (Cursor, Windsurf, etc.), if I ask the model to do a task where one of the assumptions I made when writing the task was wrong (e.g. I ask it to surface some piece of information to the user in the response to some endpoint, but that piece of information doesn't actually exist until a later step of the process), GPT-5 will spin for a long time and churn through my codebase until it gets some result that looks like success if you squint, while Sonnet 4.5 will generally break out of the loop and ask me for clarification.
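To make the mismatch concrete, here's a hypothetical sketch of that kind of task (all names invented for illustration): the agent is told to include `processing_result` in the create endpoint's response, but that value only comes into existence in a later step.

```python
# Hypothetical sketch (all names invented). Task given to the agent:
# "include processing_result in the response from create_order" --
# but at that point in the flow, there is no processing_result yet.

def create_order(payload: dict) -> dict:
    # Step 1: the order is created; no processing has happened yet,
    # so there is nothing correct to surface here.
    order = {"id": 123, "status": "pending", "items": payload["items"]}
    return {"order_id": order["id"], "status": order["status"]}

def process_order(order_id: int) -> dict:
    # Step 2 (runs later, e.g. in a background worker): only now
    # does processing_result exist at all.
    return {"order_id": order_id, "processing_result": "ok"}
```

A model that doesn't notice the premise is wrong will restructure things until `create_order` returns *something* under that key; the behavior I want is for it to flag that the value doesn't exist yet and ask what I actually meant.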
Sonnet 4.5's behavior is what I want as a user, but it probably scores worse on the METR benchmark.