I have the same impressions of the models (although 5.3 Codex is surprisingly good). Gemini also does pretty poorly in multi-turn conversations. I will give it a document, ask for feedback/errors, and look at its results. Then I will fix the errors and paste the document again, but it is as if Gemini is blind to the repasted version: it will hallucinate and insist that the errors I already fixed are still there. It is as if the RL environments so heavily favored one-shot single responses that Gemini’s attention is narrowly focused on the first user input.
The external scaffolding of Claude Code also seems to give Claude a performance boost. When I ask Claude to review a document, it reads it in a bit at a time and reviews it piece by piece. By doing this it notices a lot more than Gemini does, even though Gemini seems to have a higher raw IQ.