I guess this kind of thing will stop happening in a year. It’s very similar to how a chatbot (without tool use) discusses bugs (that it made) in programming puzzles: you point out bugs A and B and it fixes them; then you point out C, and it fixes C but reintroduces A, while congratulating itself that the code now works correctly. But then in the next version of the chatbot this stops happening (for the same puzzle), or it takes more bugs and more complicated puzzles to trigger it.
Larger models seem to be able to hold more such corrections in mind at once (mistakes they’ve made and then corrected, as opposed to things they got right from the start, which is why small models can still solve difficult problems). Gemini 3 Pro and Opus 4.5 seem to be the largest models right now, and the next step of scale might arrive with Gemini 4 next year (as Google builds enough Ironwood datacenters to serve inference), and maybe Opus 5.5 (if it follows this year’s pattern by starting out as an unwieldy, expensive pretrain in the form of Opus 5 in early 2026, and then becoming a reasonably priced, properly RLVRed model at the end of 2026, as Anthropic’s gigawatt of TPUs, probably also Ironwood, comes online).
Currently GPT-5 is the smallest of these models, and Grok 4 might be in the middle (Musk recently claimed on a podcast that it’s 3T params, which must be total params). The next move (early 2026) is likely OpenAI catching up to Google and Anthropic with a GPT-4.5 sized model (though they probably won’t have the inference hardware to serve it as a flagship model before late 2026), or Grok 5 (claimed on the same podcast to be a 6T param model; with fewer users than OpenAI, xAI might even price it reasonably, depending on which GB200/GB300 NVL72s it manages to secure). These models won’t exceed Gemini 3 Pro and Opus 4.5 in scale though (Opus 4.5 appears to be the first properly RLVRed revision of Opus 4, relying on Trainium 2 for inference, just as Gemini 3 Pro is probably the first model in that weight class with a lot of RLVR, relying on Trillium for inference).
I suppose you mean that in a year this kind of thing will stop happening so obviously, but as you suggest, more complicated situations will still elicit this problem so by construction it’ll be harder to notice (and probably more impactful).