I don’t think this result plausibly holds up very well on certain important classes of problems, or it might hold but only for absurdly large k. In particular, I think GPT-5.1 is built on a base model that’s pretty similar in quality to GPT-4o’s (possibly somewhat better, possibly literally the same base model, I forget), while the GPT-4o base model would surely require k >> 1000 to solve the hardest math problems that GPT-5.1 solves using >30k tokens.
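To make "absurdly large k" a bit more concrete, here is a rough sketch of how the required number of samples scales with per-sample success probability. It assumes samples are independent with a fixed success probability p, which is a simplification; the function name `samples_needed` and the p values are made up for illustration, not measurements of any model.

```python
import math


def samples_needed(p: float, target: float = 0.5) -> int:
    """Smallest k with 1 - (1 - p)**k >= target, assuming i.i.d. samples.

    p is a hypothetical per-sample success probability; target is the
    desired pass@k. Both are illustrative assumptions.
    """
    if not 0.0 < p < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("p and target must be strictly between 0 and 1")
    # log1p keeps precision when p is tiny.
    return math.ceil(math.log1p(-target) / math.log1p(-p))


if __name__ == "__main__":
    # Illustrative values only: how k grows as p shrinks.
    for p in (1e-2, 1e-4, 1e-6):
        print(f"p={p:g}: k >= {samples_needed(p)}")
```

Under these assumptions, k grows roughly like 1/p, so a problem the base model solves once in a million samples already needs k in the hundreds of thousands for even odds.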