With Blackwell[1] still being manufactured and installed, newer large models, and especially their long reasoning variants, remain unavailable, prohibitively expensive, or too slow (GPT-4.5 is out, but not its thinking variant). In a few months Blackwell will be everywhere, and between now and then widely available frontier capabilities will significantly improve. Next year, there will be even larger models trained on Blackwell.
This kind of improvement can't currently come from post-training alone; it needs long reasoning traces or larger base models. But post-training is still good at improving things under the lamppost, hence the illusory nature of current improvement when you care about things further out in the dark.
Blackwell is an unusually impactful chip generation, because it fixes what turned out to be a major issue with Ampere and Hopper for inference of large language models on long context, by increasing the scale-up world size from 8 Hopper chips to 72 Blackwell chips. Not having enough memory or compute within each high-bandwidth scale-up domain was a bottleneck that made inference unnecessarily slow and expensive. Hopper was still designed before ChatGPT, and it took 2-3 years for the importance of LLMs as an application to propagate into working datacenters.
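To make the bottleneck concrete, here is a rough back-of-envelope sketch in Python. The model dimensions and parameter count are hypothetical stand-ins for a large frontier model; the HBM capacities (80 GB per H100, 192 GB per B200) and scale-up domain sizes (8 vs 72 GPUs) are the published figures:

```python
# Back-of-envelope: how many long-context requests fit in one scale-up domain.
# Model dimensions below are hypothetical stand-ins for a large frontier model;
# HBM capacities and domain sizes are the published Hopper/Blackwell figures.

NUM_LAYERS = 120       # hypothetical
NUM_KV_HEADS = 8       # hypothetical (grouped-query attention)
HEAD_DIM = 128         # hypothetical
KV_BYTES_PER_ELEM = 2  # bf16 KV cache

# KV cache per token: keys + values, across all layers.
kv_bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * KV_BYTES_PER_ELEM

CONTEXT_LEN = 128_000  # one long-context request
kv_bytes_per_seq = kv_bytes_per_token * CONTEXT_LEN

WEIGHT_BYTES = 500e9 * 2  # hypothetical 500B-parameter model in bf16

def seqs_per_domain(num_gpus: int, hbm_gb_per_gpu: int) -> float:
    """Concurrent long-context sequences that fit after loading the weights."""
    total_hbm = num_gpus * hbm_gb_per_gpu * 1e9
    return max(0.0, (total_hbm - WEIGHT_BYTES) / kv_bytes_per_seq)

print(f"KV cache per {CONTEXT_LEN:,}-token sequence: {kv_bytes_per_seq / 1e9:.0f} GB")
print(f"8x H100, 80 GB each:   {seqs_per_domain(8, 80):.0f} sequences")
print(f"72x B200, 192 GB each: {seqs_per_domain(72, 192):.0f} sequences")
```

Under these assumptions, the bf16 weights alone (~1 TB) overflow an 8-GPU Hopper domain (640 GB of HBM), forcing scale-out across slower inter-node links, while a 72-GPU Blackwell domain (~13.8 TB) holds the weights plus a couple hundred concurrent 128k-token KV caches inside one high-bandwidth domain.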