I think Blackwell will change the sentiment by late 2025 compared to 2024, with a lot of apparent progress in capabilities and reduced prices (which the public will have a hard time correctly attributing to Blackwell). In 2026 there will be some Blackwell-trained models, using 2x-4x more compute than what we see today (or what we’ll see more of in a few weeks to months with the added long reasoning option, such as GPT-4.5 with reasoning).
But then the possibilities for 2027 branch on whether there are reliable agents, which doesn’t seem knowable either way right now. If this doesn’t work out, in particular because R1-like RL training doesn’t scale or generalize, then by 2027 nothing substantially new will happen, and the 2024-style slowdown sentiment will return, since a 3x-5x increase in training compute is not a game-changing amount (unless there is a nearby threshold to be reached), and Blackwell is a one-time thing that essentially fixes a bug in Ampere/Hopper design (in efficiency for LLM inference) and can’t be repeated even with Rubin Ultra NVL576. At that point individual training systems will cost on the order of $100bn, and so won’t have much further to scale other than at the slower pace of chip improvement (under the assumption that reliable agents remain absent). The Chinese AI companies will be more than 10x but less than 100x behind in training compute (mostly because AI fails to become a priority there), which can occasionally but not reliably be surmounted with brilliant engineering innovations.
>Blackwell is a one-time thing that essentially fixes a bug in Ampere/Hopper design (in efficiency for LLM inference)
Wait, I feel I have my ear pretty close to the ground as far as hardware is concerned, and I don’t know what you mean by this?
Supporting 4-bit datatypes within tensor units seems unlikely to be the end of the road, as exponentiation seems most efficient at factor of 3 for many things, and presumably nets will find their eventual optimal equilibrium somewhere around 2 bits/parameter (explicit trinary seems too messy to retrofit on existing gpu paradigms).
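An aside on the “factor of 3” point, assuming it refers to radix economy (my reading, possibly not what was meant):

```python
# Radix economy: the cost of representing a number N in base b scales as
# b * log_b(N) = (b / ln b) * ln N. The factor b / ln b is minimized at b = e ~ 2.718,
# and 3 is the best integer base, which is one motivation for ~1.6-bit ternary weights.
import math

for b in (2, 3, 4):
    print(b, round(b / math.log(b), 3))  # 2 -> 2.885, 3 -> 2.731, 4 -> 2.885
```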
Was there some “bug” with the hardware scheduler or some low-level memory system, or perhaps an issue with the sparsity implementation that I was unaware of? There were of course general refinements across the board for memory architecture, but nothing I’d consider groundbreaking enough to call it “fixing a bug”.
I re-skimmed the Hopper/Blackwell whitepapers and ran LLM/DR queries, and I’m really not sure what you are referring to. If anything, there appear to be some rough edges introduced with NV-HBI and relying on a virtualized monolithic GPU in code vs the 2x dies underneath. Or are you perhaps arguing that going MCM and beating the reticle limit was itself the one-time thing?
The solution is the increase in scale-up world size, but the “bug” I was talking about is that it used to be too small for the sizes of LLMs that are compute optimal at the current level of training compute. With Blackwell NVL72, this is no longer the case, and it shouldn’t again become the case going forward. Even though there was a theoretical Hopper NVL256, for whatever reason in practice everyone ended up with only Hopper NVL8.
The size of the effect of insufficient world size[1] depends on the size of the model, and gets more severe for reasoning models on long context: with this year’s models, each request would want to ask the system to generate (decode) on the order of 50K tokens while maintaining access to on the order of 100K tokens of KV-cache per trace. This might be the reason Hopper NVL256 never shipped, as this use case wasn’t really present in 2022-2024, but in 2025 it’s critically important, and so the incoming Blackwell NVL72/NVL36 systems will have a large impact.
(There are two main things a large world size helps with: it makes more HBM available for KV-cache, and it enables more aggressive tensor parallelism. When generating a token, the data for all previous tokens (the KV-cache) needs to be available to process the attention blocks, and the tokens of a given trace need to be generated sequentially, one at a time (or something like 1-4 at a time with speculative decoding). Generating one token only needs a little bit of compute, so it’s best to generate tokens for many traces at once, one for each, using more compute across these many tokens. But for this to work, the KV-caches for all these traces need to sit in HBM. If the system would otherwise run out of memory, it needs to constrain the number of traces it processes within a single batch, which means the cost per trace (and per generated token) goes up, since the cost of the system’s time is the same regardless of what it’s doing.
Tensor parallelism lets matrix multiplications go faster by using multiple chips for the same matrix multiplication. Since tokens need to be generated sequentially, one of the only ways to generate a long reasoning trace faster (with given hardware) is to use tensor parallelism (expert parallelism should also help with high-granularity MoE, where a significant number of experts within a layer are active at once rather than the usual 2). And practical tensor parallelism is constrained by the world size.)
[1] As in this image (backup in-blog link) that in its most recent incarnation appeared in the GTC 2025 keynote (at 1:15:56).
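To put rough numbers on the HBM/batch-size point, a minimal back-of-envelope sketch (every model and hardware figure below is an illustrative assumption, not something stated in the thread):

```python
# Back-of-envelope: how the scale-up world size bounds the decode batch, and hence cost per token.
# All model/hardware numbers are illustrative assumptions, not specs of any particular deployment.

def kv_bytes_per_token(n_layers=90, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    # K and V per layer, grouped-query attention assumed
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

def max_traces_in_batch(world_hbm_bytes, weight_bytes, ctx_tokens=100_000):
    # HBM left after weights, divided by the KV footprint of one ~100K-token trace
    free = world_hbm_bytes - weight_bytes
    return max(int(free // (kv_bytes_per_token() * ctx_tokens)), 0)

WEIGHTS = 1.5e12  # ~1.5 TB of weights for a hypothetical large model at ~1 byte/param

for name, n_chips, hbm_per_chip in [("Hopper NVL8", 8, 80e9), ("Blackwell NVL72", 72, 192e9)]:
    traces = max_traces_in_batch(n_chips * hbm_per_chip, WEIGHTS)
    print(f"{name}: ~{traces} concurrent 100K-token traces per scale-up domain")

# With these assumptions, Hopper NVL8 can't even hold the weights within one NVLink domain
# (so the model gets sharded over slower links and/or the batch shrinks), while Blackwell
# NVL72 fits the weights plus hundreds of long KV-caches. Since the cost of the system's
# time is fixed, cost per generated token scales roughly as 1 / (traces per batch).
```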
My guess is that he’s referring to the fact that Blackwell offers much larger world sizes than Hopper and this makes LLM training/inference more efficient. Semianalysis has argued something similar here: https://semianalysis.com/2024/12/25/nvidias-christmas-present-gb300-b300-reasoning-inference-amazon-memory-supply-chain
Well yes, but that is just because they are whitelisting it to work with NVLink-72 switches. There is no reason a Hopper GPU could not interface with NVLink-72 if Nvidia didn’t artificially limit it.
Additionally, by saying
>can’t be repeated even with Rubin Ultra NVL576
I think they are indicating there is something else improving besides world size increases, since that improvement would not exist even two GPU generations from now, when we get 576 (194 gpus) worth of mono-addressable pooled VRAM and the giant world / model-head sizes it will enable.
The reason Rubin NVL576 probably won’t help as much as the current transition from Hopper is that Blackwell NVL72 is already ~sufficient for the model sizes that are compute optimal to train on $30bn Blackwell training systems (which Rubin NVL144 training systems probably won’t significantly leapfrog before Rubin NVL576 comes out, unless there are reliable agents in 2026-2027 and funding goes through the roof).
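One way to sanity-check that “~sufficient” claim, as a rough sketch where every number is an assumption being plugged in rather than a figure from the comment:

```python
# Hypothetical sizing of a ~$30bn Blackwell training system and the compute-optimal model for it.
cost            = 30e9
cost_per_gpu    = 75_000        # assumed all-in $/GPU (chips, networking, datacenter)
flops_per_gpu   = 2.25e15       # ~dense BF16 FLOP/s per Blackwell GPU (assumed)
utilization     = 0.4
seconds         = 1e7           # ~4 months of training

gpus = cost / cost_per_gpu                                  # ~400K GPUs
C = gpus * flops_per_gpu * utilization * seconds            # ~3.6e27 FLOPs
N = (C / 120) ** 0.5                                        # Chinchilla-style: C ~ 6*N*D, D ~ 20*N
weights_tb = N / 1e12                                       # ~1 byte/param at serving precision
nvl72_hbm_tb = 72 * 192e9 / 1e12

print(f"compute ~{C:.1e} FLOPs -> compute-optimal ~{N/1e12:.1f}T params")
print(f"weights ~{weights_tb:.1f} TB vs ~{nvl72_hbm_tb:.1f} TB of HBM in one NVL72 domain")
# ~5-6 TB of weights vs ~14 TB of HBM: the weights fit in one scale-up domain with room left
# for KV-caches. (Real frontier models are MoE, so the dense Chinchilla heuristic is only a
# rough proxy, but it gestures at why NVL72-scale domains stop being the bottleneck.)
```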
>when we get 576 (194 gpus)

The terminology Huang was advocating for at GTC 2025 (at 1:28:04) is to use “GPU” to refer to compute dies rather than chips/packages, and in these terms a Rubin NVL576 rack has 144 chips and 576 GPUs, rather than 144 GPUs. Even though this seems contentious, the terms compute die and chip/package remain less ambiguous than “GPU”.
>But then the possibilities for 2027 branch on whether there are reliable agents, which doesn’t seem knowable either way right now.

Very reliable, long-horizon agency is already in the capability overhang of Gemini 2.5 pro, perhaps even the previous-tier models (gemini 2.0 exp, sonnet 3.5/3.7, gpt-4o, grok 3, deepseek r1, llama 4). It’s just a matter of harness/agent-wrapping logic and inference-time compute budget.
Agency engineering is currently in the brute-force stage. Agent engineers over-rely on a single LLM rollout being robust, but also often use LLM APIs that lack certain nitty-gritty affordances for implementing reliable agency, such as “N completions” with timely self-consistency pruning, and perhaps scaling N up again when the model’s own uncertainty is high.
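As an illustration of the kind of affordance meant here, a minimal sketch (the `complete` callback and `canonicalize` helper are hypothetical stand-ins, not any particular vendor’s API):

```python
from collections import Counter

def self_consistent_answer(prompt, complete, n=4, max_n=32, agreement=0.6):
    # Sample n rollouts, accept the plurality answer only if it is consistent enough,
    # otherwise scale n up and retry (more inference spent when the model is uncertain).
    while n <= max_n:
        samples = complete(prompt, n=n)              # n independent completions
        votes = Counter(canonicalize(s) for s in samples)
        best, count = votes.most_common(1)[0]
        if count / len(samples) >= agreement:
            return best
        n *= 2
    return best                                      # fall back to the last plurality answer

def canonicalize(s):
    # Hypothetical normalization so trivially different phrasings vote together.
    return s.strip().lower()
```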
This somewhat reminds me of the early LLM scale-up era, when LLM engineers over-relied on “stack more layers” without digging more into the architectural details. The best example is perhaps Megatron, a trillion-parameter model from 2021 whose performance is probably abysmal relative to the 2025 models of ~10B parameters (perhaps even 1B).
So, the current agents (such as Cursor, Claude Code, Replit, Manus) are in the “Megatron era” of efficiency. In four years, even with the same raw LLM capability, agents will be very reliable.
To give a more specific example of when robustness is a matter of spending more on inference, consider Gemini 2.5 pro: contrary to the hype, it often misses crucial considerations or acts strangely stupidly on modestly sized contexts (less than 50k tokens). However, seeing these omissions, it’s obvious to me that if someone fed ~1k-token chunks of that context alongside 2.5 pro’s output to a smaller LLM (flash or flash lite) and asked “did this part of the context properly inform that output”, flash would answer No when 2.5 pro had indeed missed something important from that part of the context. This should trigger a fallback to N completions, 2.5’s self-review with smaller pieces of the context, breaking the context down hierarchically, etc.
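A minimal sketch of that cross-check, with hypothetical helpers standing in for actual Gemini API calls:

```python
def chunk(text, size_tokens=1000):
    # Crude word-based proxy for ~1k-token chunks.
    words = text.split()
    return [" ".join(words[i:i + size_tokens]) for i in range(0, len(words), size_tokens)]

def parts_missed_by_output(context, output, ask_small_model):
    # ask_small_model(prompt) -> "yes"/"no"; a stand-in for a flash-tier LLM call.
    missed = []
    for part in chunk(context):
        verdict = ask_small_model(
            f"Context excerpt:\n{part}\n\nModel output:\n{output}\n\n"
            "Did this part of the context properly inform that output? Answer yes or no."
        )
        if verdict.strip().lower().startswith("no"):
            missed.append(part)
    return missed  # non-empty -> fall back to N completions / hierarchical re-review
```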
I meant “reliable agents” in the AI 2027 sense, that is, something on the order of being sufficient for automated AI research, leading to much more revenue and investment in the lead-up, rather than stalling at ~$100bn per individual training system for multiple years. My point is that it’s not currently knowable whether this happens imminently in 2026-2027 or at least a few years later, meaning I don’t expect that evidence exists which distinguishes these possibilities even within the leading AI companies.