Nesov notes that making use of bigger models (i.e. 4T active parameters) is heavily bottlenecked on the HBM on inference chips, as is doing RL on bigger models. He expects it won’t be possible to do the next huge pretraining jump (to ~30T active) until ~2029.
HBM per chip doesn’t matter; what matters is HBM per scale-up world. A scale-up world is a collection of chips with sufficiently good networking between them that inference for large models can be set up across them with good utilization. For H100/H200/B200, a scale-up world is 8 chips (1 server; typically 4 servers per rack); for GB200/GB300 NVL72, it’s 72 chips (1 rack, 140 kW); and for Rubin Ultra NVL576, it’s 144 chips (also 1 rack, but 600 kW).
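To make "HBM per scale-up world" concrete, here is a back-of-envelope sketch for two of the systems mentioned above. The per-chip HBM capacities are approximate public figures and should be treated as illustrative assumptions, not exact specs:

```python
# Approximate HBM per scale-up world (per-chip GB figures are assumptions
# based on public specs, not authoritative numbers).
configs = {
    "H100 (8-chip server)": (8, 80),     # ~80 GB HBM3 per GPU
    "GB300 NVL72 (1 rack)": (72, 288),   # ~288 GB HBM3e per GPU
}

for name, (chips, hbm_gb) in configs.items():
    total_tb = chips * hbm_gb / 1000
    print(f"{name}: ~{total_tb:.1f} TB HBM per scale-up world")
```

The jump from a sub-terabyte 8-GPU server to a ~20 TB rack is the reason per-chip HBM is the wrong unit of analysis.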
use of bigger models (i.e. 4T active parameters) is heavily bottlenecked on the HBM
Models don’t need to fit into a single scale-up world (spanning a few should be fine), and note that the KV cache wants at least as much memory as the model itself. So you are only in trouble once the model is much larger than a scale-up world, in which case you’ll need so many scale-up worlds that you’ll effectively be using the scale-out network for scaling up. That will likely degrade performance and make inference more expensive (compared to the magical hypothetical with larger scale-up worlds, which aren’t necessarily available, so this might still be the way to go). And this is about total params, not active params, though active params indirectly determine the size of the KV cache per user.
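The sizing argument above can be sketched numerically: count weights plus a KV-cache budget at least as large as the weights, and divide by the HBM of one scale-up world. The defaults here (FP8 weights, a GB300-NVL72-like ~20.7 TB rack) are illustrative assumptions:

```python
import math

def scale_up_worlds_needed(total_params_t, bytes_per_param=1,
                           kv_multiplier=1.0, world_hbm_tb=20.7):
    """total_params_t: total (not active) parameters, in trillions.
    bytes_per_param=1 assumes FP8 weights; kv_multiplier=1.0 reserves
    as much HBM for KV cache as for the weights themselves.
    world_hbm_tb defaults to a GB300-NVL72-like rack (~20.7 TB)."""
    weights_tb = total_params_t * bytes_per_param  # 1T params at 1 B/param = 1 TB
    needed_tb = weights_tb * (1 + kv_multiplier)
    return math.ceil(needed_tb / world_hbm_tb)

# A 4T-total-param model fits in one such rack (8 TB needed vs ~20.7 TB);
# a 30T-total-param model needs ~60 TB, so it spans several racks and
# starts leaning on the scale-out network.
print(scale_up_worlds_needed(4))
print(scale_up_worlds_needed(30))
```

This is exactly the "much larger than a scale-up world" regime: the model itself isn't the problem until weights plus KV cache overflow a handful of racks.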
He expects it won’t be possible to do the next huge pretraining jump (to ~30T active) until ~2029.
Nvidia’s GPUs probably won’t be able to efficiently run inference for models with 30T total params (rather than active) until about 2029 (maybe late 2028), when enough Rubin Ultra NVL576 capacity is built. But gigawatts of Ironwood TPUs are being built in 2026, including for Anthropic, and those TPUs will be able to serve inference for such models (for large user bases) in late 2026 to early 2027.
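A rough comparison makes the TPU point concrete. A hypothetical 30T-total-param model at FP8 is ~30 TB of weights, plus at least as much again for KV cache, so ~60 TB. The Ironwood pod sizes and per-chip HBM below are approximate public figures and are assumptions for illustration:

```python
# Hypothetical 30T-total-param model: 30T params * 1 byte (FP8) * 2x for KV cache.
model_hbm_tb = 30 * 1 * 2  # ~60 TB

# Per-chip HBM and pod sizes are approximate public numbers (assumptions).
systems = {
    "GB300 NVL72 rack": 72 * 288 / 1000,        # ~20.7 TB
    "Ironwood 256-chip pod": 256 * 192 / 1000,  # ~49 TB
    "Ironwood 9216-chip pod": 9216 * 192 / 1000,  # ~1770 TB
}

for name, hbm_tb in systems.items():
    verdict = "fits" if hbm_tb >= model_hbm_tb else "needs multiple"
    print(f"{name}: ~{hbm_tb:.0f} TB HBM, {verdict} "
          f"for a ~{model_hbm_tb} TB model footprint")
```

Under these assumptions, a single Nvidia rack falls well short of the footprint, while large Ironwood pods clear it with room to spare, which is the asymmetry the paragraph above is pointing at.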
This demands that others agree with you, for reasons that shouldn’t compel them to agree (in this sentence, rhetoric alone). They don’t agree; that’s the current situation. Appealing to “in reality we are all sitting in the same boat” and “you in fact have as much reason as I do to try to work towards a solution” signals to them that you are ignoring their point of view on what facts hold in reality, which breaks the conversation.
It would be productive to take claims like this as premises and discuss the consequences (to distinguish x-risk-in-the-mind from x-risk-in-reality). But taking disbelieved premises seriously and running with them (for non-technical topics) is not a widespread skill you can expect to encounter often in the wild, unless perhaps you’ve cultivated it in your acquaintances.