HBM size increases (per scale-up world) and the IMO results are cruxes for my argument though (regarding the next generation of LLMs being potentially takeoff-capable). It’s not about scaffolding in general.
The 8-chip servers are not just much smaller than GB200 NVL72 (let alone Ironwood); they are smaller than the compute-optimal size for a dense model at even 2024 levels of training compute (which is about 1T active params). Thus any MoE model has to drive the number of active params substantially below what’s compute optimal, on pain of overly slow/expensive inference and RLVR training. But with 20-50 TB of HBM, this constraint is lifted almost completely: MoE models can soak up whatever spare HBM remains above the compute-optimal active param count (in the form of total params), without incurring much overhead as long as the total params take up less than ~half of a scale-up world (or two).
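To make the sizes concrete, here is a minimal back-of-envelope sketch, using my own illustrative numbers and simplifications rather than figures from the comment: a Chinchilla-style C ≈ 6·N·D rule with D ≈ 20·N for the compute-optimal dense param count, ~1e26 FLOPs as a stand-in for 2024-level training compute, 8 × 80 GB of HBM for an 8-chip server, params stored at 1 byte each (FP8), and KV cache/activations ignored, so the HBM figures are loose upper bounds.

```python
# Back-of-envelope sketch (illustrative assumptions, see lead-in above).

def compute_optimal_dense_params(train_flops: float, tokens_per_param: float = 20.0) -> float:
    """Compute-optimal param count for a dense model from C ~ 6*N*D with D ~ 20*N."""
    return (train_flops / (6.0 * tokens_per_param)) ** 0.5

def params_fitting_in_hbm(hbm_bytes: float, bytes_per_param: float = 1.0) -> float:
    """Rough ceiling on total params that fit in one scale-up world's HBM (ignores KV cache, activations)."""
    return hbm_bytes / bytes_per_param

# ~1e26 FLOPs as a stand-in for 2024-level frontier training compute
print(f"compute-optimal dense params: {compute_optimal_dense_params(1e26):.1e}")   # ~9e11, about 1T
# 8-chip server with 8 x 80 GB of HBM
print(f"params fitting in 640 GB:     {params_fitting_in_hbm(640e9):.1e}")          # ~6e11, below compute optimal
# a 20-50 TB scale-up world (30 TB taken as a midpoint)
print(f"params fitting in 30 TB:      {params_fitting_in_hbm(30e12):.1e}")          # ~3e13, ample room for MoE total params
```

Under these assumptions, the 8-chip server can’t even hold a compute-optimal dense model, while a 20-50 TB world leaves tens of trillions of params’ worth of spare HBM for MoE total params.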
The IMO results strongly suggest that the current manual methods of adaptation are good enough to tackle any given sufficiently specialized problem domain (including those where only informal, fuzzy feedback is available) at the level of performance of the most capable humans. So plausibly all that remains is automating something that already works, rather than developing something new.
And in 2026 there is a confluence of these factors, as well as continual learning being in the spotlight, so the probability of a significant advancement seems unusually high, beyond what hardware scaling at 2022-2026 rates (3.5x per year, plus adoption of lower precisions) would on its own make promising (compared to the 2028+ slowdown).