By the end of 2030, models with up to 2-3 quadrillions of total params will be practical (but with 30x sparsity, only about 1.3 quadrillions might be actually used).
Can you spell out how the latter derives from the former? Like anaguma, I’m confused.
The latter doesn’t derive from the former. It’s a separate claim (more of a counterpoint), explained later in the post, which estimates 44T active params to be compute optimal with late 2030 pretraining compute, which in turn becomes 1.3 quadrillion total params at 30x sparsity. That is, 2-3 quadrillion total params would ask for 45-70x sparsity when using 44T active params, and that’s too much sparsity for my taste for the largest model ever attempted at a new level of scale (as I further explain in another comment). So I expect the models to be a bit smaller than the 2-3 quadrillion total params that the hardware makes practical, with pretraining compute and first-time scaling risk as the limiting factors.
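To make the arithmetic explicit, here is a minimal sketch that only restates the figures quoted in this thread (44T active params, 30x sparsity, 2-3 quadrillion total params). The variable and function names are my own for illustration, and "sparsity" is assumed to mean the ratio of total to active params.

```python
# Figures quoted in the thread; names are illustrative, not from the post.
ACTIVE_PARAMS = 44e12                 # estimated compute-optimal active params, late 2030
SPARSITY = 30                         # assumed meaning: total params / active params
HARDWARE_TOTAL_RANGE = (2e15, 3e15)   # total params the hardware would make practical


def total_params(active: float, sparsity: float) -> float:
    """Total params implied by a given active-param count and sparsity ratio."""
    return active * sparsity


def required_sparsity(total: float, active: float) -> float:
    """Sparsity ratio needed to reach `total` params with `active` params."""
    return total / active


expected_total = total_params(ACTIVE_PARAMS, SPARSITY)
sparsity_range = tuple(
    required_sparsity(total, ACTIVE_PARAMS) for total in HARDWARE_TOTAL_RANGE
)

print(f"{expected_total / 1e15:.2f} quadrillion total params at {SPARSITY}x sparsity")
print(
    "sparsity needed for 2-3 quadrillion total params: "
    + " to ".join(f"{s:.0f}x" for s in sparsity_range)
)
```

This gives about 1.32 quadrillion total params at 30x, and roughly 45x to 68x sparsity to fill 2-3 quadrillion, which is the 45-70x range above after rounding.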