So the margins on input tokens will be more important for the API provider, and the cost there depends on active params (but not total params), while the costs for output tokens depend on the context size of a particular query and on total params (and essentially don't depend on active params), but get much less volume.
Could you explain the reasoning behind this? Or link to an existing explanation?
There’s a DeepMind textbook (the estimate for output tokens is in equation (1) of the inference chapter), though it focuses on TPUs and doesn’t discuss the issues with low total HBM in Nvidia’s older 8-chip servers (which somewhat go away with GB200 NVL72). (I previously tried explaining output token generation a bit in this comment.)
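Not the textbook's exact notation, but the general shape of that kind of estimate is a roofline bound on the time per decoding step for a batch of $B$ concurrent queries, using the standard ~2 FLOPs per active param per token approximation (the symbols $N_{\text{active}}$ and $W_{\text{total}}$ are just my labels here):

$$T_{\text{step}} \;\approx\; \max\!\left(\frac{2\, N_{\text{active}}\, B}{\text{peak FLOP/s}},\ \ \frac{\text{bytes}(W_{\text{total}}) + \text{bytes}(\text{KV cache})}{\text{HBM bandwidth}}\right)$$

During output token generation the second (memory) term dominates at the batch sizes that actually fit in HBM, which is the core of the reasoning below.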
Basically, all else equal, chips compute much faster than they can get input data for that computation (even from HBM), but if they are multiplying sufficiently large matrices, the number of necessary operations becomes much greater than the amount of input/output data, and so it becomes possible to keep the chips fed. But when generating output tokens, you are only computing a few tokens per query, at the very end of a context, while you need to keep the whole context in HBM in the form of a KV cache (intermediate data computed from the context), which can run into gigabytes per query, with HBM per server being only about 1 TB (in 8-chip Nvidia servers). As a result, you can’t fit too many queries in a server at the same time, the total number of tokens being generated at any given moment gets too low, and the time is mostly spent moving data between HBM and the chips rather than on the chips computing things.
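A back-of-envelope sketch of why the KV cache crowds out batch size; every number below is an illustrative assumption, not a quoted figure:

```python
# Rough KV cache sizing for a hypothetical long-context model.
# Every number here is an illustrative assumption.
n_layers = 80          # transformer layers
n_kv_heads = 8         # KV heads (grouped-query attention)
head_dim = 128         # per-head dimension
bytes_per_elem = 2     # bf16 KV cache

# K and V for each token, summed over all layers
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
print(kv_bytes_per_token / 2**20, "MiB per token")        # ~0.31 MiB

context_len = 100_000  # a long-context query
kv_bytes_per_query = kv_bytes_per_token * context_len
print(kv_bytes_per_query / 2**30, "GiB per query")        # ~30 GiB

server_hbm = 1e12      # ~1 TB of HBM in an 8-chip server
weight_bytes = 500e9   # e.g. 500B total params stored in 8-bit
hbm_left_for_kv = server_hbm - weight_bytes
print(hbm_left_for_kv / kv_bytes_per_query, "concurrent queries fit")  # ~15
```

With only ~15 long-context queries decoding at once, each step multiplies the weights by a batch of just ~15 token vectors, which is far too thin to keep the chips compute-bound.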
The amount of computation is proportional to the number of active params, but you’d need to pass most of the total params (as well as the entire KV cache for all queries) through the chips each time you generate another batch of tokens. And since computation isn’t the taut constraint, the number of active params won’t directly matter.
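Continuing the sketch, a per-step roofline comparison makes this concrete; the hardware numbers are rough assumptions for an 8-chip H100/H200-class server:

```python
# Does compute or memory traffic set the time of one decoding step?
# Hardware numbers are rough assumptions for an 8-chip server.
peak_flops = 8 * 1e15        # ~1 PFLOP/s dense bf16 per chip, 8 chips
hbm_bw = 8 * 3.3e12          # ~3.3 TB/s of HBM bandwidth per chip

batch = 15                   # concurrent queries, from the sketch above
active_params = 50e9         # active params per token (MoE)
total_weight_bytes = 500e9   # total params stored in 8-bit
kv_bytes_total = batch * 33e9  # ~33 GB of KV cache per long query

compute_time = 2 * active_params * batch / peak_flops   # ~2 FLOPs/param/token
memory_time = (total_weight_bytes + kv_bytes_total) / hbm_bw

print(f"compute: {compute_time*1e3:.2f} ms, memory: {memory_time*1e3:.1f} ms")
# Memory time comes out ~200x larger: the bytes of total params and KV cache
# streamed per step set the cost, while the active-param FLOPs barely register.
```

In this toy setup the memory term is a couple of orders of magnitude larger than the compute term, which is why active params don't directly show up in the cost of output tokens.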
The number of total params also won’t strongly matter if it’s not a major portion of HBM compared to KV cache, so for example in GB200 NVL72 (which has about 14 TB of HBM), 2T total param models shouldn’t be at a disadvantage compared to much smaller 250B total param models, because most of the data that needs to pass through the chips will be KV cache (similarly for gpt-oss-120B on the 8-chip servers). The number of active params only matters indirectly, in that a model with more active params will tend to have a higher model dimension and will want to use more memory per token in its KV cache, which does directly increase the cost of output tokens.
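And a similar toy comparison for the GB200 NVL72 point, checking how much the total param count (250B vs 2T) changes the bytes streamed per step once KV cache is accounted for (again, all numbers are assumptions):

```python
# GB200 NVL72 domain: ~14 TB of HBM shared across 72 chips.
# How much of the per-step memory traffic is weights vs KV cache?
hbm_total = 14e12           # ~14 TB of HBM in the NVL72 domain
kv_bytes_per_query = 33e9   # ~33 GB per long-context query (as above)

for total_params in (250e9, 2e12):   # 250B vs 2T total params
    weight_bytes = total_params      # stored in 8-bit, 1 byte per param
    n_queries = int((hbm_total - weight_bytes) / kv_bytes_per_query)
    step_bytes = weight_bytes + n_queries * kv_bytes_per_query
    print(f"{total_params/1e9:.0f}B total params: {n_queries} queries/batch, "
          f"weights are {weight_bytes/step_bytes:.0%} of bytes per step")
# 250B: weights ~2% of the traffic; 2T: ~14%. KV cache dominates either way.
```

In this toy setup the cost per output token (bytes streamed per step divided by batch size) goes up only about 15% when going from 250B to 2T total params, since most of the traffic is KV cache either way.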