It’s instructive to look at API provider prices for open weights models (such as DeepInfra, Fireworks), where model shapes are known and competition ensures that margins are not absurd, keeping in mind that almost all tokens are input tokens (for example Sonnet 4 gets 98% of API tokens as input tokens). So the margins on input tokens will matter more for the API provider, and the cost there depends on active params (but not total params). Costs for output tokens, by contrast, depend on the context size of a particular query and on total params (and essentially don’t depend on active params), but output tokens get much less volume.
As an anchor, Qwen 3 Coder is a 480B-A35B model, and DeepInfra serves it at $0.40/$1.60 per 1M input/output in FP8. In principle, an A35B model needs 7e16 FLOPs per 1M input tokens, which at 50% compute utilization in FP8 needs 70 H100-seconds and should cost $0.04 at $2 per H100-hour. But probably occupancy is far from perfect. Unfortunately, there aren’t any popular open weights MoE models with a lot of active params for a plausibly more direct comparison with closed frontier models.
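To make the arithmetic explicit, here is that estimate as a small Python sketch; the ~2e15 FLOP/s of FP8 per H100 is the peak rate implied by the 70 H100-seconds figure above, and the other constants are the same assumptions.

```python
# Rough sketch of the input-token cost estimate above (assumed round numbers).
ACTIVE_PARAMS = 35e9        # Qwen3 Coder is 480B-A35B: 35B active params
TOKENS = 1e6                # per 1M input tokens
H100_FP8_RATE = 2e15        # assumed peak FP8 FLOP/s per H100 (implied by "70 H100-seconds")
UTILIZATION = 0.5           # assumed 50% compute utilization
H100_PRICE = 2.0            # assumed $2 per H100-hour

flops = 2 * ACTIVE_PARAMS * TOKENS                    # ~2 FLOPs per active param per token
h100_seconds = flops / (H100_FP8_RATE * UTILIZATION)
cost = h100_seconds / 3600 * H100_PRICE
print(f"{flops:.1e} FLOPs, {h100_seconds:.0f} H100-seconds, ${cost:.3f} per 1M input tokens")
# -> 7.0e+16 FLOPs, 70 H100-seconds, $0.039 per 1M input tokens
```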
almost all tokens are input tokens (for example Sonnet 4 gets 98% of API tokens as input tokens).
[EDIT: I was being silly and not considering the entire input being fed back to the LLM as context]
A bit tangential, but this is really wild. I’m confident that can’t be true for chat sessions, so it has to be API usage. It does seem to vary substantially by model, eg for R1 it looks to be more like 80–85%. It makes some sense that it would be especially high for Anthropic, since they’ve focused on the enterprise market.
What kinds of use cases are nearly all input? The main one that I immediately see is using LLMs essentially as classifiers; eg you show an LLM a post by a user and ask whether it’s toxic, or show it a bunch of log data and ask whether it contains errors. Maybe coding is another, where the LLM reads lots of code and then writes a much smaller amount, or points out bugs. Are there others that seem plausibly very common?
I suspect the models’ output tokens become input tokens when the conversation proceeds to the next turn; certainly my API statistics show several times as many input tokens as output despite the fact that my responses are invariably shorter than the models’.
Thanks, that’s a really great point, and I feel silly for not considering it.
I’m not entirely sure offhand how to model that. Naively, the entire conversation is fed in as context on each forward pass, but I think KV caching means that’s not entirely the right way to model it (and how cached tokens are billed probably varies by model and provider). I’m also not sure to what extent those caches persist across conversational turns (and even if they do, presumably they’re dropped once the user hasn’t responded again for some length of time).
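Ignoring caching entirely, a toy model of that naive picture (the whole conversation re-sent as input every turn) already produces input-heavy ratios in the reported range; the per-turn token counts below are made up.

```python
# Toy model: each turn, the whole conversation so far is billed again as input.
# Ignores KV-cache billing, which varies by provider; token counts are made up.

def token_split(turns, user_tokens_per_turn, model_tokens_per_turn, system_prompt=0):
    """Return (total billed input tokens, total output tokens) over a conversation."""
    input_total, output_total, context = 0, 0, system_prompt
    for _ in range(turns):
        context += user_tokens_per_turn          # user message joins the context
        input_total += context                   # whole context is sent as input
        output_total += model_tokens_per_turn    # model's reply is billed as output...
        context += model_tokens_per_turn         # ...and becomes input context next turn
    return input_total, output_total

inp, out = token_split(turns=10, user_tokens_per_turn=100, model_tokens_per_turn=500)
print(f"input {inp}, output {out}: {inp / (inp + out):.0%} of billed tokens are input")
# -> input 28000, output 5000: 85% of billed tokens are input
```

Longer conversations or bigger system prompts push the ratio higher still.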
So the margins on input tokens will matter more for the API provider, and the cost there depends on active params (but not total params). Costs for output tokens, by contrast, depend on the context size of a particular query and on total params (and essentially don’t depend on active params), but output tokens get much less volume.
Could you explain the reasoning behind this? Or link to an existing explanation?
There’s a DeepMind textbook (the estimate for output tokens is in equation (1) of the inference chapter), though it focuses on TPUs and doesn’t discuss issues with low total HBM in Nvidia’s older 8-chip servers (which somewhat go away with GB200 NVL72). (I previously tried explaining output token generation a bit in this comment.)
Basically, all else equal, chips compute much faster than they can fetch the input data for that computation (even from HBM), but if they are multiplying sufficiently large matrices, the number of necessary operations becomes much greater than the amount of input/output data, and so it becomes possible to keep the chips fed. But when generating output tokens, you are only computing a few tokens per query, at the very end of a context, while you need to keep the whole context in HBM in the form of KV cache (intermediate data computed from the context), which can run into gigabytes per query, with HBM per server being only about 1 TB (in 8-chip Nvidia servers). As a result, you can’t fit very many queries in a server at the same time, the total number of tokens being generated at once gets too low, and the time is mostly spent moving data between HBM and the chips rather than on the chips computing things.
The amount of computation is proportional to the number of active params, but you’d need to pass most of the total params (as well as all KV cache for all queries) through the chips each time you generate another batch of tokens. And since computation isn’t the taut constraint, the number of active params won’t directly matter.
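As a rough illustration of how lopsided this gets during decoding, here’s a back-of-envelope sketch for a hypothetical 8-chip server with ~1.1 TB of HBM; the KV-cache size per token, context length, batch size, and per-chip figures are all assumptions for illustration, not measurements.

```python
# One decoding step on a hypothetical 8-chip server (~1.1 TB HBM total).
# All numbers are illustrative assumptions, not measurements.

TOTAL_PARAM_BYTES = 480e9     # e.g. a 480B total param model in FP8 (1 byte/param)
ACTIVE_PARAMS     = 35e9      # active params multiplied per generated token
KV_PER_TOKEN      = 100e3     # assumed ~100 KB of KV cache per token of context
CONTEXT           = 50_000    # assumed tokens of context per query
BATCH             = 32        # concurrent queries (weights + KV cache ~640 GB, fits in HBM)

HBM_BANDWIDTH = 8 * 4.8e12    # bytes/s, assuming ~4.8 TB/s per chip
FP8_RATE      = 8 * 2e15      # FLOP/s, same assumed per-chip FP8 figure as above

bytes_per_step = TOTAL_PARAM_BYTES + BATCH * CONTEXT * KV_PER_TOKEN  # weights + all KV cache
flops_per_step = BATCH * 2 * ACTIVE_PARAMS                           # one new token per query

memory_time  = bytes_per_step / HBM_BANDWIDTH
compute_time = flops_per_step / FP8_RATE
print(f"memory: {memory_time*1e3:.1f} ms vs compute: {compute_time*1e3:.2f} ms per step")
print(f"~{BATCH / memory_time:.0f} output tokens/s for the whole server")
# -> memory: 16.7 ms vs compute: 0.14 ms per step; ~1920 output tokens/s
```

With numbers like these, the step time is set almost entirely by how fast the weights and KV cache can stream out of HBM, which is the sense in which active params don’t directly matter.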
The number of total params also won’t strongly matter as long as the weights aren’t a major portion of HBM compared to KV cache, so for example in GB200 NVL72 (which has 14 TB of HBM) 2T total param models shouldn’t be at a disadvantage compared to much smaller 250B total param models, because most of the data that needs to pass through the chips will be KV cache (similarly for gpt-oss-120B and the 8-chip servers). The number of active params only matters indirectly, in that a model with more active params will tend to have a higher model dimension and will want to use more memory per token in its KV cache, which does directly increase the cost of output tokens.
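A toy version of that comparison (KV bytes per token and context length are made-up round numbers):

```python
# GB200 NVL72 rack (~14 TB HBM, as above): once the rack is filled with long-context
# queries, how much of the per-step HBM traffic is weights vs. KV cache?
# KV bytes per token and context length are made-up round numbers.

HBM_TOTAL    = 14e12      # bytes of HBM in the rack
KV_PER_TOKEN = 100e3      # assumed bytes of KV cache per token of context
CONTEXT      = 100_000    # assumed tokens of context per query

for total_params in (2e12, 250e9):                # 2T vs 250B total params, FP8 (1 byte/param)
    weight_bytes = total_params
    queries = (HBM_TOTAL - weight_bytes) // (KV_PER_TOKEN * CONTEXT)
    print(f"{total_params/1e9:.0f}B total params: ~{queries:.0f} concurrent queries, "
          f"weights are {weight_bytes / HBM_TOTAL:.0%} of per-step HBM traffic")
# -> 2000B total params: ~1200 concurrent queries, weights are 14% of per-step HBM traffic
# -> 250B total params: ~1375 concurrent queries, weights are 2% of per-step HBM traffic
```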