Possibly the model would’ve been too strong if it had more active params?
The number of total (rather than active) params influences the speed/cost of generating tokens, but reducing it stops helping beyond a point, once the KV caches for all requests in a batch start dominating memory traffic. Reducing the number of active params (without changing attention or the total param count) doesn't affect token generation speed, but it does speed up (and cheapen) processing of the initial prompt (or large tool outputs), which can matter a lot for RAG or for loading large parts of a codebase into context.
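A rough back-of-envelope sketch of the first point: in large-batch serving, each decode step reads all the weights once plus every request's KV cache, so shrinking total params stops paying off once the batched KV reads dominate. All numbers below are illustrative assumptions, not the actual model config.

```python
# Back-of-envelope: memory read per decode step for a ~120B-total-param MoE.
# Assumption: large-batch serving, where all expert weights get read each step
# regardless of how many params are active per token.

def decode_traffic_gb(total_params_b, kv_bytes_per_token, batch, ctx_len):
    """Return (weight GB, KV-cache GB) read per decode step."""
    weight_gb = total_params_b * 1e9 * 1 / 1e9   # assume ~1 byte/param (8-bit-ish)
    kv_gb = batch * ctx_len * kv_bytes_per_token / 1e9
    return weight_gb, kv_gb

# Assume ~100 KB of KV cache per token per request (hypothetical; depends on
# layer count, KV heads, head dim, and cache precision).
for batch in (8, 64, 256):
    w, kv = decode_traffic_gb(120, 100_000, batch, ctx_len=8192)
    print(f"batch={batch:4d}: weights {w:.0f} GB vs KV {kv:.1f} GB per step")
```

Under these assumptions, weights dominate at small batches, but by batch 256 at 8K context the KV reads exceed the weight reads, which is the point where cutting total params further buys little.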
So they might’ve targeted the total param count (120B) and a level of benchmark performance, and found that 5.1B active params is where that performance is reached. Not sure 5.1B active params could really have been a target in itself, but it’s a nice 6x reduction compared to the other open weights models, if it really doesn’t hurt quality in less easily measurable ways.