What is the rationale to overtrain a model this much?
The model sizes were likely chosen based on typical inference constraints. Given that, they mostly care about maximizing performance, and aren’t too concerned about the compute cost, since training such small models is very affordable for them. So it’s worth going a long way into the regime of diminishing returns.
Possibly the model would’ve been too strong if it had more active params?
The number of total (rather than active) params influences the speed/cost of generating tokens, but reducing it too much stops helping at some point as the size of KV caches for all requests in a batch starts dominating. Reducing the number of active params (without changing attention or the number of total params) doesn’t influence generation of tokens, but it helps with the speed/cost of processing the initial prompt (or large tool outputs), which can be important for RAG or for loading large parts of a codebase in context.
So they might’ve targeted the number of total params (120B) and a level of benchmark performance, and found that 5.1B active params is roughly where that level is reached. Not sure if 5.1B active params could really have been a target in itself, but it’s a nice ~6x reduction compared to the other open-weights models, if it really doesn’t destroy quality in less easily measurable ways.
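To make the prefill/decode asymmetry above concrete, here’s a rough back-of-envelope sketch in Python. All numbers are made up for illustration (the per-request KV cache size, batch size, bytes per param, and the hypothetical ~30B-active comparison point, which is just the ~6x figure mentioned above); it’s a toy cost model, not a real serving calculation.

```python
# Toy cost model: decode is roughly memory-bandwidth bound (stream all total
# params plus every request's KV cache each step), while prefill is roughly
# compute bound on active params only. All figures below are illustrative.

GB = 1e9

def decode_bytes_per_step(total_params_b, kv_cache_gb_per_req, batch_size,
                          bytes_per_param=1.0):
    """Approximate bytes read from memory per decode step for the whole batch."""
    weight_bytes = total_params_b * 1e9 * bytes_per_param
    kv_bytes = batch_size * kv_cache_gb_per_req * GB
    return weight_bytes + kv_bytes

def prefill_flops(active_params_b, prompt_tokens):
    """~2 FLOPs per active param per token for the forward pass."""
    return 2 * active_params_b * 1e9 * prompt_tokens

# With a large batch of long-context requests, KV caches dominate decode,
# so halving total params barely changes the per-step memory traffic:
big = decode_bytes_per_step(120, kv_cache_gb_per_req=2.0, batch_size=256)
small = decode_bytes_per_step(60, kv_cache_gb_per_req=2.0, batch_size=256)
print(f"decode bytes/step: 120B total ~{big / GB:.0f} GB vs 60B total ~{small / GB:.0f} GB")

# Prefill cost scales with active params, so 5.1B vs a hypothetical ~30B
# active is a large win when loading a 100k-token prompt or codebase:
print(f"prefill FLOPs at 100k tokens: 5.1B active {prefill_flops(5.1, 100_000):.2e} "
      f"vs 30B active {prefill_flops(30, 100_000):.2e}")
```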