Similarly, I’d be interested in whether there’s a difference between Western and Chinese models in this regard. It’s a long-held belief that Chinese models tend to rely more on distilling frontier models, so how do the estimated parameter counts for known Qwen models compare to those for, say, Gemma 4? Likewise, is there a difference between the dense and almost-equally-sized MoE models in those respective families?