I disagree on DeepSeek and innovation. Yes, R1 is obviously a reaction to o1, but its MoE model is pretty innovative, and it is Llama 4 that obviously copied DeepSeek. I do agree that innovation is unpopular in China, but from interviews of DeepSeek founder Liang Wenfeng, we know DeepSeek was explicitly an attempt to overcome China’s unwillingness to innovate.
DeepSeek-V3’s MoE architecture is unusual in having high granularity, 8 active experts rather than the usual 1-2. Llama 4 Maverick doesn’t do that[1]. The closest thing is the recent Qwen3-235B-A22B, which also has 8 active experts.
[1] From the release blog post: “As an example, Llama 4 Maverick models have 17B active parameters and 400B total parameters. … MoE layers use 128 routed experts and a shared expert. Each token is sent to the shared expert and also to one of the 128 routed experts.”
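For concreteness, here is a minimal sketch of what the difference amounts to. It is a generic top-k routed MoE layer in PyTorch; the dimensions, expert counts, and router details are made up for illustration, not the real configs of either model. Maverick-style routing sends each token to 1 routed expert plus a shared expert, while DeepSeek-V3-style fine-grained routing sends each token to 8 (smaller) routed experts.

```python
# Illustrative sketch of token-level top-k MoE routing.
# All sizes and expert counts below are assumptions, not the actual model configs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model, d_ff, n_routed_experts, top_k, use_shared_expert):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_routed_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_routed_experts)
        ])
        # A shared expert processes every token regardless of routing.
        self.shared_expert = (
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            if use_shared_expert else None
        )

    def forward(self, x):  # x: (tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)  # top_k routed experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        if self.shared_expert is not None:
            out = out + self.shared_expert(x)
        return out

# Maverick-style (per the quoted blog post): 128 routed experts,
# each token goes to 1 routed expert plus the shared expert.
maverick_moe = MoELayer(d_model=512, d_ff=2048, n_routed_experts=128,
                        top_k=1, use_shared_expert=True)

# DeepSeek-V3-style fine-grained routing (simplified): many smaller routed experts,
# 8 of them active per token (V3 also keeps a shared expert).
deepseek_moe = MoELayer(d_model=512, d_ff=256, n_routed_experts=64,
                        top_k=8, use_shared_expert=True)

tokens = torch.randn(10, 512)
print(maverick_moe(tokens).shape, deepseek_moe(tokens).shape)
```

The point of the granularity difference is that with 8 smaller active experts per token you get many more possible expert combinations for roughly the same active-parameter budget as 1-2 big experts.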
I would roughly punt it into the category of “optimization”, not “innovation”. “Innovation” is something like transformers, instruct-training, or RL-on-CoTs. MoE scaling is an incremental-ish improvement.
Or, to put it another way: it’s an innovation in the field of compute-optimal algorithms/machine learning. It’s not an AI innovation.
But from interviews of DeepSeek founder Liang Wenfeng, we know DeepSeek was explicitly an attempt to overcome China’s unwillingness to innovate
Yes, and we’re yet to see them succeed. And with the CCP having apparently turned its sights on them, that attempt may be thoroughly murdered already.