DeepSeek-V3's MoE architecture is unusual in having high granularity: 8 active routed experts per token rather than the usual 1-2. Llama 4 Maverick doesn't do that[1]. The closest thing is the recent Qwen3-235B-A22B, which also has 8 active experts.
From the release blog post:

As an example, Llama 4 Maverick models have 17B active parameters and 400B total parameters. … MoE layers use 128 routed experts and a shared expert. Each token is sent to the shared expert and also to one of the 128 routed experts.
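For concreteness, here is a minimal sketch of the routing difference, not any actual implementation: every token always passes through the shared expert, plus its top-k routed experts, where k=1 corresponds to Maverick-style routing and k=8 to the higher-granularity DeepSeek-V3 / Qwen3 style. The dimensions, the single-matrix "experts", and all names here are made up for illustration; real routed experts are small FFNs and the routers differ in detail.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class ToyMoELayer:
    """Toy MoE layer: a shared expert plus top-k of n routed experts per token."""
    def __init__(self, d_model, n_routed_experts, top_k, seed=0):
        rng = np.random.default_rng(seed)
        self.top_k = top_k
        # Router produces one score per routed expert.
        self.router = rng.normal(0.0, 0.02, (d_model, n_routed_experts))
        # Each "expert" is just a single weight matrix in this sketch.
        self.experts = rng.normal(0.0, 0.02, (n_routed_experts, d_model, d_model))
        self.shared_expert = rng.normal(0.0, 0.02, (d_model, d_model))

    def __call__(self, tokens):  # tokens: (n_tokens, d_model)
        scores = softmax(tokens @ self.router)               # (n_tokens, n_experts)
        top_idx = np.argsort(-scores, axis=-1)[:, : self.top_k]
        out = tokens @ self.shared_expert                    # every token hits the shared expert
        for t in range(tokens.shape[0]):
            for e in top_idx[t]:
                # Weight each selected routed expert by its renormalized router score.
                w = scores[t, e] / scores[t, top_idx[t]].sum()
                out[t] += w * (tokens[t] @ self.experts[e])
        return out

tokens = np.random.default_rng(1).normal(size=(4, 64))
maverick_like = ToyMoELayer(d_model=64, n_routed_experts=128, top_k=1)  # 1 routed + shared
deepseek_like = ToyMoELayer(d_model=64, n_routed_experts=128, top_k=8)  # 8 routed + shared
print(maverick_like(tokens).shape, deepseek_like(tokens).shape)

At a fixed active-parameter budget, the higher-granularity choice (more, smaller active experts) gives the router far more possible expert combinations per token.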