DeepSeek-V3's MoE architecture is unusual in having high granularity: 8 active routed experts per token rather than the usual 1-2. Llama 4 Maverick doesn't do that[1]. The closest thing is the recent Qwen3-235B-A22B, which also has 8 active experts.
From the release blog post:

As an example, Llama 4 Maverick models have 17B active parameters and 400B total parameters. … MoE layers use 128 routed experts and a shared expert. Each token is sent to the shared expert and also to one of the 128 routed experts.
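For concreteness, here is a minimal sketch of the routing difference, not any actual implementation: every token always passes through the shared expert, plus its top-k routed experts, where k=1 corresponds to Maverick-style routing and k=8 to the higher-granularity DeepSeek-V3 / Qwen3 style. The dimensions, the single-matrix "experts", and all names here are made up for illustration; real routed experts are small FFNs and the routers differ in detail.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class ToyMoELayer:
    """Toy MoE layer: a shared expert plus top-k of n routed experts per token."""
    def __init__(self, d_model, n_routed_experts, top_k, seed=0):
        rng = np.random.default_rng(seed)
        self.top_k = top_k
        # Router produces one score per routed expert.
        self.router = rng.normal(0.0, 0.02, (d_model, n_routed_experts))
        # Each "expert" is just a single weight matrix in this sketch.
        self.experts = rng.normal(0.0, 0.02, (n_routed_experts, d_model, d_model))
        self.shared_expert = rng.normal(0.0, 0.02, (d_model, d_model))

    def __call__(self, tokens):  # tokens: (n_tokens, d_model)
        scores = softmax(tokens @ self.router)               # (n_tokens, n_experts)
        top_idx = np.argsort(-scores, axis=-1)[:, : self.top_k]
        out = tokens @ self.shared_expert                    # every token hits the shared expert
        for t in range(tokens.shape[0]):
            for e in top_idx[t]:
                # Weight each selected routed expert by its renormalized router score.
                w = scores[t, e] / scores[t, top_idx[t]].sum()
                out[t] += w * (tokens[t] @ self.experts[e])
        return out

tokens = np.random.default_rng(1).normal(size=(4, 64))
maverick_like = ToyMoELayer(d_model=64, n_routed_experts=128, top_k=1)  # 1 routed + shared
deepseek_like = ToyMoELayer(d_model=64, n_routed_experts=128, top_k=8)  # 8 routed + shared
print(maverick_like(tokens).shape, deepseek_like(tokens).shape)

At a fixed active-parameter budget, the higher-granularity choice (more, smaller active experts) gives the router far more possible expert combinations per token.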