DeepSeek-V3's pretraining recipe is now public, so it could get reproduced with much more compute soon. This might also suggest that scaling of pretraining has already plateaued: leading labs have architectures at least as good as DeepSeek-V3's, pump 20-60 times more compute into them, and get something only marginally better.
There is a Feb 2024 paper that predicts high compute multipliers from using a larger number of finer-grained experts in MoE models, with roughly 64 experts activated per token being optimal at 1e24-1e25 FLOPs, whereas MoE models with known architectures usually activate 2 experts per token. DeepSeek-V3 activates 8 routed experts per token, a step in that direction.
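For intuition on what "finer-grained experts" means mechanically, here is a minimal top-k routing sketch: many narrow routed experts, of which each token activates only a handful. This is an illustrative toy, not DeepSeek-V3's actual layer; the expert count (n_experts=64) and layer widths are arbitrary assumptions, with top_k=8 chosen only to echo the 8-routed-experts-per-token figure above.

```python
# Illustrative sketch of fine-grained MoE routing (toy numbers, not DeepSeek-V3's config).
# Many small routed experts; each token activates only top_k of them.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FineGrainedMoE(nn.Module):
    def __init__(self, d_model=512, d_expert=128, n_experts=64, top_k=8):
        super().__init__()
        self.top_k = top_k
        # Router scores every expert for every token.
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # "Fine-grained": many narrow experts instead of a few wide ones,
        # keeping the parameters activated per token roughly constant.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_expert),
                nn.GELU(),
                nn.Linear(d_expert, d_model),
            )
            for _ in range(n_experts)
        ])

    def forward(self, x):  # x: (num_tokens, d_model)
        scores = self.router(x)                                # (num_tokens, n_experts)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)  # choose top_k experts per token
        gates = F.softmax(top_scores, dim=-1)                  # weights over the chosen experts
        out = torch.zeros_like(x)
        # Simple reference implementation: loop over (slot, expert) pairs.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += gates[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = FineGrainedMoE()
tokens = torch.randn(4, 512)
print(moe(tokens).shape)  # torch.Size([4, 512])
```

With these toy numbers each token still runs only top_k small experts, so the activated compute per token stays roughly fixed; the claimed compute multiplier comes from slicing the same expert capacity into more, smaller pieces and activating more of them.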
On the other hand, things like this should've already been tested at the leading labs, so the chances that it's a new idea being brought to their attention seem slim. Runners-up like xAI and Meta might find it more useful, if fine-grained experts are indeed the reason for DeepSeek-V3's performance, rather than extremely well-done post-training or even pretraining dataset construction.
Ah, so there could actually be a large compute overhang as it stands?