DeepSeek-V3's pretraining recipe is now public, so it could get reproduced with much more compute soon. This might also suggest that scaling of pretraining has already plateaued: leading labs have architectures at least as good as DeepSeek-V3's, pump 20-60 times more compute into them, and get something only marginally better.
There is a Feb 2024 paper that predicts high compute multipliers from using a larger number of finer-grained experts in MoE models, with roughly 64 experts activated per token being optimal at 1e24-1e25 FLOPs, whereas MoE models with known architectures usually activate 2 experts per token. DeepSeek-V3 activates 8 routed experts per token, a step in that direction.
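For intuition on what "finer-grained experts" means mechanically, here is a minimal top-k routing sketch: many narrow routed experts, of which each token activates only a handful. This is an illustrative toy, not DeepSeek-V3's actual layer; the expert count (n_experts=64) and layer widths are arbitrary assumptions, with top_k=8 chosen only to echo the 8-routed-experts-per-token figure above.

```python
# Illustrative sketch of fine-grained MoE routing (toy numbers, not DeepSeek-V3's config).
# Many small routed experts; each token activates only top_k of them.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FineGrainedMoE(nn.Module):
    def __init__(self, d_model=512, d_expert=128, n_experts=64, top_k=8):
        super().__init__()
        self.top_k = top_k
        # Router scores every expert for every token.
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # "Fine-grained": many narrow experts instead of a few wide ones,
        # keeping the parameters activated per token roughly constant.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_expert),
                nn.GELU(),
                nn.Linear(d_expert, d_model),
            )
            for _ in range(n_experts)
        ])

    def forward(self, x):  # x: (num_tokens, d_model)
        scores = self.router(x)                                # (num_tokens, n_experts)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)  # choose top_k experts per token
        gates = F.softmax(top_scores, dim=-1)                  # weights over the chosen experts
        out = torch.zeros_like(x)
        # Simple reference implementation: loop over (slot, expert) pairs.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += gates[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = FineGrainedMoE()
tokens = torch.randn(4, 512)
print(moe(tokens).shape)  # torch.Size([4, 512])
```

With these toy numbers each token still runs only top_k small experts, so the activated compute per token stays roughly fixed; the claimed compute multiplier comes from slicing the same expert capacity into more, smaller pieces and activating more of them.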
On the other hand, things like this should've already been tested at the leading labs, so the chances that it's a new idea being brought to their attention seem slim. Runners-up like xAI and Meta might find it more useful, if fine-grained experts are indeed the reason for DeepSeek-V3's performance, rather than extremely well-done post-training or even pretraining dataset construction.
Ah, so there could actually be a large compute overhang as it stands?