Simple reasons for DeepSeek V3 and R1 efficiencies:
- 32B active parameters instead of likely ~220B for GPT4 ⇒ 6.8x lower training and inference cost
- 8-bit training instead of 16-bit ⇒ 4x lower training cost
- No margin on commercial inference ⇒ ?x, maybe 3x
- Multi-token training ⇒ ~2x training efficiency, ~3x inference efficiency, and lower inference latency by baking in “predictive decoding”, possibly 4x fewer training steps for the same number of tokens if predicting tokens only once
- And additional cost savings from memory optimization, especially for long contexts (Multi-Head Latent Attention) ⇒ ?x
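Taken at face value, these multipliers compound to a very large factor. A minimal sketch, using the claimed numbers above (several of which are disputed in the replies below):

```python
# Compounding the claimed factors from the list above. These are the post's
# own numbers; replies below correct several (e.g. FP8 gives ~2x, not 4x).
active = 220 / 32        # ~6.9x from fewer active parameters
fp8 = 4                  # claimed 8-bit vs 16-bit training factor
multi_token_train = 2    # claimed multi-token training efficiency
multi_token_infer = 3    # claimed multi-token inference efficiency
margin = 3               # claimed no-margin factor (inference only)

print(f"claimed training factor: ~{active * fp8 * multi_token_train:.0f}x")    # ~55x
print(f"claimed inference factor: ~{active * multi_token_infer * margin:.0f}x")  # ~62x
```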
Nothing here is very surprising (except maybe the last bullet point for me, because I know less about it).
The surprising part is why big AI labs were not pursuing these obvious strategies.
Int8 was obvious, multi-token prediction was obvious, and more and smaller experts in MoE were obvious. All three have already been demonstrated and published in the literature. They may be bottlenecked by communication, GPU utilization, and memory for the largest models.
> 32B active parameters instead of likely ~220B for GPT4
It’s 37B instead of maybe 280B (non-expert parameters also count), but in any case the question is how this manages to maintain quality. If this wasn’t an issue, why not 8B active parameters, or 1M active parameters?
> 32B active parameters instead of likely ~220B for GPT4 ⇒ 6.8x lower training … cost
Doesn’t follow, training cost scales with the number of training tokens. In this case DeepSeek-V3 uses maybe 1.5x-2x more tokens than original GPT-4.
The training costs are maybe 5e24 FLOPs and 2e25 FLOPs, differ by 4x. DeepSeek-V3 is better than original GPT-4 though, you need to compare with GPT-4o, which almost certainly uses more compute in training than original GPT-4 (maybe 4x more, so maybe 16x more than DeepSeek-V3).
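As a cross-check on those totals, the common approximation "training FLOPs ≈ 6 · N · D" lands in the same ballpark, coming in somewhat under the ~5e24 figure above since it ignores attention and other overheads. The token counts here are partly assumptions (DeepSeek-V3 reports ~14.8T tokens; ~13T for original GPT-4 is a rumor):

```python
# Cross-check with the common "training FLOPs ≈ 6 * N * D" approximation.
# DeepSeek-V3's ~14.8T tokens is from its report; GPT-4's ~13T is a rumor.
def train_flops(n_active_params, n_tokens):
    return 6 * n_active_params * n_tokens

v3 = train_flops(37e9, 14.8e12)     # ~3.3e24 FLOPs
gpt4 = train_flops(280e9, 13e12)    # ~2.2e25 FLOPs
print(f"V3 ~{v3:.1e}, GPT-4 ~{gpt4:.1e}, ratio ~{gpt4 / v3:.1f}x")  # ~6.6x
```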
> 8bits training instead of 16bits ⇒ 4x lower training cost
FLOP/s for FP8 are almost always 2x the FLOP/s for BF16, not 4x.
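For concreteness, the H100 SXM datasheet peaks (quoted from memory, so treat as indicative) show exactly this 2x gap:

```python
# Approximate H100 SXM dense Tensor Core peaks, from NVIDIA's datasheet
# (recalled from memory; indicative, not authoritative).
bf16_tflops = 989.5
fp8_tflops = 1978.9
print(f"FP8 / BF16 peak ratio: {fp8_tflops / bf16_tflops:.1f}x")  # ~2.0x
```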
> Multi-token training ⇒ ~2x training efficiency
You still train on every token. There is an additional “layer” in model parameters that predicts the token-after-next (Figure 3 in the paper), so there’s a bit of overhead in training (not much, with 61 total layers). The results are better, but not that much better (Table 4).
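The overhead is easy to bound: the MTP module is roughly one extra transformer layer's worth of compute on top of the 61 main layers.

```python
# Rough MTP training overhead: about one extra transformer-layer's worth of
# compute per token on top of 61 main layers (ignoring the extra output
# head, so this slightly understates it).
print(f"~{1 / 61:.1%} extra compute")  # ~1.6%
```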
> 32B active parameters instead of likely ~220B for GPT4 ⇒ 6.8x lower training … cost
> Doesn’t follow, training cost scales with the number of training tokens. In this case DeepSeek-V3 uses maybe 1.5x-2x more tokens than original GPT-4.
Thanks for your corrections, that’s welcome. Each of the points above is a relative comparison with more or less everything else kept constant. In this bullet point, by “training cost”, I mostly had in mind “training cost per token”:
32B active parameters instead of likely ~ ~~220~~ 280B for GPT4 ⇒ ~~6.8~~ 8.7x lower training cost per token.
> If this wasn’t an issue, why not 8B active parameters, or 1M active parameters?
From what I remember, the training-compute optimal number of experts was like 64, given implementations a few years old (I don’t remember how many activated at the same time in this old paper). Given newer implementations and aiming for inference-compute optimality, it seems logical that more than 64 experts could be great.
> You still train on every token.
Right, that’s why I wrote: “possibly 4x fewer training steps for the same number of tokens if predicting tokens only once” (assuming predicting 4 tokens at a time), but that’s not demonstrated nor published (given my limited knowledge on this).
> From what I remember, the training-compute optimal number of experts was like 64
I think it only gets better with more experts if you keep the number of active parameters unchanged. Is there some setting where it gets worse after a while? There certainly are engineering difficulties and diminishing returns.
Also, the number of activated experts can vary (there are 8 activated routed experts in DeepSeek-V3 out of the total of 256), so “number of experts” doesn’t really capture the ratio of total to activated, probably not a good anchor by itself.
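To make that concrete, here is a back-of-the-envelope sketch of DeepSeek-V3's MoE parameter split, with config values recalled from the paper (hidden size 7168, expert FFN width 2048, 61 layers of which the first 3 are dense, 256 routed experts with top-8 routing plus 1 shared expert); treat the numbers as approximate:

```python
# Back-of-the-envelope MoE parameter split for DeepSeek-V3. Config values
# are recalled from the paper; approximate, not authoritative.
d_model, d_expert = 7168, 2048
moe_layers = 61 - 3                       # first 3 layers use dense FFNs
n_routed, top_k, n_shared = 256, 8, 1
expert_params = 3 * d_model * d_expert    # gated FFN: gate, up, down matrices

total_moe = moe_layers * n_routed * expert_params               # ~654B
active_moe = moe_layers * (top_k + n_shared) * expert_params    # ~23B
print(f"routed experts total ~{total_moe / 1e9:.0f}B, active MoE ~{active_moe / 1e9:.0f}B")
# Attention, dense layers, and embeddings bring the active total to ~37B
# out of ~671B, so total experts and active parameters move independently.
```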
> Given newer implementations and aiming for inference-compute optimality, it seems logical that more than 64 experts could be great.
This still doesn’t help with the question of why 37B active parameters is sufficient. Even with 100500 experts you can’t expect 1B active parameters to be sufficient to maintain GPT-4 quality. The rumor for original GPT-4 is that it has 2 activated experts out of 16 in total, so the ratio is 1:8, while for DeepSeek-V3 it’s 1:32.
> that’s why I wrote: “possibly 4x fewer training steps for the same number of tokens if predicting tokens only once” (assuming predicting 4 tokens at a time), but that’s not demonstrated nor published (given my limited knowledge on this)
Not sure how to parse this, my point is that the number of training steps remains the same, training efficiency doesn’t significantly increase, there’s even slight overhead from adding the predict-the-token-after-next blocks of parameters. This is described in Section 2.2 of the paper. You get better speculative decoding at inference time (and also better quality of output), but training time is the same, not with 2x fewer or 4x fewer steps.
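For intuition on the inference-side gain: with one draft token per step, expected tokens per decoding step is 1 + acceptance rate. If I recall the report correctly, the second token's acceptance rate is ~85-90%, which matches the ~1.8x TPS improvement DeepSeek cites:

```python
# Expected decode throughput from speculative decoding with one draft token:
# each step emits the verified next token, plus the draft token whenever it
# is accepted. Acceptance rates are approximate recollections of the report.
for acceptance in (0.85, 0.90):
    print(f"acceptance {acceptance:.0%}: ~{1 + acceptance:.2f} tokens/step")
```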
> 32B active parameters instead of likely ~ ~~220~~ 280B for GPT4 ⇒ ~~6.8~~ 8.7x lower training cost per token.
It’s 37B active parameters, not 32B.