We can give a good estimate of the amount of compute they used given what they leaked. The supercomputer has tens of thousands of A100s (25k according to the JP Morgan note), and they trained GPT-3.5 on it first, about a year ago, and then GPT-4. They also say they finished training GPT-4 in August, which gives a maximum training time of 3-4 months.
25k A100 GPUs * 300 TFLOP/s dense FP16 * 50% peak efficiency * 90 days * 86,400 s/day is roughly 3e25 FLOPs, which is almost 10x PaLM and 100x Chinchilla/GPT-3.
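A minimal sketch of that arithmetic (the GPU count, per-GPU throughput, utilization, and training duration are all the assumed figures above, not confirmed numbers):

```python
# Back-of-the-envelope training compute estimate. All inputs are the
# assumptions stated above (25k A100s, ~300 TFLOP/s dense FP16 each,
# ~50% of peak sustained, ~90 days of training), not confirmed figures.
n_gpus = 25_000
peak_flops_per_gpu = 300e12      # dense FP16, rounded from the A100 spec
utilization = 0.5                # assumed fraction of peak actually sustained
seconds = 90 * 86_400            # ~3 months of wall-clock training

total_flops = n_gpus * peak_flops_per_gpu * utilization * seconds
print(f"{total_flops:.1e} FLOPs")  # ~2.9e25, i.e. roughly 3e25
```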
Where do you get the 3-4 months max training time from? GPT-3.5 was made available March 15th, so if they made that available immediately after it finished training, that would still have left 5 months for training GPT-4. And more realistically, they finished training GPT-3.5 quite a bit earlier, leaving 6+ months for GPT-4's training.
According to the Chinchilla paper, a compute-optimal model at that compute budget should have ~500B parameters and have used ~10T tokens. Based on GPT-4's demonstrated capabilities though, that's probably an overestimate.
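A quick sketch of where those numbers come from, using the common rules of thumb C ≈ 6·N·D and roughly 20 tokens per parameter (both approximations, not exact values from the Chinchilla paper):

```python
import math

# Compute-optimal allocation under the usual approximations:
#   C ≈ 6 * N * D   (training FLOPs for N parameters on D tokens)
#   D ≈ 20 * N      (Chinchilla's roughly 20 tokens per parameter)
C = 3e25                       # compute estimate from above, in FLOPs
N = math.sqrt(C / (6 * 20))    # compute-optimal parameter count
D = 20 * N                     # compute-optimal token count

print(f"params ≈ {N:.1e}, tokens ≈ {D:.1e}")  # ~5e11 params, ~1e13 tokens
```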
Yeah, agreed. I think it would make sense that it's trained on 10x-20x the number of tokens of GPT-3, so around 3-5T tokens (2x-3x Chinchilla), and that would give around 200-300B parameters given those laws.
Are you saying that you would have expected GPT-4 to be stronger if it was 500B+10T? Is that based on benchmarks/extrapolations or vibes?
Sorry for the late reply, but yeah, it was mostly vibes based on what I'd seen before. I've been looking over the benchmarks in the Technical Report again though, and I'm starting to feel like 500B+10T isn't too far off. Although the language benchmarks are fairly similar, the improvement in mathematical capabilities over the previous SOTA is much larger than I first realised, and seems to match a model of that size, judging by the performance of the conventionally trained PaLM and its derivatives.
What is the source for the “JP Morgan note”?
https://www.reddit.com/r/mlscaling/comments/11pnhpf/morgan_stanley_note_on_gpt45_training_demands/