We can give a good estimate of the amount of compute they used given what they leaked. The supercomputer has tens of thousands of A100s (25k according to the JP Morgan note), and they trained GPT-3.5 on it first, about a year ago, and then GPT-4. They also say they finished training GPT-4 in August, which gives a maximum training time of 3-4 months.
25k A100 GPUs * 300 TFLOP/s dense FP16 * 50% peak efficiency * 90 days * 86,400 s/day is roughly 3e25 FLOPs, which is almost 10x PaLM and 100x Chinchilla/GPT-3.
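A minimal sketch of that arithmetic (the GPU count, per-GPU throughput, utilization, and training duration are all the assumed figures above, not confirmed numbers):

```python
# Back-of-the-envelope training compute estimate. All inputs are the
# assumptions stated above (25k A100s, ~300 TFLOP/s dense FP16 each,
# ~50% of peak sustained, ~90 days of training), not confirmed figures.
n_gpus = 25_000
peak_flops_per_gpu = 300e12      # dense FP16, rounded from the A100 spec
utilization = 0.5                # assumed fraction of peak actually sustained
seconds = 90 * 86_400            # ~3 months of wall-clock training

total_flops = n_gpus * peak_flops_per_gpu * utilization * seconds
print(f"{total_flops:.1e} FLOPs")  # ~2.9e25, i.e. roughly 3e25
```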
Where do you get the 3-4 months max training time from? GPT-3.5 was made available March 15th, so if they made that available immediately after it finished training, that would still have left 5 months for training GPT-4. And more realistically, they finished training GPT-3.5 quite a bit earlier, leaving 6+ months for GPT-4's training.
According to the Chinchilla paper, a compute-optimal model at that compute budget should have ~500B parameters and have used ~10T tokens. Based on GPT-4's demonstrated capabilities though, that's probably an overestimate.
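A quick sketch of where those numbers come from, using the common rules of thumb C ≈ 6·N·D and roughly 20 tokens per parameter (both approximations, not exact values from the Chinchilla paper):

```python
import math

# Compute-optimal allocation under the usual approximations:
#   C ≈ 6 * N * D   (training FLOPs for N parameters on D tokens)
#   D ≈ 20 * N      (Chinchilla's roughly 20 tokens per parameter)
C = 3e25                       # compute estimate from above, in FLOPs
N = math.sqrt(C / (6 * 20))    # compute-optimal parameter count
D = 20 * N                     # compute-optimal token count

print(f"params ≈ {N:.1e}, tokens ≈ {D:.1e}")  # ~5e11 params, ~1e13 tokens
```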
Yeah, agreed. I think it would make sense that it's trained on 10x-20x the number of tokens of GPT-3, so around 3-5T tokens (2x-3x Chinchilla), and that would give around 200-300B parameters given those laws.
Are you saying that you would have expected GPT-4 to be stronger if it was 500B+10T? Is that based on benchmarks/extrapolations or vibes?
Sorry for the late reply, but yeah, it was mostly vibes based on what I'd seen before. I've been looking over the benchmarks in the Technical Report again though, and I'm starting to feel like 500B+10T isn't too far off. Although the language benchmarks are fairly similar, the improvement in mathematical capabilities over the previous SOTA is much larger than I first realised, and seems to match a model of that size, judging by the performance of the conventionally trained PaLM and its derivatives.
What is the source for the “JP Morgan note”?
https://www.reddit.com/r/mlscaling/comments/11pnhpf/morgan_stanley_note_on_gpt45_training_demands/