Retrospective on ‘GPT-4 Predictions’ After the Release of GPT-4

In February 2023, I wrote a post titled “GPT-4 Predictions” that attempted to predict the properties and capabilities of OpenAI’s GPT-4 model using scaling laws and knowledge of past models such as GPT-3. Now that GPT-4 has been released, I’d like to evaluate those predictions.
Unfortunately, since the GPT-4 technical report has limited information on GPT-4’s training process and model properties, I can’t evaluate all the predictions. Nevertheless, I believe I can evaluate enough of them right now to yield useful insights.
GPT-4 release date
OpenAI released GPT-4 on 14 March 2023.
In the post, I mentioned that Metaculus gave a 50% chance of GPT-4 being released by May 2023, so I expected the model to arrive around the middle of the year. The actual release was therefore earlier than I expected.
GPT-4 model properties
I predicted that GPT-4 would be a dense, text-only transformer language model like GPT-3, trained on more compute and data, with a similar number of parameters and a longer context window. OpenAI hasn’t yet published details such as the model’s parameter count, so I can’t evaluate these predictions yet.
My most obviously incorrect prediction was that GPT-4 would be a text-only language model like GPT-3. Instead, GPT-4 is a multimodal model that accepts both text and images as input, though it outputs only text.
Apart from that, I think my predictions about the model were mostly correct: GPT-4 is a pre-trained transformer language model trained using next-word prediction like its predecessors.
Training process
Number of GPUs used during training
Commenters on the post, such as LawrenceC and gwern, have noted that GPT-4 was probably trained on 15,000 GPUs or more. If true, my prediction that GPT-4 would be trained on 2,000 to 15,000 GPUs was an underestimate, and consequently I may have underpredicted GPT-4’s total training compute by about a factor of 2.
Training time
The OpenAI GPT-4 video states that GPT-4 finished training in August 2022. Given that GPT-3.5 finished training in early 2022, this suggests GPT-4 was trained for about 4 to 7 months. I originally predicted a training time of 1 to 6 months, which seems like an underestimate in retrospect.
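As a rough sanity check on the compute figure, total training compute can be estimated as GPUs × peak throughput × utilization × training time. The sketch below uses the commenters’ 15,000-GPU figure together with assumed values I chose for illustration (A100-class hardware at 312 TFLOP/s peak, 35% utilization, a 5-month run); none of these are confirmed by OpenAI.

```python
# Back-of-the-envelope training-compute estimate:
# FLOP = GPUs * peak FLOP/s * utilization * training seconds.
# All inputs below are illustrative assumptions, not confirmed figures.

SECONDS_PER_DAY = 86_400

def training_compute(n_gpus, peak_flops, utilization, days):
    """Estimate total training compute in FLOP."""
    return n_gpus * peak_flops * utilization * days * SECONDS_PER_DAY

# Assumed: 15,000 A100s (312 TFLOP/s peak BF16), 35% utilization, ~150 days.
flop = training_compute(n_gpus=15_000, peak_flops=312e12,
                        utilization=0.35, days=150)
print(f"{flop:.1e} FLOP")  # roughly 2e25 FLOP under these assumptions
```

Halving the GPU count to the middle of my predicted range roughly halves this estimate, which is consistent with the factor-of-2 compute underprediction mentioned above.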
GPT-4 performance
MMLU performance
Fortunately, both my post and the GPT-4 technical report referenced the MMLU benchmark. In the post, I predicted that GPT-4 would set a new record on MMLU: specifically, given my prediction of the model’s loss, I estimated it would achieve 79.4% accuracy, better than the previous record of 75.2% set by a fine-tuned version of PaLM.
GPT-4 in fact achieved 86.4% on the MMLU benchmark, which is a new record and higher than I predicted. My prediction versus GPT-4’s actual accuracy is shown in the following graph. Note that since I don’t know GPT-4’s actual loss, I used its predicted loss in the graph.
Figure: the red x’s show the estimated losses of past models such as GPT-3, PaLM, and Chinchilla on the x-axis and their accuracies on the benchmark on the y-axis. The blue line is a linear regression.
The percent error between my prediction and the true value is 8.1%, which seems like a fairly accurate prediction [1].
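The 8.1% figure is just the relative error between the predicted and actual MMLU accuracies:

```python
def percent_error(predicted, actual):
    """Relative error of a prediction, as a percentage of the actual value."""
    return abs(predicted - actual) / actual * 100

# Predicted 79.4% accuracy; GPT-4 actually scored 86.4%.
print(round(percent_error(predicted=79.4, actual=86.4), 1))  # 8.1
```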
GPT-4 writing ability
Based on the improvement trend in the GPT-3 paper, I also predicted that human evaluators would be able to distinguish model-generated text from human-written text only about 50% of the time, i.e., at chance level. In other words, I predicted that GPT-4’s text would be effectively indistinguishable from human-written text.
In my personal experience, GPT-4-generated text does seem indistinguishable from human-written text, though there doesn’t appear to be any quantitative evaluation of this metric for GPT-4 yet.
Context length
Given that GPT-3 and GPT-3.5 had context lengths of 2048 and 4096 tokens respectively, my guess was that GPT-4 would have a context length of 8192 tokens.
According to the OpenAI API, one of the GPT-4 models does indeed have a context length of 8192 tokens. However, there is another variant with a 32,768-token context. My prediction was therefore partially correct but underestimated the increase in context length.
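The guess followed a simple doubling trend across generations, which the 32k variant then outran:

```python
# Context length doubled from GPT-3 (2048) to GPT-3.5 (4096),
# so the naive extrapolation for GPT-4 is one more doubling.
known = {"GPT-3": 2048, "GPT-3.5": 4096}
predicted_gpt4 = known["GPT-3.5"] * 2
print(predicted_gpt4)  # 8192

# The 32,768-token GPT-4 variant is a further 4x beyond this trend.
print(32_768 // predicted_gpt4)  # 4
```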
Prediction framework
My predictions of GPT-4’s performance were based on the following assumptions:
1. Model loss can be accurately estimated using scaling laws, given inputs such as the number of parameters, the amount of training compute, and the amount of training data.
2. There is a power-law relationship between increases in these inputs and decreases in loss.
3. Decreases in model loss are linearly correlated with improved performance on benchmarks such as MMLU [2].
4. GPT-4 includes no major algorithmic advances that would significantly increase the model’s compute efficiency, data efficiency, or performance.
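As a sketch of how these assumptions chain together, a power-law fit maps compute to loss, and a linear fit maps loss to benchmark accuracy. All of the data points below are made up for illustration; the actual predictions used published scaling-law coefficients and benchmark results.

```python
import numpy as np

# Step 1: power-law scaling of loss with compute, L = a * C^(-b).
# Fit in log-log space, where a power law becomes a straight line.
compute = np.array([1e21, 1e22, 1e23, 1e24])   # training FLOP (illustrative)
loss = np.array([2.6, 2.35, 2.12, 1.92])       # eval loss (illustrative)
b, log_a = np.polyfit(np.log(compute), np.log(loss), 1)

def predict_loss(c):
    """Predicted loss at training compute c, from the power-law fit."""
    return np.exp(log_a) * c ** b

# Step 2: linear map from loss to benchmark accuracy
# (the linear loss-to-accuracy assumption above).
losses = np.array([2.6, 2.35, 2.12])
mmlu = np.array([0.44, 0.54, 0.63])            # accuracies (illustrative)
slope, intercept = np.polyfit(losses, mmlu, 1)

# Chain the two fits: compute -> loss -> predicted MMLU accuracy.
gpt4_loss = predict_loss(2e25)
print(round(slope * gpt4_loss + intercept, 3))  # a toy prediction, roughly 0.7-0.9
```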
The prediction framework is summarized in this diagram:
Despite these simplifying assumptions, which could have limited the accuracy of the prediction model, I believe I was able to predict GPT-4’s loss and at least some of its capabilities fairly accurately, using scaling laws derived from the behavior of smaller models together with knowledge of the capabilities, model properties, and training processes of GPT-3 and similar models.
Similarly, the GPT-4 technical report includes details on how OpenAI used smaller models to predict GPT-4’s performance:
“A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4’s performance.”
Given that OpenAI has full access to all information about GPT-3 and GPT-4, their predictions were probably more accurate than mine.
Limitations of the framework
I think the biggest limitation of the framework is its neglect of algorithmic advances, such as the introduction of image inputs in GPT-4. Not taking algorithmic advances into account could also explain why I underestimated GPT-4’s improvement on the MMLU benchmark.
Although the average capabilities of language models tend to scale smoothly with more resources, specific capabilities can improve abruptly (so-called emergent capabilities). A model that predicts linear improvement on a given capability may therefore capture only a short, locally linear segment of a more complex non-linear trend. This suggests that predicting specific capabilities over the long term is significantly more difficult.
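A toy illustration of the point: a linear fit to a short early segment of a sigmoid-shaped capability curve extrapolates badly past the abrupt transition. The curve and its parameters here are invented purely for illustration.

```python
import math

def capability(scale):
    """Toy sigmoid: capability emerges abruptly around scale = 10."""
    return 1 / (1 + math.exp(-(scale - 10)))

# Linear fit through two nearby points on the flat early region...
x0, x1 = 4.0, 5.0
slope = capability(x1) - capability(x0)

def linear(x):
    """Tangent-like linear extrapolation from the early region."""
    return capability(x0) + slope * (x - x0)

# ...badly underestimates the capability after the transition.
print(round(capability(12), 3), round(linear(12), 3))  # 0.881 0.036
```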
Conclusions
GPT-4 was released earlier than I expected, so I published the “GPT-4 Predictions” post just a month before GPT-4’s release, which possibly limited its utility. Given that the post was mostly based on data from 2020 and 2021 on models such as GPT-3, I think I could have made the predictions much earlier without a significant loss of accuracy. For example, if I had written the post in early 2021, it would have preceded GPT-4’s release by two years.
I focused on benchmarks such as MMLU, but I can now see from the GPT-4 technical report that human exams such as the SAT are also useful for evaluating language models.
I didn’t make any predictions on the safety improvements of GPT-4 over GPT-3 and such predictions could have been insightful.
My predictions seem to be evidence that scaling laws and other quantitative methods can predict the general performance of language models, at least in the short term.
Given the increased effect of algorithmic advances on ML capabilities in the long term and the inherent unpredictability of scientific progress, I expect accurately predicting the capabilities of ML models in the long term (>5 years) to be much more challenging [3].
Footnotes

[1] As far as I know, the GPT-4 Technical Report also evaluates GPT-3.5 on the MMLU benchmark for the first time (source).

[2] This Anthropic paper notes that GPT-3’s MMLU performance improves very slowly below 10B parameters and then more quickly above that threshold, which is a non-linear relationship.

[3] There is evidence showing that algorithmic progress increases predictably over time.