Forecasting ML Benchmarks in 2023

Link post

Thanks to Collin Burns, Ruiqi Zhong, Cassidy Laidlaw, Jean-Stanislas Denain, and Erik Jones, who generated most of the considerations discussed in this post.

Previously, I evaluated the accuracy of forecasts about performance on the MATH and MMLU (Massive Multitask) datasets. I argued that most people, including myself, significantly underestimated the rate of progress, and encouraged ML researchers to make forecasts for the next year in order to become more calibrated.

In that spirit, I’ll offer my own forecasts for state-of-the-art performance on MATH and MMLU. Following the corresponding Metaculus questions, I’ll forecast accuracy as of June 30, 2023. My forecasts are based on a one-hour exercise I performed with my research group, where we brainstormed considerations, looked up relevant information, formed initial forecasts, discussed, and then made updated forecasts. It was fairly easy to devote one group meeting to this, and I’d encourage other research groups to do the same.

Below, I’ll describe my reasoning for the MATH and MMLU forecasts in turn. I’ll review relevant background info, describe the key considerations we brainstormed followed, analyze those considerations, and then give my bottom-line forecast.

MATH

Background

Metaculus does a good job of describing the MATH dataset and corresponding forecasting question:

The MATH dataset is a dataset of challenging high school mathematics problems constructed by Hendrycks et al. (2021). Hypermind forecasters were commissioned to predict state-of-the-art performance on June 30, 2022, ’23, ‘24, and ‘25. The 2022 result of 50.3% was significantly outside forecasters’ prediction intervals, so we’re seeing what the updated forecasts are for 2023, ’24, and ’25.

What will be state-of-the-art performance on the MATH dataset in the following years?

These questions should resolve identically to the Hypermind forecasts:

“These questions resolve as the highest performance achieved on MATH by June 30 in the following years by an eligible model.

Eligible models may use scratch space before outputting an answer (if desired) and may be trained in any way that does not use the test set (few-shot, fine tuned, etc.). The model need not be publicly released, as long as the resulting performance itself is reported in a published paper (on arxiv or a major ML conference) or through an official communication channel of an industry lab (e.g. claimed in a research blog post on the OpenAI blog, or a press release). In case of ambiguity, the question will resolve according to Jacob Steinhardt’s expert judgement.”

It’s perhaps a bit sketchy for me to be both making and resolving the forecast, but I expect in most cases the answer will be unambiguous.

Key Considerations

Below I list key considerations generated during our brainstorming:

  • Why did Minerva do well on MATH? Is it easy to scale up those methods? Is there other low-hanging fruit?

  • What kinds of errors is Minerva making? Do they seem easy or hard to fix?

  • Minerva was trained on arXiv and other sources of technical writing. How much additional such data could be generated?

  • Are there other methods that could lead to improvement on mathematical reasoning?

    • Possibilities: self-supervised learning, verifiers, data retrieval

  • Base rates: What has been the historical rate of progress on MATH?

  • Base rates: How does progress typically occur on machine learning datasets (especially NLP datasets)? If there is a sudden large improvement, does that typically continue, or level off?

  • How much will people work on improving MATH performance?

Analyzing Key Consideratoins

Why did Minerva do well? How much low-hanging fruit is there?

Minerva incorporated several changes that improved performance relative to previous attempts:

  1. Chain-of-thought prompting: this started in earnest with PALM, which Minerva is based on. There are straightforward ways to continue improving on it, e.g. chain-of-thought is currently few-shot and one could fine-tune to improve further, or one could use follow-up prompts to try to fix errors in the first prompt’s reasoning. It seems to be an active area of interest in ML so it’s likely there will be further progress here.

    • Based on Figure 10 of the PaLM paper, chain-of-thought currently gives a 40% boost on GSM-8k (perhaps an easy data set), and gains ranging from around 5%-25% on other tasks.

  2. Training on more math data: Minerva uses continued pre-training on math and other technical data. Right now this data is 5% as large as the original pretraining corpus. If there is more available data, it could be a straightforward way to get further improvements (see below for estimates of available data).

  3. Large models: the largest version of Minerva is very large (540B parameters) and does 7% better than the 62B parameter model. It seems like it would be relatively expensive to continue improving performance solely by scaling up (but see below on undertraining).

  4. Minerva’s web scraping avoids filtering out math and LaTeX, while previous scrapers often did.

  5. Using majority vote to aggregate over multiple generated solutions improves performance significantly: by almost 17% for both the 62B and 540B parameter models. “Intuitively, the reason majority voting improves performance is that while there are many ways to answer a question incorrectly, there are typically very few ways to answer correctly.”

    • One could imagine a better aggregation method as majority vote is fairly simple. How much better could we hope to do? The Minerva paper estimates that the right answer (with correct reasoning) occurs among the top 256 samples at least 68% of the time for the 62B parameter model, which is 40% above the top-1 baseline and 25% above the majority vote method. So it seems very plausible to improve this further.

Other log-hanging fruit:

  • Minerva is based on PALM, which was significantly undertrained according to the Chinchilla paper. So at the same compute budget, a better-trained model would have higher performance. How much does this matter? Based on Figure 3 of the PaLM paper, it looks like Chinchilla is about as good as a PaLM-style model that is 4x bigger.


    I’d guess that corresponds to about a 4% improvement in the case of the MATH dataset (since making the model 8.7x bigger was a 7% improvement).

  • Minerva itself is also undertrained: Table 2 of the Minerva paper states that it only used 26B tokens during fine-tuning, which is less than one epoch (the fine-tuning dataset had 38.5B tokens). I wouldn’t be surprised if training further also gave a ~5% improvement, although the actual amount could be significantly larger or smaller.

Overall summary: the lowest-hanging fruit towards further improvement would be (in order):

  • Improving over majority vote (up to 25% improvement, easy to imagine an 8% improvement),

  • fine-tuning on more data (unknown improvement, intuitively around 5%),

  • a better-trained version of PaLM (expensive so not clear it will happen, but probably a 4% improvement),

  • improving chain-of-thought prompting (easy to imagine a 3-5% improvement, possible to imagine a 10% improvement),

  • training a larger model (not obvious it will happen, but probably a couple percent improvement if so),

  • perhaps small gains from improving web scraping as well as tokenization.

Aggregating these, it feels easy to imagine a >14% improvement, fairly plausible to get >21%, and >28% doesn’t seem out of the question. Concretely, conditional on Google or some other large organization deciding to try to further improve MATH performance, my prediction of how much they would improve it in the next year would be:

  • 25th percentile: 14%

  • 50th percentile: 21%

  • 80th percentile: 28%

(This prediction is specifically using the “how much low-hanging fruit” frame. I’ll also consider other perspectives, like trend lines, and average with these other perspectives when making a final forecast.)

What kinds of errors is Minerva making? Do they seem easy or hard to fix?

As noted above, the 62B parameter model has best-of-256 performance (filtered for correct reasoning) of at least 68%. My guess is that the true best-of-256 performance is in the low-to-mid 70s for 62B. Since Minerva-540B is 7% better than Minerva-62B, the model is at least capable of generating the correct answer around 80% of the time.

We can also look at errors by type of error. For instance, we estimated that calculation errors accounted for around 30% of the remaining errors (or around 15% absolute performance). These are probably fairly easy to fix.

In the other direction, the remaining MATH questions are harder than the ones that Minerva solves currently. I couldn’t find results grouped by difficulty, but Figure 4 of the Minerva paper shows lower accuracy for harder subtopics such as Intermediate Algebra.

How much additional data could be generated for training?

We estimated that using all of arXiv would only generate about 10B words of mathematical content, compared to the 20B tokens used in Minerva. At a conversation rate of 2 tokens/​word, this suggests that Minerva is already using up most relevant content on arXiv. I’d similarly guess that Minerva makes use of most math-focused web pages currently on the internet (it looks for everything with MathJax). I’d guess it’s possible to find more (e.g. math textbooks) as well as to synthetically generate mathematical exposition, and probably also to clean the existing data better. But overall I’d guess there aren’t huge remaining gains here.

Could other methods improve mathematical reasoning?

For math specifically, it’s possible to use calculators and verifiers, which aren’t used by Minerva but could further improve performance. Table 9 of the PaLM paper shows that giving PaLM a calculator led to a 4% increase in performance on GSM8K (much smaller than the gains from chain-of-thought prompting).

In the same table, we see that GPT-3 gets a 20% gain using a task-specific verifier. Given that the MATH problems are fairly diverse compared to GSM8K, I doubt it will be easy to write an effective verifier for that domain, and it’s unclear whether researchers will seriously try in the next year. The calculator seems more straightforward and I’d give a ~50% chance that someone tries it (conditional on there being at least one industry lab paper that focuses on math in the next year).

Historical Rate of Progress on MATH

  • As of 03/​05/​2021: 6.9%.

  • As of 06/​30/​2022: 50.3%.

This is a roughly 2.9% accuracy gain per month (but almost certainly will be slower in future). Taking this extrapolation literally would give 85.1% for 06/​30/​2023.

Historical Rate of Progress on Other Datasets

The Dynabench paper plots historical progress on a number of ML datasets, normalized by baseline and ceiling performance (see Figure 1, reproduced below).

We seem to often see immediate huge gains, while the next ones are somewhat slower.

Here’s another benchmark for reference. It got 67% → 86% within 1-2 months, then took 4 months to break 90%.

Overall, it seems clear we should expect some sort of slow-down. In some cases, the slow-down was huge. I think progress should not slow down that much in this case since there’s still lots of low-hanging fruit. Maybe progress is 60% as fast as before? So that would give us 71% on 06/​30/​2023.

How much will people work on improving MATH performance?

Two sources of progress:

  • General increased scaling of language models

  • Specific efforts to improve math /​ quantitative reasoning

How many language papers have been released historically?

  • GPT-2: 02/​2019 (OpenAI)

  • GPT-3: 03/​2020 (OpenAI)

  • UnifiedQA: 05/​2020 (AI2)

  • Gopher: 12/​2021 (DeepMind)

  • Chinchilla: 03/​2022 (DeepMind)

  • PaLM: 04/​2022 (Google)

(This only counts language models that achieved broad state-of-the-art performance. E.g. I’m ignoring OPT, BLOOM, GPT-J, etc.)

By this count, there have been 6 papers since the beginning of 2019. So base rate of around 1.7 /​ year. If we use a Poisson process, predicts that we will see 0 new papers with probability 18%, 1 with probability 31%, 2 with probability 26%, and >2 with probability 25%.

What about math-specific work? Harder to measure what “counts” (lots of math papers but how many are large-scale /​ pushing state-of-the-art?). Intuitively I’d expect more like 1.1 such papers per year. So around 33% chance of zero, 37% chance of 1, 20% chance of 2, 10% chance of >2.

An important special case is if there are no developments on either the language models or the math-specific front. Under the above model these have probabilities 18% and 33%, and are probably positively correlated. Additionally, it’s possible that language model papers might not bother to evaluate on MATH or might not use all the ideas in the Minerva paper (and thus fail to hit SOTA). Combining these considerations, I’d forecast around a 12% chance that there is no significant progress on MATH on any front.

Bottom-Line Forecast

From the above lines of reasoning, we have a few different angles on the problem:

  • Looking at rates of publication of language model papers suggests a 12% chance of no major developments at all (e.g. between 0% and 5% progress).

  • Thinking in detail about possible sources of improvements gives something like 25th percentile: 14% gain (to 64%); 50th percentile: 21% gain (to 71%); 80th percentile: 28% gain (to 78%). But these were all conditional on some progress happening, so should adjust down by the 12% chance of no progress at all.

  • Extrapolating base rates and adjusting for slowdowns in progress gives a forecast of 71% (with probably a ceiling forecast of 85%).

  • Looking at how easy it would be to fix current flaws suggests that 15% should be relatively easy (calculation errors). Up to 30% would “only” require better re-ranking (i.e., a correct solution is in the top 256 ones generated).

  • I’ve underestimated progress in the past so should potentially adjust upwards.

If I intuitively combine these, I produce the following forecast:

  • 12th percentile: 55%

  • 33rd percentile: 63%

  • 50th percentile: 71%

  • 80th percentile: 80%

  • 90th percentile: 89%

The Metaculus community is at 74 median, upper 75% of 83. So I’ll adjust up slightly more. New forecast adjusted towards community prediction:

  • 10th percentile: 55%

  • 33rd percentile: 66%

  • Median: 73%

  • 80th percentile: 84%

  • 90th percentile: 90%

Rough approximation of this distribution on Metaculus (red is me, green is the community prediction):


Interestingly, Hypermind forecasts a much smaller median of 64.1%.

MMLU Forecast

Background

Again borrowing from Metaculus:

The Massive Multitask Language Understanding (MMLU) dataset is a dataset of high school, college, and professional multiple choice exams that test expert subject knowledge. It was constructed by Hendrycks et al. (2021). Hypermind forecasters were commissioned to predict state-of-the-art performance on June 30, 2022, ’23, ‘24, and ‘25. The 2022 result of 67.5% was significantly outside forecasters’ prediction intervals, so we’re seeing what the updated forecasts are for 2023, ’24, and ’25.

What will be state-of-the-art accuracy on the Massive Multitask dataset in the following years?

These questions should resolve identically to the Hypermind forecasts:

“These questions resolve as the highest performance achieved on MMLU by June 30 in the following years by an eligible model. Eligible models must not have been specifically trained on data from the MMLU dataset. A model need not be publicly released, as long as the resulting performance itself is reported in a published paper (on arxiv or a major ML conference) or through an official communication channel of an industry lab (e.g. claimed in a research blog post on the OpenAI blog, or a press release). If there’s uncertainty about whether something counts, we will defer to this leaderboard.”

Key Considerations

At a high level, these are fairly similar to those of the MATH dataset. Since more people have worked on MMLU and there’s been steadier progress, we rely more on base rates and less on detailed considerations of how one could improve it further.

  • Base rate: What has been the progress on MMLU to date?

  • Base rate: How does progress typically occur on machine learning datasets (especially NLP datasets)? If there is a sudden large improvement, does that typically continue, or level off? [Same as previous consideration for MATH]

  • The two models Chinchilla and Minerva do well on different subsets of MMLU. What happens if we combine them together?

  • How much other low-hanging fruit is there?

  • How much will people work on improving MMLU performance?

Analyzing Key Considerations

Historical Rate of Progress on MMLU

Below is a time series of MMLU results, taken from the MMLU leaderboard (note MMLU was published in Jan. 2021). I’ve bolded few-shot/​zero-shot results.

Model Date Average
Chinchilla (70B, few-shot) Mar 29, 2022 67.5
Gopher (280B, few-shot) Dec 8, 2021 60.0
GPT-3 (175B, fine-tuned) Jul 22, 2020 53.9
UnifiedQA Oct 7, 2020 48.9
GPT-3 (175B, few-shot) Jul 22, 2020 43.9
GPT-3 (6.7B, fine-tuned) 43.2
GPT-2 32.4

If we restrict to few-shot results, we see:

  • +7.5 from Dec → Mar (3 months)

  • +11.1 from Oct ’20 → Dec ’21 (14 months)

  • +16.1 from July ’20 → Dec ’21 (17 months)

It’s not clear which time horizon is best to use here. I came up with an approximate base rate of 1.2 pts /​ month.

Other notes:

  • Source of improvements: better training, more compute, maybe better pretraining data?

  • Fine-tuning seems to add 10 points, so a potentially easy source of low-hanging fruit.

Historical Rate of Progress on Other Datasets

We analyzed this already in the previous section on MATH. It seems like there’s usually an initial period of rapid progress, followed by a slow-down. However, MMLU has had enough attempts that I’d say it’s past the “huge initial gains” stage. Therefore, I don’t expect as much as a level-off compared to MATH, even though there is less obvious low-hanging fruit—maybe we’ll get 75% as fast of progress as before. This would suggest +10.8 points over the next year.

Combining Chinchilla and Minerva

The current SOTA of 67.5 comes from Chinchilla. But Minerva does much better than Chinchilla on the MMLU-STEM subset of MMLU. Here’s a rough calculation of how much taking max(Chinchilla, Minerva) would improve things:

  • Chinchilla gets 54.9% on MMLU-STEM

  • PaLM gets 58.7%

  • Minerva gets 75.0% with majority vote, 63.9% without

  • STEM is 19 /​ 57 of the tasks.

So adding in Minerva would add (75% − 54.9%) * 1957 = 6.7% points of accuracy.

Will this happen? It’s not obvious, since PaLM is owned by Google and Chinchilla is owned by DeepMind. At least one org would need to train a new model. I think there’s a good chance this happens, but not certain (~65% probability).

Other Low-Hanging Fruit

Result of a quick brainstorm:

  • External calculators + other STEM-specific improvements (similar to MATH)

  • Some of the chain-of-thought improvements could help with other parts of MMLU (beyond STEM) especially if it helps with error correction.

  • General knowledge retrieval

In addition, the STEM-specific improvements (e.g. Minerva) will continue to improve MMLU-STEM. Based on the MATH forecast above, on median I expect about half as much improvement over the next year as we saw from the Minerva paper, or around another 3% improvement on MMLU overall (since Minerva gave a 6.7% improvement).

We thought it was possible but unlikely that there are significant advances in general knowledge retrieval in the next year that also get used by MMLU (~20% probability).

How much will people work on improving MMLU performance?

Unlike MATH, there is nothing “special” that makes MMLU stand out from other language modeling benchmarks. So I’d guess most gains will come from general-purpose improvements to language models, plus a bit of STEM-specific improvement if people focus on quantitative reasoning.

Bottom-Line Forecast

In some sense, MMLU performance is already “at” 74.2% because of the Minerva result. Additional low-hanging fruit would push us up another 5 points to 79.2%. Alternately, simply extrapolating historical progress would suggest 10.8 points of improvement, or 85%. Putting these together, I’d be inclined towards a median of 83%.

If we instead say that progress doesn’t slow down at all, we’d get 89%.

As before, I’d give an 18% chance of no new SOTA language model papers, in which case MMLU performance likely stays between 67.5% and 74.2%. This also means we should adjust the previous numbers down a bit.

Overall forecast:

  • 18th percentile: 74%

  • 25th percentile: 77%

  • 50th percentile: 82%

  • 75th percentile: 89%

This seems pretty similar to the Metaculus community prediction, so I won’t do any further adjustment.


Interestingly, the Hypermind median is only at 72.5% right now. Given the ability to combine Minerva + Chinchilla, this intuitively seems too low to me.

Looking Ahead

My personal forecasts ended up being pretty similar to the Metaculus community forecasts, aside from me expecting slightly slower MATH progress (but only by about a percentage point). So, we can ask what Metaculus expects for 2024 and 2025 as well, as an approximation to what I “would” believe if I thought about it more.

MATH forecast (community prediction in green, top row of each cell):

MMLU forecast (community prediction in green):

So, on median Metaculus expects MATH to be at 83% in 2024 and at 88% in 2025. It expects MMLU to be at 88% in 2024 and at 93% (!) in 2025. The last one is particularly interesting: since MMLU tests domain-specific subject knowledge across many areas, it is predicting that a single model will be able to match domain-specific expert performance across a wide variety of written subject exams.

Do you agree with these forecasts? Disagree? I strongly encourage you to leave your own forecasts on Metaculus: here for MATH, and here for MMLU.