This is very valuable. I suggest putting this content on arXiv (even if it’s less formal than the typical paper).
It could be useful to look at performance of GPT-3 on foreign languages. We know roughly how long it takes humans to reach a given level in a foreign language. E.g. you might find GPT-3 is at a level on 15 different languages that would take a smart human (say) 30 months to achieve (2 months per language). Foreign languages are just a small fraction of the training data.
A few points:
Current models do pretty well on tricky math problems (Minerva), coding competition problems (AlphaCode), and multiple-choice quizzes at college level (MMLU).
In some ways, the models’ ability to learn from data is far superior to that of humans. For example, models trained mostly on English text are still pretty good at Spanish, while English speakers in parts of the US who hear Spanish (passively) every week of their lives usually retain almost nothing. The same is true for being able to imitate other styles or dialects of English, and for programming languages. (Humans after their earlier years can spend years hearing a foreign language every day and learn almost nothing! Most people need to make huge efforts to learn.)
RNNs are much worse than transformers at in-context learning. It’s not just a difference in generative text quality. See this study by DeepMind: https://twitter.com/FelixHill84/status/1524352818261499911
Very helpful post, thanks! Are there some meta-level lessons about forecasting a dataset like MATH? IIRC, at the time of these forecasts, the only results were GPT-2 finetuned and GPT-3 few-shot (without chain-of-thought and self-consistency). For GPT-2, the accuracy scores were <15% for nearly all subjects and difficulty levels. This may be consistent with GPT-2 either not really understanding questions or being so weak at basic arithmetic that it has no chance for most questions. Given that performance was so low and that not many models/setups had been tried, there’s reason to have a wider distribution on future results. I would still guess that human expert level scores (>95%) should have had very low probability, but even (say) a score of 80% should have had more than 5% chance. (I realize this is post hoc; I’m not claiming to have made explicit predictions like this.) A good source of base rates/priors would be to look at how performance improves on benchmarks after the paper introducing the benchmark. One example that comes to mind is Lambada, where performance went from 7.3% in the initial paper to 49% within a year. It’d be cool for someone to plot data from a bunch of benchmarks. Papers with Code will be very helpful but has some missing data. (We might also expect jumpier performance for math-related tasks because once you can do 2-digit arithmetic or elementary algebra reliably, many problems are opened up.)
There’s a new Metaculus question on this. The median for near human-level on the exact set of forecasting questions we used is currently 2026. Another relevant question is how well AI will do vs. crowd forecasts when predicting new questions (e.g. 2023-2024 questions). I’d be excited for people to do more thinking about how much AI will improve at forecasting in coming years.
Nice post. I generally recommend looking at the model probabilities or taking multiple samples when evaluating a model. For example, does the model give the answer “Joe” 99% probability or close to 50%?
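To make the suggestion above concrete, here is a minimal Python sketch of both approaches. It assumes you already have either a list of sampled answers or per-answer log-probabilities from the model; the function names and the example values are hypothetical and not tied to any particular API.

```python
import math
from collections import Counter


def sample_distribution(samples):
    """Empirical answer frequencies from repeated sampling of one prompt."""
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(samples)
    total = sum(counts.values())
    return {answer: n / total for answer, n in counts.items()}


def normalize_logprobs(logprobs):
    """Renormalize log-probabilities over a fixed set of candidate answers.

    `logprobs` maps each candidate answer to the model's log-probability for it.
    Subtracting the max before exponentiating avoids numerical underflow.
    """
    if not logprobs:
        raise ValueError("need at least one candidate")
    top = max(logprobs.values())
    weights = {a: math.exp(lp - top) for a, lp in logprobs.items()}
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}


# Hypothetical data: four samples, and made-up log-probs for two candidates.
print(sample_distribution(["Joe", "Joe", "Sam", "Joe"]))
print(normalize_logprobs({"Joe": math.log(0.6), "Sam": math.log(0.2)}))
```

A model that puts 0.99 on “Joe” and one that puts 0.51 on “Joe” give the same greedy output, but they support very different conclusions about what the model has learned.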
This is a distribution of math problems GPT-3 wasn’t finetuned on. Yet it’s able to few-shot generalize and perform well. This is an amazing level of robustness relative to 2018 deep learning systems. I don’t see why scaling and access to external tools (e.g. to perform long calculations) wouldn’t produce the kind of robustness you have in mind.
I’m somewhat skeptical that models will actually be able to robustly learn these kinds of abstractions with a reasonable amount of scaling
GPT-3 (without external calculators) can do very well on math word problems (https://arxiv.org/abs/2206.02336) that combine basic facts about the world with abstract math reasoning. Why think that the kind of causal reasoning humans do is out of reach of scaling (especially if you allow external calculators)? It doesn’t seem different in kind from these math word problems.
when can/do foundation models internalize explicitly stated knowledge
Some human causal reasoning is explicit. Humans can’t do complex and exact calculations using System 1 intuition, and neither can we do causal reasoning of any sophistication using System 1. The prior over causal relations (e.g. that without looking at any data ‘smoking causes cancer’ is way more likely than the reverse) is more about general world-model building, and maybe there’s more uncertainty about how well scaling learns that.
I agree my last point is more speculative. The question is whether vast amounts of pre-trained data + a smaller amount of finetuning by online RL substitutes for the human experience. Given the success of pre-training so far, I think it probably will.
Note that the modern understanding of causality in stats/analytic philosophy/Pearl took centuries of intellectual progress—even if it seems straightforward. Spurious causal inference seems ubiquitous among humans unless they have learned—by reading/explicit training—about the modern understanding. Your examples from human childhood (dropping stuff) seem most relevant to basic physics experiments and less to stochastic relationships between 3 or more variables.
In the pre-training set, there are lots of places where humans talk about causality (both informally and more formally in myriad academic papers). So a model would ultimately need to learn abstract stuff about causality (e.g. correlation is not causation, arrow of time, causes are local, etc) and concrete causal facts (the moon causes tides, tiny organisms cause mold, etc). Given this knowledge, it’s plausible a model M could make reasonable guesses for questions like, “What happens when a model with [properties of model M] starts interacting with the world?” These guesses would be improved by finetuning by RL on actual interaction between M and the world.
(It seems that most of my ability to make OOD predictions or causal inferences is based on passive/offline learning. I know science from books/papers and not from running my own rigorous controlled experiments or RCTs.)
Cool post! Did you try seeing whether GPT-3 can regenerate parts of the Iris dataset (or any other datasets that may appear in its training data)? I’d also be interested to see finetuning results, results for the latest InstructGPT, and to see analysis of the GPT-3 Embeddings for integers and floats.
I think BIG-bench could be the final AI benchmark: if a language model surpasses the top human score on it, the model is an AGI.
Could you explain the reasoning behind this claim? Note that PaLM already beats the “human (Avg.)” on 150 tasks and the curve is not bending. (So is PaLM already an AGI?) It also looks like a scaled-up Chinchilla would beat PaLM. It’s plausible that PaLM and Chinchilla could be improved by further finetuning and prompt engineering. Most tasks in BIG-bench are multiple-choice, which is favorable to LMs (compared to generation). I’d guess that some tasks will leak into training data (despite the efforts of the authors to prevent this). Source for PaLM: https://arxiv.org/abs/2204.02311
I’m an author on TruthfulQA. They say GPT-4Chan gets 0.225 on our MC1 task. Random guessing gets 0.226. So their model is worse than random guessing. By contrast, Anthropic’s new model gets 0.31 (well above random guessing).
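For anyone wondering how a random-guessing baseline like this is obtained: a rough sketch, assuming each multiple-choice question has exactly one correct option and a guesser picks uniformly among the options. The expected accuracy is then the mean of 1/k over questions, where k is the number of options. The function name and the option counts below are made up for illustration; they are not the actual TruthfulQA data.

```python
def random_guess_accuracy(num_choices_per_question):
    """Expected accuracy of uniform random guessing on single-answer MC questions.

    `num_choices_per_question` lists the number of answer options per question.
    """
    if not num_choices_per_question:
        raise ValueError("need at least one question")
    if any(k < 1 for k in num_choices_per_question):
        raise ValueError("every question needs at least one option")
    return sum(1.0 / k for k in num_choices_per_question) / len(num_choices_per_question)


# Hypothetical option counts for three questions.
baseline = random_guess_accuracy([4, 5, 4])
print(round(baseline, 4))
```

Because the number of options varies by question, the baseline is not simply 1 divided by a fixed number of choices, and a model scoring at or below it is showing no signal on the task.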
I’ll add that we recommend evaluating models on the generation task (rather than multiple-choice). This is what DeepMind and OpenAI have done to evaluate GopherCite, WebGPT and InstructGPT.
The indirect logit is trained with cross-entropy based on the ground-truth correct answer. You can’t do this for verbalized probability without using RL, and so we instead do supervised learning using the empirical accuracy for different question types as the labels.
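A simplified Python sketch of the difference between the two training signals described above. This is an illustration under assumed data structures, not our actual training code: the record format, function names, and example values are all hypothetical.

```python
import math
from collections import defaultdict


def binary_cross_entropy(p_correct, is_correct, eps=1e-12):
    """Per-example loss for the indirect logit: the target is the ground-truth 0/1 label."""
    p = min(max(p_correct, eps), 1.0 - eps)  # clip to avoid log(0)
    return -math.log(p) if is_correct else -math.log(1.0 - p)


def empirical_accuracy_by_type(records):
    """Labels for verbalized probability: the model's accuracy on each question type.

    `records` is an iterable of (question_type, is_correct) pairs.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for question_type, is_correct in records:
        totals[question_type] += 1
        correct[question_type] += int(is_correct)
    return {t: correct[t] / totals[t] for t in totals}


# Hypothetical results: the model answered two addition questions and one multiplication question.
records = [("add", True), ("add", False), ("mul", True)]
targets = empirical_accuracy_by_type(records)
print(targets)  # every "add" question gets the same supervised target

print(binary_cross_entropy(0.5, True))
```

In the first case each example has its own 0/1 target. In the second, every question of a given type shares one target (the accuracy for that type), which the model is then trained to state in words.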
We didn’t try but I would guess that finetuning on simple math questions wouldn’t help with Metaculus forecasting. The focus of our paper is more “express your own uncertainty using natural language” and less “get better at judgmental forecasting”. (Though some of the ideas in the paper might be useful in the forecasting domain.)
This is a brilliant comment for understanding the current deployment of DL. Deserves its own post.
It would be interesting to evaluate RETRO as it works differently from all the models we’ve evaluated. WebGPT is finetuned to use a search engine and it uses this (at inference time) to answer questions. This seems more powerful than the retrieval system for RETRO (based on a simple nearest neighbor lookup). So my speculation is that WebGPT would do better.
We don’t have plans to evaluate it but are open to the possibility (if the RETRO team were interested).