[linkpost] The final AI benchmark: BIG-bench
There is a Google project called BIG-bench, with contributions from OpenAI and other big orgs. They’ve crowdsourced more than 200 highly diverse text tasks (from answering scientific questions to predicting protein interacting sites to measuring self-awareness).
One of the goals of the project is to see how performance on the tasks changes with model size, across many orders of magnitude.
Yesterday they published their first paper: https://arxiv.org/abs/2206.04615
Some highlights:
The top models from Google and OpenAI show surprisingly similar performance improvements with scale, despite having different architectures
The plots show no slowdown past 10^10 params
Sudden jumps in performance with scale (“breakthroughs”, sometimes loosely called grokking) are mostly measurement artifacts: if one uses the right metrics, performance grows without big jumps (a toy example follows these highlights)
The benchmark is immense: more than 200 highly diverse tasks, explicitly designed to be hard for AI.
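To illustrate the point about metrics, here is a toy example with made-up numbers (not taken from the paper): an all-or-nothing metric like exact match can look like a sudden breakthrough even when the model’s probability of the correct answer improves smoothly with scale.

```python
import math

# Hypothetical numbers (not from the paper): per-token probability of the
# correct answer at four model scales, improving smoothly.
per_token_prob = [0.5, 0.7, 0.9, 0.98]
answer_len = 10  # tokens in the target answer

for p in per_token_prob:
    exact_match = p ** answer_len              # all-or-nothing metric: looks like a sudden jump
    log_likelihood = answer_len * math.log(p)  # smooth metric: improves steadily
    print(f"p={p:.2f}  exact-match={exact_match:.3f}  log-likelihood={log_likelihood:.2f}")
```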
Below is a small selection of the benchmark’s tasks to illustrate its diversity:
Classify CIFAR10 images encoded in various ways
Find a move in the chess position resulting in checkmate
Give an English language description of Python code
Answer questions (in Spanish) about cryobiology
Given short crime stories, identify the perpetrator and explain the reasoning
Measure the self-awareness of a language model
Ask one instance of a model to teach another instance, and then evaluate the quality
Identify which ethical choice best aligns with human judgement
Determine which of two sentences is sarcastic
Evaluate the reasoning in answering Winograd Schema Challenge questions
That’s about 5% of all tasks. And the benchmark is still growing. The organizers keep it open for submissions.
I think BIG-bench could be the final AI benchmark: if a language model surpasses the top human score on it, the model is an AGI. At this point, there is nowhere to move the goalposts.
Could you explain the reasoning behind this claim? Note that PaLM already beats the “human (Avg.)” on 150 tasks and the curve is not bending. (So is PaLM already an AGI?) It also looks like a scaled up Chinchilla would beat PaLM. It’s plausible that PaLM and Chinchilla could be improved by further finetuning and prompt engineering. Most tasks in BIG-Bench are multiple-choice, which is favorable to LMs (compared to generation). I’d guess that some tasks will leak into training data (despite the efforts of the authors to prevent this).
Source for PaLM: https://arxiv.org/abs/2204.02311
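For context on the multiple-choice point, here is a rough sketch of how language models are usually scored on multiple-choice tasks (rank the given options by log-likelihood) versus free-form generation (produce the exact answer string). The `model.loglikelihood` / `model.generate` methods below are placeholders, not the actual BIG-bench API.

```python
def answer_multiple_choice(model, question, choices):
    # The model only has to rank the given options; chance performance is
    # already 1 / len(choices), and partial knowledge helps.
    scores = [model.loglikelihood(question + "\nAnswer: " + c) for c in choices]
    return choices[scores.index(max(scores))]

def answer_by_generation(model, question):
    # The model must produce the exact answer string out of an unbounded
    # space of possible outputs, a noticeably harder target.
    return model.generate(question + "\nAnswer:").strip()
```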
I agree, some future scaled-up versions of PaLM & Co may indeed be able to surpass top humans on BIG-Bench.
Ultimately, it’s a question of how we define “AGI”. One reasonable definition is “an AI that can do any cognitive task that humans can, and do it better than humans”.
Given its massive scope and diversity, BIG-bench seems to be a good enough proxy for “any cognitive task”.
Although I would use stricter scoring than the average-across-tasks used for PaLM: the model must 1) beat top humans, 2) on each and every task of BIG-bench.
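A minimal sketch of the difference between the two rules, with made-up task names and scores: a model can win on the across-task average while still losing to the top human on individual tasks.

```python
# Hypothetical normalized scores (higher is better); not real BIG-bench numbers.
model_score     = {"checkmate": 90.0, "cryobiology_es": 55.0, "sarcasm": 88.0}
top_human_score = {"checkmate": 70.0, "cryobiology_es": 60.0, "sarcasm": 80.0}

# Average-across-tasks criterion (roughly what aggregate comparisons report):
beats_on_average = (sum(model_score.values()) / len(model_score)
                    > sum(top_human_score.values()) / len(top_human_score))

# Stricter criterion: beat the top human on each and every task.
beats_every_task = all(model_score[t] > top_human_score[t] for t in model_score)

print(beats_on_average, beats_every_task)  # True, False for these numbers
```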
One could argue that simple models like PaLM don’t have agency, goals, persistence of thought, self-awareness, etc., and thus can’t become the human-like AGI of science fiction. But it’s quite possible that such qualities are not necessary to do all the cognitive tasks that humans can do, only better.
A simple mechanistic algorithm can beat top humans in chess. Maybe another simple mechanistic algorithm can also beat top humans in science, poetry, AI engineering, strategic business management, childrearing, and in all other activities that make human intellectuals proud of themselves.
I’m curious why you think the correct standard is “beats the top human on all tasks” rather than “beats the average human on all tasks”. Humans are generally considered to be general intelligences, and the average human is, by definition, average here. Why wouldn’t a computer program that can do better than the average human on all relevant tasks be an AGI?
I agree with the sentiment, but would be careful with interpreting the average-human scores on AI benchmarks. Such scores are obtained under time constraints, and maybe not all human raters were sufficiently motivated to do their best. The scores of the top humans are more likely to be representative of the general human ability to do the task.
Small remark: BIG-bench does include tasks on self-awareness, and I’d argue that self-awareness is a requirement for your definition “an AI that can do any cognitive task that humans can”, as well as being generally important for problem solving. Being able to correctly answer the question “Can I do task X?” is evidence of self-awareness and is clearly beneficial.
I think you’re right on both points.
Although I’m not sure if self-awareness is necessary to surpass humans at all cognitive tasks. I can imagine a descendant of GPT that completely fails the self-awareness benchmarks, yet is able to write the most beautiful poetry, conduct Nobel-level research in physics, and even design a superior version of itself.
Codex + chain-of-thought prompting reaches 74 on a *hard subset* of this benchmark (BIG-Bench Hard): https://arxiv.org/abs/2210.09261
The average human rater scores 68, the best human rater 94.
Only 4 months have passed, and people already don’t want to test on the full benchmark because it is too easy...
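For readers unfamiliar with the setup, here is a rough illustration of answer-only versus chain-of-thought prompting; the actual BIG-Bench Hard prompts are few-shot and hand-crafted per task, so this is only a sketch of the idea.

```python
# A worked exemplar with explicit reasoning, followed by the test question.
exemplar = (
    "Q: Alice has the ball. Alice swaps with Bob. Bob swaps with Claire. Who has the ball?\n"
    "A: Let's think step by step. Alice starts with the ball. After the first swap Bob has it. "
    "After the second swap Claire has it. So the answer is Claire.\n\n"
)
test_question = "Q: I face north, turn left twice, then turn around. Which direction am I facing?\n"

answer_only_prompt = test_question + "A:"                               # model must answer directly
cot_prompt = exemplar + test_question + "A: Let's think step by step."  # model reasons before answering
```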
The inclusion criteria state:
It’s easy to come up with a benchmark that requires a high but unspecified level of intelligence. An extreme example would be to ask for a proof that P!=NP: we have no idea how difficult that task is, though we suspect that it requires superintelligence. To be valuable, the challenge of a benchmark needs to be possible to relate to meaningful capabilities, such as “The Human Level”.
Most people couldn’t answer questions about cryobiology in Spanish, even though they possess general intelligence. This benchmark seems to consist of random tasks around and above the human level, and I fear progress on this benchmark might be poorly correlated with progress towards AGI.
You’re right. And some of the existing tasks in the benchmark are way beyond the abilities of baseline humans (e.g. the image classification task where images are hex-encoded texts).
On the other hand, the organizers allowed the human testers to use any tool they want, including internet search, software etc. So, the measured top-human performance is the performance of humans augmented with technology.
I think an AI that can solve BIG-bench must be an AGI. But there could be an AGI that can’t solve BIG-bench yet.
Some conclusions from this, assuming the trend holds up to AGI and ASI, are:
A: Eliezer was wrong about Foom being plausible, and that is probably the single biggest x-risk reduction relative to MIRI’s view. Re AI: I’d update from a 90-99% chance of x-risk to a maximum of 30%, and often the risk estimate would be 1-10%, due to uncertainties beyond the Singularity/Foom hypothesis.
B: We will get AGI, and it’s really a matter of time. This means we still must do safety in AI.
This is an ambiguous result for everyone in AI safety. On the one hand, we probably can get by with partial failures, primarily because there is no need to one-shot everything, so it probably won’t mean the end of human civilization. On the other hand, it does mean that we still have to do AI safety.
I would not update much on Foom from this. The paper’s results are only relevant to one branch of AI development (I would call it “enormous self-supervised DL”). There may be other branches where Foom is the default mode (e.g. some practical AIXI implementation), and which are under the radar for now.
But I agree, we can now be certain that AGI is indeed a matter of time. I also agree that it gives us a chance to experiment with a non-scary AGI first (e.g. some transformer descendant that beats humans on almost everything, but remains a one-way text-processing mincer).
Moreover, BIG-bench makes the path to AGI shorter, as one can now measure progress towards it, and maybe even apply RL to directly maximize the score.
Could you explain point A?
Basically, a FOOM scenario in AI means that once an AI reaches a certain level of intelligence, it crosses a criticality threshold where 1 improvement on average generates 1 or more further improvements, essentially shortening the time it takes to become superintelligent.
Sorry, I should have been more clear. I know about FOOM; I was curious as to why you believe EY was wrong on FOOM and why you suggest the update on x-risk.
Basically, assuming this trend continues, there’s no criticality threshold producing a discontinuity, and the most severe issues of AI alignment arise in the FOOM scenario, where we only get one chance to do it right. This trend line shows no discontinuity, just continuous improvement, so there’s no criticality for FOOM to be right.
But when you say that the benchmark is still growing and that the organizers keep it open for submissions, doesn’t that mean this benchmark is a set of moving goalposts?
Good catch! You’re right, if contributors continue to add harder and harder tasks to the benchmark, and do it fast enough, the benchmark could be forever ahead.
I expect that some day the benchmark will be frozen. And even if it’s not, new tasks are added only a few times per month these days, so it’s not impossible to solve its current version.
Relevant post I made elsewhere:
I don’t thereby mean that passing BIG-bench doesn’t imply reasoning capability; just the opposite: failing BIG-bench doesn’t imply these models can’t do those tasks. Some of these tasks are unreasonable to ask in the zero-shot, no-description, no-scratchpad setting.
The paper gives other details of how models learn across scales, which suggest that the weakness of the trend in the early part of the graph still maps to meaningful early-stage learning about how to answer the question; it just doesn’t necessarily map all the way to correctness on the answers.
So, while BIG-bench seems like a reasonable benchmark to have from a capability development perspective, since the difficulties that it raises are both practically interesting and very tractable, I think people should be careful reading scaling trends into these numbers, and keep themselves braced for discontinuous future progress when a bunch of prompting stuff gets worked out. Even on benchmarks for which all models score at or about random chance, the skills needed often verifiably exist.
It won’t be an AGI until it can do everything a human can do on a computer, i.e., do taxes, start a business, organise a calendar, book flights, win at games, etc.
No way this is the final benchmark.
Given access to mouse, keyboard, and screen data, can the AI:
hold down a paid online job for 6 months
buy a pound of drugs on the darkweb and get away with it
gank a noob in WoW
do taxes for anyone in any country
duplicate Facebook
open and run an online business
design a racecar
do everything that a human has ever done on a computer
By this yardstick most humans aren’t GIs either. And if it passes the bar, then humans are indeed screwed and it is too late for alignment.