Benchmarks for Comparing Human and AI Intelligence

Epistemic status: Been thinking about this for about a week, and done some basic googling.

Intro

Creating benchmarks for measuring and comparing the intelligence of AIs and humans seems useful for tracking progress and making predictions about future development.

This article summarizes some existing benchmarks and reasons about what new benchmarks could be useful.

Please share your own thoughts in the comments; it’s much appreciated.

Benchmark 1 - Coding

Performance on coding problems could be a useful benchmark for three reasons:

  1. A lot of resources are being put into coding AIs, which means the benchmark gets used frequently with cutting-edge models.

  2. Coding is a complex type of problem solving, and the skills it requires may partly indicate how well the AI could perform on other tasks.

  3. If an AI surpasses all humans on coding tasks, it will quite possibly be good enough for self-improvement, so it would be useful to predict the future development of AIs’ coding ability.

Currently, AlphaCode by DeepMind is the most prominent code-solving AI, performing better than 44% of competition programmers, who are likely significantly above what the average human would achieve even with programming experience.
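To make concrete what “performance on coding problems” means as a benchmark, here is a minimal sketch of the usual scoring scheme: a candidate solution (here a hand-written stand-in for model output) is run against hidden test cases, and the fraction passed is reported. All names and tests are illustrative, not taken from any real benchmark.

```python
def candidate_solution(xs):
    # Stand-in for a model-generated solution: return the running maximum.
    result, best = [], float("-inf")
    for x in xs:
        best = max(best, x)
        result.append(best)
    return result

# Hidden test cases: (input, expected output) pairs.
HIDDEN_TESTS = [
    ([1, 3, 2], [1, 3, 3]),
    ([5], [5]),
    ([-1, -4, 0], [-1, -1, 0]),
]

def score(solution, tests):
    # Fraction of hidden tests the solution passes.
    passed = sum(1 for inp, expected in tests if solution(inp) == expected)
    return passed / len(tests)

print(score(candidate_solution, HIDDEN_TESTS))  # 1.0
```

Real benchmarks like those AlphaCode was evaluated on are far stricter (time limits, many hidden tests, multiple submissions), but the pass/fail structure is the same.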

Benchmark 2 - Number Sequences

A good number sequence test should probe the ability to recognize novel patterns. If each number series in the test contains a unique and novel pattern, performance should serve as some proxy for reasoning ability.

I didn’t find any good existing benchmarks for this, so I decided to test ChatGPT on a test I found online. I gave ChatGPT each number sequence without the multiple-choice options, and it had to produce the next number. If it answered something that was not among the options, I picked the top option; this turned out to be correct in two of those cases.

The AI got 11/20 correct (of which 2 answers were lucky guesses, since I picked the top alternative), which the site claims corresponds to an IQ of 115. I doubt that is the case, but if it were true, it would mean the AI is better at numerical sequences than 84% of humans. I find it more likely that the AI is better than somewhere around 20–30% of people at number sequences (given the limit of 20 minutes for 20 questions).
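The setup above can be sketched as a tiny harness: each item is a sequence prefix plus the correct next term, and a model is scored by exact-match on its prediction. The sequences and the stand-in “model” below are illustrative assumptions, not the actual test items or ChatGPT.

```python
# Each item: (rule description, prefix shown to the model, correct next term).
ITEMS = [
    ("add 3",     [2, 5, 8, 11],   14),
    ("double",    [3, 6, 12, 24],  48),
    ("squares",   [1, 4, 9, 16],   25),
    ("fibonacci", [1, 1, 2, 3, 5],  8),
]

def dummy_model(prefix):
    # Stand-in "model" that only recognizes constant-difference sequences.
    diff = prefix[-1] - prefix[-2]
    if all(b - a == diff for a, b in zip(prefix, prefix[1:])):
        return prefix[-1] + diff
    return None  # gives up on anything else

correct = sum(1 for _, prefix, ans in ITEMS if dummy_model(prefix) == ans)
print(f"{correct}/{len(ITEMS)}")  # 1/4 — only the arithmetic sequence
```

The point of requiring each series to use a unique rule is visible here: a model that has memorized one pattern family scores poorly, so the score reflects pattern discovery rather than recall.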

Benchmark 3 - Big Bench by Google

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their future capabilities. More than 200 tasks are included in BIG-bench. PaLM with 5-shot learning slightly outperforms the average human rater.
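For readers unfamiliar with the “5-shot” terminology: the model is shown five solved examples of the task before the actual test question, all in one prompt. Here is a minimal sketch of how such a prompt is assembled; the task and examples are illustrative, not actual BIG-bench items.

```python
# Five solved examples (the "shots") for an illustrative arithmetic task.
EXAMPLES = [
    ("What is 2 + 2?", "4"),
    ("What is 7 - 3?", "4"),
    ("What is 3 * 3?", "9"),
    ("What is 10 / 2?", "5"),
    ("What is 6 + 1?", "7"),
]

def build_few_shot_prompt(examples, question):
    # Prepend each solved example, then leave the answer slot open.
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\nQ: {question}\nA:"

prompt = build_few_shot_prompt(EXAMPLES, "What is 8 - 5?")
print(prompt)
```

The model’s completion after the final `A:` is then compared against the task’s expected answer; 0-shot evaluation is the same thing with the examples omitted.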


Ideas of Further Benchmarks

  1. A general reasoning test like the one used on PaLM could be interesting. From anecdotally asking intelligent people some of the questions posed to PaLM, I assign a high probability that PaLM outperforms at least 30% of all humans on that type of task.

  2. A Turing test using different age groups. The Turing test measures very general intelligence, since a very wide variety of questions can be asked.

    Adults could chat with either a child or an AI, and have to determine which is the child and which is the AI. AI progression could then be tracked by noting which age groups it can imitate convincingly.

Final Thoughts

Benchmarks suggest that on most reasoning, problem-solving, and general intelligence tasks, AI already outperforms a significant part of the population. However, AIs are often portrayed as stupid, both for making what look to humans like obvious mistakes and for lacking long-term memory. I argue that this viewpoint is incorrect and that AIs are more intelligent than commonly portrayed; they are simply not prompted/trained to give the kind of answer a human would expect. In my tests, ChatGPT is not necessarily smarter than the first version of GPT-3, but it is certainly more human-friendly and can thus seem more intelligent.

If we truly have gone from AI being smarter than a small fraction of the population to being smarter than a large chunk of it, that might indicate that shorter AGI timelines are more likely.

Please leave comments with suggestions on other benchmarks, feedback on the post, or any other thoughts you’d like to share.