Contra Alexander on the Bitter Lesson and IQ

Link post

A recent post from Scott Alexander argues in favor of treating intelligence as a coherent and somewhat monolithic concept. Especially when thinking about ML, the post says, it is useful to think of intelligence as a quite general faculty rather than a set of narrow abilities. I encourage you to read the full post if you haven’t already.

Now we attempt to answer questions such as:

  • Who’s stronger: Albert Einstein or a baby?

  • Who’s better at football: GPT4 or a trained monkey?

  • Who’s smarter: LLMs or self-driving cars?

Football Quotient (FQ)

But first, before I talk about IQ, let me introduce something called FQ. That’s the “Football Quotient” (or the “soccer quotient” for us Americans).

Imagine you wanted to know how good someone was at football, so you took all the important football skills, like dribbling and shooting, and looked at how good they were at each one. Then you could add up those skill scores and get a single number that sort of represents how good they are at football overall. And some people will have a higher football quotient than others.

But it doesn’t really tell the whole story. Maybe it doesn’t capture goalies very well, or it doesn’t capture people with athletic talent but no training. Maybe one person has a high FQ because they’re tall and can head the ball really well, but they aren’t as good at free kicks, while another player is fast and scores well but is rubbish on defense. Two players might have the same FQ while having different skills.
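
To make that concrete, here’s a toy sketch of an FQ calculation. Every skill, weight, and score below is invented purely for illustration; the point is just that the same total can come from very different skill profiles.

```python
# A toy FQ calculation. The skills, weights, and scores are all invented;
# the point is that a weighted sum can be identical for two players with
# very different skill profiles.

SKILLS = ["dribbling", "shooting", "heading", "pace", "defending"]
WEIGHTS = {skill: 1.0 for skill in SKILLS}  # equal weights, purely for illustration

def fq(scores):
    """Weighted sum of per-skill scores (0-10 each)."""
    return sum(WEIGHTS[skill] * scores[skill] for skill in SKILLS)

tall_header  = {"dribbling": 5, "shooting": 6, "heading": 9, "pace": 4, "defending": 6}
quick_winger = {"dribbling": 8, "shooting": 7, "heading": 3, "pace": 9, "defending": 3}

print(fq(tall_header), fq(quick_winger))  # 30.0 30.0 -- same FQ, different players
```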

But it’s still a pretty good measure of overall football skill, and it probably explains a great deal of the variation between people in how good they are at football. It’s probably even applicable outside of football, so you could use the FQ to evaluate basketball players and get pretty decent results, even though it wasn’t designed for that. It’s a useful test even though it isn’t perfect.

IQ is like that.

Tests

The examples that Scott chooses are intended to support the idea that intelligence is broad, and that the g factor is real, coherent, and useful. But I think these examples obscure more than they elucidate.

A primary example of intelligence as a general-use capability is that scores on the math and verbal sections of the SAT correlate quite strongly at a value of 0.72. As Scott notes, maybe some people are just better test takers than others, or have more access to tutoring or a healthier diet in childhood, but nonetheless it does seem like “test taking ability” is captured by a measurement of intelligence in humans.

The math and verbal sections of the SAT are quite similar to each other. They both rely on linguistic reasoning and knowledge to some extent, and they both use language as the questioning and answering modality. It’s true that if a human or large language model does well on the verbal section, we expect it to do similarly well on the math section, with some correlation.

But what about other pairs of tests that are quite similar?

For example, let’s go back to the FQ (football quotient). Imagine it has a written portion and a physical portion. Among American adults, my guess is that there’s a moderate correlation between knowledge of the rules of football and the ability to actually play it. A lot of people learn the rules because they go out and play games, and the people who like to watch on TV probably like to go out and kick the ball around sometimes.

But now let’s talk about AI models. I’d guess that GPT4 would do quite well on the written portion of the FQ test; it might score in the 99th percentile on a hard exam. But it’s not even able to participate in the physical test; it gets zero points by disqualification.

On the other hand, a trained monkey would bomb the written portion even if it did okay on the physical section.

[Image: a trained monkey playing football]

Is this just because GPT4 doesn’t have a body?

No. Once we get outside the realm of a written test, GPT4 starts to fail at all sorts of things. For example, GPT4 currently (as of publication) lacks a long-term memory store, so it’s not able to handle tasks that require long-term memory. It might be able to ace the bar exam, but it can’t represent a client over multiple interactions.

Another example comes from Voyager, a paper in which the authors use an LLM-based coding agent to play Minecraft. It’s eventually able to achieve impressive goals like mining diamonds, but it can’t build a house because it has an impoverished visual system.

Once we’re talking about tasks this different, we have much less reason to expect correlated performance among AIs.
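
Here’s a toy sketch of that claim. All the numbers are invented for illustration: a handful of imaginary human test-takers whose written and physical FQ scores track each other reasonably well, plus the GPT4 and trained-monkey profiles from above, which break the pattern.

```python
# Toy illustration (invented numbers): written and physical FQ scores
# track each other among humans, but the relationship falls apart once
# we include test-takers with very different "architectures".

import numpy as np

# (written percentile, physical percentile) for some imaginary humans
humans = [(80, 70), (55, 60), (30, 40), (90, 75), (45, 35), (65, 80)]
gpt4 = (99, 0)    # aces the written test, disqualified from the physical one
monkey = (5, 70)  # bombs the written test, does okay on the pitch

def written_physical_corr(pairs):
    written, physical = zip(*pairs)
    return np.corrcoef(written, physical)[0, 1]

print(f"humans only:           r = {written_physical_corr(humans):+.2f}")
print(f"+ GPT4 and the monkey: r = {written_physical_corr(humans + [gpt4, monkey]):+.2f}")
# The first correlation is strongly positive; the second is near zero or negative.
```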

Why the g Factor Exists in Humans but Not Necessarily in AIs

A large reason for the existence of the g factor in humans is that humans all have approximately the same architecture. The large majority of humans (although not all) have a working visual system, a functioning long-term memory, and a working grasp of language. Furthermore, most humans have mastered basic motor control, and we have decent intuitions about physical objects. Most humans are able to set and pursue at least simple goals and act to try to maintain their personal comfort.

On the other hand, AI models can have vastly different architectures! A large language model doesn’t necessarily have multimodal visual input, or motor output, or a long-term memory, or the ability to use tools, or any notion of goal pursuit.

When it comes to AI models, it’s necessary to break up the concept of intelligence, because AI models are composed of multiple distinct functions. It’s very easy to get an AI that passes some sorts of tests but fails others.
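
One way to see the difference is a quick simulation. This is a minimal sketch with made-up parameters: one population whose test scores are all driven by a single shared latent factor (standing in for a shared architecture), and one whose scores come from independent modules. A crude proxy for “how much a g factor explains” is the share of variance captured by the largest eigenvalue of the correlation matrix.

```python
# Minimal simulation (all parameters invented): when one latent factor drives
# every test score, a single principal component explains most of the variance.
# When each test taps an independent module, no single factor dominates.

import numpy as np

rng = np.random.default_rng(0)
n_agents, n_tests = 500, 6

# Shared-architecture population: score = latent "g" + a little test-specific noise.
g = rng.normal(size=(n_agents, 1))
shared_scores = g + 0.4 * rng.normal(size=(n_agents, n_tests))

# Mixed-architecture population: each test score is an independent draw.
modular_scores = rng.normal(size=(n_agents, n_tests))

def top_factor_share(scores):
    """Fraction of total variance captured by the largest eigenvalue
    of the test-score correlation matrix."""
    eigvals = np.linalg.eigvalsh(np.corrcoef(scores, rowvar=False))
    return eigvals[-1] / eigvals.sum()

print(f"shared architecture: {top_factor_share(shared_scores):.0%}")   # high (~85-90%)
print(f"independent modules: {top_factor_share(modular_scores):.0%}")  # close to 1/6
```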

Which is smarter, GPT3 or a self-driving car? This isn’t a good question; their architectures are too different.

It’s a mistake to think that IQ subtests differ from one another only in subject matter; they can also differ in which underlying functions they require.

It’s also wrong to take model architecture for granted and assume that an AI model will always have a baseline ability to participate in relevant tests.

Sure, the concept of “intelligence” is useful if you want to think about the differences between GPT2 and GPT4. But it breaks down when you’re thinking about AI models with different capabilities.

So Is IQ Useful for AIs?

What’s the goal for ML researchers at this moment? Is it to make smarter models with a higher IQ? Or is it to make more broadly capable models that can do things that GPT4 can’t?

If you’re excited about continuing to minimize next-word prediction error, then you will probably find IQ to be a useful concept.

But if you find it unsatisfactory that AI models don’t have memory or agency or vision or motor skills, then you probably want to use a multi-factor model of intelligence rather than a generalized quotient.
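
As a closing sketch, here’s one way a multi-factor description might look in code. The capability names and numbers are invented placeholders rather than benchmark results; the point is that the “which is smarter” question from earlier only makes sense dimension by dimension.

```python
# A sketch of the multi-factor alternative: describe each system by a profile
# of capabilities instead of collapsing it into a single quotient.
# All names and scores below are invented placeholders.

from dataclasses import dataclass, fields

@dataclass
class CapabilityProfile:
    language: float          # written tests, next-word prediction
    vision: float            # parsing visual scenes
    motor: float             # acting in the physical world
    long_term_memory: float  # retaining state across interactions
    agency: float            # setting and pursuing goals over time

gpt3 = CapabilityProfile(language=0.9, vision=0.0, motor=0.0,
                         long_term_memory=0.1, agency=0.1)
self_driving_car = CapabilityProfile(language=0.0, vision=0.8, motor=0.7,
                                     long_term_memory=0.2, agency=0.4)

# "Which is smarter?" has no single answer; compare dimension by dimension.
for f in fields(CapabilityProfile):
    print(f"{f.name:>16}: GPT3={getattr(gpt3, f.name):.1f}  "
          f"car={getattr(self_driving_car, f.name):.1f}")
```

Nothing stops you from computing a weighted average of such a profile when that’s convenient; the point is that keeping the separate dimensions around stops a single number from hiding the zeros.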