The Limitations of GPT-4

Amidst the rumours about a new breakthrough at OpenAI, I thought I’d better publish this draft before it gets completely overtaken by reality. It is essentially a collection of “gaps” between GPT-4 and the human mind. Unfortunately, the rumours around Q* force me to change the conclusion from “very short timelines seem unlikely” to “who the f**k knows”.

While GPT-4 has a superhuman breadth of knowledge, writing speed, and short-term memory, it also has a number of important limitations compared to the human mind.

Some of these will be overcome in the near future because they depend on engineering and training data choices. Others seem more fundamental to me, because they are due to the model architecture and the training setup.

These fundamental limitations are the reason why I do not expect scaling GPT further to lead to AGI. In fact, I interpret further scaling of the exact current paradigm as evidence that overcoming these limitations is hard.

I expect a scaled-up GPT-4 to exhibit the same strengths and weaknesses, and the improved strengths to paper over the old weaknesses in at most a superficial fashion.

I also expect that, with further scaling, the tasks GPT cannot do will increasingly load on these fundamental limitations, and therefore that returns will diminish.

This list is not exhaustive, and there are likely ways to frame even the limitations I identify in a more insightful or fruitful way. For example, I am not sure how to interpret GPT-4’s curious inability to understand humor. I hope further limitations will be mentioned in the comments.

Integration of the Senses

GPT-4 cannot really hear, and it cannot really talk.

Voice input is transcribed into text by a separate model, ‘Whisper’, and then fed to GPT-4. The output is read aloud by yet another model. This process loses the nuances of the input: pronunciation, emphasis, emotion, accent, and so on. Likewise, GPT-4 cannot modulate its output in terms of speed, rhythm, melody, singing, emphasis, accent, or onomatopoeia.
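
To make the pipeline concrete, here is a minimal sketch of the transcription step, assuming the open-source whisper package and a hypothetical audio file clip.wav; everything the speech carries beyond the words themselves is already gone by the time the text reaches the language model.

```python
# A minimal sketch, assuming the open-source `whisper` package and a local
# file "clip.wav"; illustrative only, not OpenAI's production pipeline.
import whisper

model = whisper.load_model("base")       # separate speech-to-text model
result = model.transcribe("clip.wav")    # prosody, accent and emotion are dropped here

prompt_text = result["text"]             # only this flat string ever reaches the LLM
print(prompt_text)
```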

In fact, due to tokenization it would be fair to say that GPT-4 also cannot really read. All information about character-level structure that is relevant to spelling, rhyming, and pronunciation has to be inferred during training.
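
A small sketch with the tiktoken library (the tokenizer family used by GPT-4-class models) makes this concrete; the exact token split is incidental, the point is that the model receives integer IDs rather than characters.

```python
# A small sketch using `tiktoken`; the model operates on token IDs, not letters.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # encoding used by GPT-4-class models
tokens = enc.encode("strawberry")
print(tokens)                                                # a few integer IDs
print([enc.decode_single_token_bytes(t) for t in tokens])    # sub-word chunks, not characters
# Spelling, rhyme and pronunciation have to be inferred from co-occurrence
# statistics rather than read off the characters directly.
```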

The vision component of most open multi-modal models is likewise “grafted on”: it is partially trained separately and then connected to the large language model (LLM) and fine-tuned, for example via the CLIP model, which maps images and their descriptions into the same vector space.

This means GPT-4 may not have access to the exact position of objects or to specific details, and it cannot “take a closer look” as a human would.
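
As an illustration of that shared vector space, here is a hedged sketch using Hugging Face's CLIP implementation (the checkpoint name and image path are placeholders): what the language side gets is a coarse embedding-level signal, not the pixels themselves.

```python
# A hedged sketch of CLIP-style image-text matching; "photo.jpg" is a placeholder.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")
texts = ["a cat on a sofa", "a city street at night"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Image and text meet only in a shared embedding space; fine-grained spatial
# detail is not preserved in this summary representation.
print(outputs.logits_per_image.softmax(dim=-1))
```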

I expect these limitations to largely vanish as models are scaled up and trained end-to-end on a large variety of modalities.

System 2 Thinking

Humans not only think quickly and intuitively but also engage in slow, reflective thinking to process complex issues. GPT-4’s architecture is not meaningfully recurrent; it has a fixed number of processing steps for each token, which puts a hard cap on sequential thought.

This contrast with human cognition is most evident in GPT-4’s unreliable counting ability, but it also shows up in many other tasks. The lack of System 2 thinking may be the most fundamental limitation of current large language models.
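
A minimal PyTorch sketch (the layer count and sizes are arbitrary, not GPT-4’s actual configuration) shows what the hard cap means: every token passes through the same fixed stack of layers, however hard the question is.

```python
# A conceptual sketch: a transformer applies a fixed stack of layers to every
# token, so sequential computation per token does not grow with problem difficulty.
import torch
import torch.nn as nn

N_LAYERS, D_MODEL, N_HEADS = 12, 256, 8   # illustrative sizes only

block = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=N_HEADS, batch_first=True)
stack = nn.TransformerEncoder(block, num_layers=N_LAYERS)

x = torch.randn(1, 10, D_MODEL)           # embeddings for 10 tokens
y = stack(x)                              # exactly N_LAYERS sequential steps, always
# A human can keep thinking for as long as a problem demands; the model cannot
# add depth at inference time, it can only emit more tokens.
```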

Learning during Problem Solving

Humans rewire their brains through thinking; synapses are continuously formed or broken down. When we suddenly understand something, that realization often lasts a lifetime. GPT-4, once trained, does not change during use.

It learns neither from its mistakes nor from correctly solved problems. It notably lacks an optimization step during problem solving that would allow previously unsolvable problems to become solvable and make that new ability persist.

The fundamental difference here is that in humans, the correct representations for a given problem are worked out during the problem-solving process and then usually persist. GPT-4 relies on the representations learned during training, so genuinely new problems stay out of reach.

Even retraining doesn’t solve this issue, because GPT-4 would need many similar problems and their solutions to learn the necessary representations.
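
A minimal sketch, with a toy torch module standing in for the LLM, of what “does not change during use” means: the weights before and after answering are identical, so nothing worked out on one problem carries over to the next.

```python
# A minimal sketch: at inference time the weights are frozen, so nothing the
# model works out on one problem is available for the next one.
import copy
import torch
import torch.nn as nn

model = nn.Linear(4, 4)                   # toy stand-in for a trained network
model.eval()

before = copy.deepcopy(model.state_dict())

with torch.no_grad():                     # no gradients, no optimizer step
    _ = model(torch.randn(1, 4))          # "problem 1"
    _ = model(torch.randn(1, 4))          # "problem 2" starts from the same weights

after = model.state_dict()
print(all(torch.equal(before[k], after[k]) for k in before))   # True: use never updates the model
```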

Compositionality and Extrapolation

Some theories suggest that the human neocortex, the seat of intelligence, uses most of its capacity to model the interplay of objects, parts, concepts, and sub-concepts. This ability to abstractly model the interplay of parts allows for better extrapolation and learning from significantly less data.

In contrast, GPT-4 learns the statistical interplay between words. Small changes in wording can significantly influence its output. Because it lacks an inductive bias for compositionality, it requires a vast amount of data to learn these connections.

Limitations due to the Training Setup

Things not present in the training data cannot be learned by the model, including many visual or acoustic phenomena and especially physical interaction with the world.

GPT-4 does not possess a physical, mechanical, or intuitive understanding of many aspects of the world. The world is full of details that become apparent only when one tries to perform tasks within it. Humans learn from their interaction with the world and are evolutionarily designed to act within it. GPT-4 models data, and for it there is nothing beyond the data.

This results in a lack of consistency in decisions, a lack of the ability to robustly pursue goals, and a lack of any understanding of, or even need for, changing things in the world. To GPT-4, the input stands alone; it does not represent a real-world situation.

GPT-4’s causal knowledge is merely meta-knowledge stored in text. Learning causal models of new systems would require interacting with those systems and receiving feedback from them. Due to this missing feedback, there is little optimization pressure against hallucinations.

Conclusion

Some of these points probably interact or will be solved by the same innovation. System 2 thinking, for example, is likely necessary to move the parts of concepts around while searching for the solution to a problem.

The limitations due to the training setup might be solved by a different setup, but that means forgoing cheap and plentiful data. The ability to learn from little data will be required to learn from modalities other than abundant, information-dense text.

It is very unclear to me how difficult these problems are to solve, but I also haven’t seen realistic approaches to tackling them. Every passing year without a solution makes it more likely that these problems are hard.

Very short timelines seemed unlikely to me when I wrote this post, but Q* could conceivably solve “System 2 thinking” and/or “learning during problem solving”, which might be enough to put GPT-5 over the threshold of “competent human” in many domains.