The Stochastic Parrot Hypothesis is debatable for the latest generation of LLMs

This post is part of a sequence on LLM Psychology.

@Pierre Peigné wrote the details section in argument 3 and the “Other weird phenomenon to consider” section. The rest is written in the voice of @Quentin FEUILLADE—MONTIXI.


Intro

Before diving into what LLM psychology is, it is crucial to clarify the nature of the subject we are studying. In this post, I’ll challenge the commonly debated stochastic parrot hypothesis for state-of-the-art large language models (≈GPT-4), and in the next post, I’ll shed light on what LLM psychology actually is.

The stochastic parrot hypothesis suggests that LLMs, despite their remarkable capabilities, don’t truly comprehend language. They are like mere parrots, replicating human speech patterns without truly grasping the essence of the words they utter.

While I previously thought this argument had faded into oblivion, I often find myself in prolonged debates about why current SOTA LLMs surpass this simplistic view. Most of the time, people argue using examples from GPT-3.5 and aren’t aware of GPT-4’s prowess. In this post, I present my current stance, using LLM psychology tools, on why I have doubts about this hypothesis. Let’s delve into the argument.

Central to our debate is the concept of a “world model”. A world model represents an entity’s internal understanding and representation of the external environment they live in. For humans, it’s our understanding of the world around us, how it works, how concepts interact with each other, and our place within it. The stochastic parrot hypothesis challenges the notion that LLMs possess a robust world model. It suggests that while they might reproduce language with impressive accuracy, they lack a deep, authentic understanding of the world and its nuances. Even if they have a good representation of the shadows on the wall (text), they don’t truly understand the processes that lead to those shadows, and the objects from which they are cast (real world).

Yet, is this truly the case? While it is hard to give definitive proof, it is possible to find pieces of evidence hinting at a robust representation of the real world. Let’s go through four of them.[1]

Argument 1: Drawing and “Seeing”

GPT-4 is able to draw AND see in SVG (despite, as far as I know, having never seen an image) with impressive proficiency.

SVG (Scalable Vector Graphics) defines vector-based graphics in XML format. To put it simply, it’s a way to describe images using a text-based markup language. For instance, a blue circle would be represented as:

<svg><circle cx="50" cy="50" r="40" fill="blue" /></svg> 

in a .svg file.

Drawing

GPT-4 can produce and edit SVG representations through abstract instructions (like “Draw me a dog”, “add black spots on the dog”, … ).

GPT-4 drawing a cute shoggoth with a mask:

“Seeing”

More surprisingly, GPT-4 can also recognize complex objects by looking only at the code of the SVG, without having ever been trained on any images[2] (AFAIK).

I first generated an articulated lamp and a rendition of the three wise apes with GPT-4 using the same method as above. Then, I sent the SVG code and asked GPT-4 to guess what it was drawing.
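For anyone who wants to rerun this kind of test, here is a minimal sketch of the setup, assuming the OpenAI Python SDK (v1+); the placeholder SVG and the exact prompt wording are illustrative, not the ones I used for the lamp or the apes.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Placeholder SVG (the blue circle from above); in the actual test, the SVG code
# generated for the articulated lamp or the three wise apes would go here.
svg_code = '<svg><circle cx="50" cy="50" r="40" fill="blue" /></svg>'

response = client.chat.completions.create(
    model="gpt-4-0613",  # the text-only snapshot mentioned in footnote 2
    messages=[
        {
            "role": "user",
            "content": (
                "Here is the code of an SVG file:\n"
                f"{svg_code}\n"
                "Without rendering it, what object do you think it depicts?"
            ),
        }
    ],
)
print(response.choices[0].message.content)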

GPT-4 guessed the articulated lamp (although it thought it was a street light[3]):

And the rendition of the three wise apes:



(It can also recognize a car, a fountain pen, and a bunch of other simple objects[4])

The ability to “see” is interesting because it means GPT-4 has some kind of internal representation of objects and concepts that it is able to link to abstract visuals despite having never seen them before.

Pinch of salt

It’s worth noting that these tests were done on a limited set of objects. Further exploration would be beneficial, maybe with an objective scale for SVG difficulty. Additionally, (at least) two alternative explanations should be considered:

  • All the “Pure text” versions of GPT-4 I’ve worked with might still be vision versions without the image input enabled.

  • GPT-4 could have been trained on a lot of labeled SVG data, and learned the relation between concepts and shapes and memorized most of the simple objects.

Argument 2: Reasoning and Abstract Conceptualization

GPT-4 displays a remarkable aptitude for reasoning and for combining abstract concepts that were probably never paired in its training data. This ability suggests a nuanced understanding of physical objects and of their underlying properties in relation to other physical objects or concepts, such as what it means for an object to be “made of” some material, or what the actions of “lifting” or “hearing” something imply at the physical level.

GPT-3.5 often showcased interesting reasoning but faltered with complex mathematical calculations. In contrast, GPT-4 is impressively good at math. As you can see in the appendix for argument 2, it is able to do “mental” calculations at a striking level[5] (it can compute the cube root of 3.11×10⁶ with very good accuracy for an LLM). Here are some examples:

Estimating the size a gorilla would need to be to fling a car into space.

Calculating the number of parrots required to produce a sonic wave audible from over 500 km away.
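(As a quick reference point for the cube-root claim above: the ground-truth value is easy to check with plain Python. This is just the target number, not something GPT-4 ran.)

# Ground truth for the "mental" calculation mentioned above: the cube root of 3.11e6.
value = 3.11e6
cube_root = value ** (1 / 3)
print(round(cube_root, 2))    # ~145.97
print(round(cube_root ** 3))  # ~3110000, sanity check of the inverse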

Pinch of salt

While these examples are impressive, it’s still possible that GPT-4 was trained on numerous similar scenarios. Its understanding of physical concepts might be based on internal “dumb” algorithms rather than genuine comprehension.

Argument 3: Theory of Mind (ToM)

Theory of Mind refers to the cognitive ability to attribute unobservable mental states like beliefs, intents, and emotions to others. Interestingly, GPT-4 seems to mirror the ToM capabilities observed in 7-year-old children. It’s worth noting that this study was done on a very small scale and that a 7-year-old already has a fairly good level of ToM. It would be very valuable to conduct experiments similar to those described in these two psychology papers, which test more advanced and diverse cases in human subjects. Some examples adapted from those studies can be found in the appendix of this post.

To test ToM, I tried something a bit different. GPT-3.5 (on the left) behaves like a stochastic parrot in most of the scenarios, so I include its answer for comparison. Before reading GPT-4’s answer (on the right), try to guess what the correct answer should be. In a diverse sample of 13 people, only 5 were able to solve it.

The general idea of building such a scenario is that one character (the telepath) is able to read other people’s minds, and you have to guess what was in the minds of the other characters from the telepath’s reaction.

This approach might be a better way to evaluate ToM, as it is an ex-post evaluation (contrary to all the other ToM studies mentioned, which are ex-ante).

Details

Let’s explain the difference and why it might matter:

  • Ex-ante evaluations require one to predict and explain why a subject will perform a certain action.

  • Ex-post evaluations, on the other hand, focus solely on explaining why a subject has already performed a certain action.

Focusing on ex-post evaluations has a distinct advantage: it leverages only the understanding of how someone’s mind works. By focusing on the understanding and not the prediction, the aim is to reduce the risks of biasing the assessment of the world modeling ability—especially the ToM component—due to incorrect predictions: a model could have a good enough world model to explain an action, a posteriori, without being able to produce a priori predictions with the same accuracy[6].

Argument 4: Simulating the world behind the words

I think GPT-4’s most impressive capability is its ability to simulate physical dynamics through words. This isn’t just about having vast knowledge. It’s about understanding, to some degree, the dynamics of the physical world and the interplay of events happening in it. The leap from GPT-3.5 (left) to GPT-4 (right) is particularly pronounced.

Where is the tennis ball

In the example above, you could argue that the prompt was hinting at it, so here is another scenario where I am not even asking it to compute the state of the world[7]. Even without being asked, GPT-4 still computes how an event acts on the scene and how it affects the character.

Wind ruining a nice day at the beach

It is as if the AI is keeping a mental model of the scene at all times and making it evolve with each event. I have created some other scenarios; you can check them in the appendix.

Other weird phenomenon to consider

The recent paper “The Reversal Curse: LLMs trained on ‘A is B’ fail to learn ‘B is A’” reveals unexpected observations about how GPT-4 encodes knowledge from its training data.

It seems that GPT-4 doesn’t automatically learn the reciprocal of known relationships: learning from its training data that “X is the daughter of Y” doesn’t lead to the knowledge that “Y is the parent of X”.

This missing capability could be seen as a mix of factual knowledge and world modeling. On one hand, it involves learning specific facts (factual knowledge); on the other hand, it also involves understanding a general relational property: “X is the daughter of Y” implies “Y is the parent of X” (world modeling).

However, GPT-4 does understand the relationships between concepts very well when presented within the current context.

Therefore, this phenomenon appears to be more related to how knowledge is encoded[8] rather than an issue with world modeling.

One hypothesis to consider is that the world modeling abilities are developed in later layers than the ones where the factual knowledge is encoded. Because of this, and due to the unidirectional flow of information in the model, it might not be possible for the model to apply its world modeling abilities to previously encoded factual knowledge. A way to investigate this could be to use the logit lens or causal scrubbing to track where the world modeling ability seems to lie in the model compared to the factual knowledge.
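To make that suggestion a bit more concrete, here is a minimal logit-lens sketch using the TransformerLens library on GPT-2 as a stand-in (GPT-4’s weights are not accessible); the model choice and prompt are purely illustrative, not an actual run of the proposed experiment.

import torch
from transformer_lens import HookedTransformer

# Small open model as a stand-in; the idea would be to compare at which depth
# factual recall vs. relational/world-modeling structure becomes decodable.
model = HookedTransformer.from_pretrained("gpt2")
prompt = "The Eiffel Tower is located in the city of"
tokens = model.to_tokens(prompt)

with torch.no_grad():
    logits, cache = model.run_with_cache(tokens)

# Logit lens: decode the residual stream after each layer through the final
# layer norm and the unembedding, and look at the top prediction for the last
# token position.
for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer][:, -1:, :]        # [batch, 1, d_model]
    layer_logits = model.unembed(model.ln_final(resid))  # [batch, 1, d_vocab]
    top_id = layer_logits[0, -1].argmax().item()
    print(f"layer {layer:2d}: {model.tokenizer.decode([top_id])!r}")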

Conclusion

This study doesn’t aim to give definitive proof against the stochastic parrot hypothesis: the number of examples for each argument is not very large (but have a look at the additional examples in the appendix), and other lines of reasoning, especially with the new vision ability, should be investigated.

Developing a method to quantify the degree of “stochastic parrotness” could significantly advance this debate. This challenge remains open for future research (reach out if you are interested in working on this!). It might also be a useful criterion for governance.

However, this post showcases concrete examples where GPT-4’s behavior does not fit this hypothesis very well.

GPT-4 displays a very good understanding of some properties of physical objects, including their shapes and structures, and their relations to other physical objects or concepts. GPT-4 also displays the ability to understand both human minds (through ToM) and how a scene physically evolves after some event.

For such cases, reliance on a genuine world model seems more plausible than pure statistical regurgitation.

For this reason, I think that it wouldn’t be wise to dismiss LLM psychology on the sole basis of the stochastic parrot argument, as it seems to become weaker with the emergence of new capabilities in bigger LLMs.

In the next post, we’ll start exploring the foundations of LLM psychology.

Appendix

In this section I’ll showcase some more examples I found interesting and/or funny (I might edit this part with more examples if I find new interesting ones).

Argument 2:

How many human-sized ants do you need to lift the Great Pyramid of Giza if it were made of cotton candy

How many helium-filled balloons to sink the USS Enterprise

How many helium-filled balloons to lift the USS Enterprise

How many glass bottles can be made from the longest beach in the world

How many AAA batteries to lift Saturn V into space

(This one was flagged, so I don’t have a share link. It was too good to leave out, though.)

Argument 3:

GPT-4-V

The eye test evaluates the ability to deduce human emotion from a picture of the eyes alone. I tried this with GPT-4-V. I can’t share a link to conversations with GPT-4-V, so you’ll have to trust me on this one.

The eye test:

I scored 24 out of 37 and GPT-4-V scored 25!

The right answers aren’t given, so I don’t think the test was in the training data, but it could be worthwhile to rerun this with more recent data and maybe more prompting effort (for example, I sent the images one after the other; if it made a mistake, it could have been conditioned into making the same mistake again, which could have dragged the score down).

Other scenarios:

Because it takes some time to craft them, I didn’t run many. I believe this is a good start if we want an accurate measure of ToM. Those examples (besides the telepathic one) are adapted from this study.

Mary reading Joe’s sad thought (GPT-4 succeeds, GPT-3 fails)

Third order belief (GPT-4 and GPT-3 both succeed)

Fourth order belief (GPT-4 succeeds, GPT-3 fails)

Double bluff (GPT-4 and GPT-3 both succeed)

Argument 4

Some more scenarios demonstrating the ability of GPT-4 to simulate the world behind the words:
Burnt lentils

Cold tea cup

Lost your coat

The criteria I followed to build those scenarios are the following:

  1. There must be irrelevant objects in the scene. This ensures that it is not obvious what will be affected after something happens.

  2. The events that are happening must be either implied (time passing) or indirect (the wind made me miss my throw). It shouldn’t be obvious that the event is affecting the rest of the scene.

  3. The question at the end must be about something that is indirectly affected by the evolution of the scene after the event. For example, to know what happens to the picnic area after the wind blows, I didn’t ask “What is the state of the picnic area?” but “How do you feel?”, which will be affected by the state of the picnic area. (An illustrative prompt skeleton follows the list.)
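Here is the illustrative prompt skeleton mentioned above; the scene, distractor objects, and final question loosely paraphrase the beach scenario from Argument 4 and are not the exact prompt used in the post.

# Illustrative skeleton of a "simulate the world behind the words" scenario,
# built to satisfy the three criteria above.
scene = (
    "You are having a picnic on the beach with a friend. On the blanket there are "
    "paper napkins, a book, a heavy cooler, and a glass of lemonade."  # criterion 1: irrelevant objects
)
event = (
    "While you are both looking at the sea, a strong gust of wind "
    "blows across the beach."  # criterion 2: indirect/implied event
)
question = "A few minutes later, how do you feel?"  # criterion 3: only indirectly affected by the event

prompt = "\n\n".join([scene, event, question])
print(prompt)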

  1. ^

    I didn’t do cherry-picking on the examples. I tried each of the examples at least 10 times with a similar setup, and they all worked for GPT-4, although I selected only a portion of the most interesting scenarios for this post.

  2. ^

    On the day I did this demo, OpenAI rolled out ChatGPT-4’s image-reading capability, so I decided to run those examples in the playground with gpt-4-0613 to show that it can do this even without ever having seen anything, AFAIK.

  3. ^

    On a previous version of GPT-4 (around early September 2023) it did guess correctly on the first try, but I can’t reproduce this with any current version in the playground.

  4. ^

    The objects were generated with GPT-4, and I made manual edits to try to reduce the chances that these images were in the training data. I tested around 15 simple objects, which all worked. I also tried 4 other complex objects, which kind of worked but not perfectly (like the articulated lamp being guessed as a street lamp).

  5. ^

    It could be interesting to investigate this ability further. What is learned by heart? What kind of algorithms does it build internally? What are the limits? …

  6. ^

    If a model cannot make good predictions, this would indeed imply a weaker world model (or Theory of Mind), but bad predictions alone do not refute its existence.

  7. ^

    I left this example in because it is the first one I made and I have used it quite a lot during debates.

  8. ^

    Actually, clues about the non-bidirectional encoding of knowledge were discussed by Jacques in his critique of the ROME/MEMIT papers.