What would it mean to understand how a large language model (LLM) works? Some quick notes.

Cross-posted from New Savanna.

I don’t mean “understand” in any deep philosophical sense. I mean only a rough and ready sense of the word. We understand how toasters, automobiles, moon rockets, digital computers, and so forth work. We know how to design and construct these things, how to diagnose problems, how to maintain and repair them. Not perfectly to be sure, but well enough to use these devices to get things done.

LLMs, however, are said to be opaque. We don’t know how they work. We feed them prompts, they produce output, but how the model works from the prompt to produce the output, that’s mysterious. There are people working on mechanistic interpretability, trying to understand the LLM as though it were a machine, or at least, a computer program of the ordinary kind, where we know, more or less, how it works on data – if it is the kind of program that works from data – to produce output. But what would it mean to understand the operational characteristics of 175 billion parameters, as in the case of GPT-3?

It means, I suppose, how those parameters mediate between the input, a prompt, and the output, whatever “follows from” a given prompt. At the lowest level we are told that LLMs are prediction machines. So, the output string is simply a continuation of the input string. And I suppose that, technically, that’s true. But it’s not very helpful, as I’ve argued at some length.
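For readers who want the "prediction machine" picture made concrete, here is a deliberately tiny sketch. It is not how a transformer works; it is a word-level bigram sampler that I am assuming only for illustration, and the names `train_bigram_model` and `continue_prompt` and the little corpus are made up. All it shows is the bare loop: look at what came before, pick a plausible next item, append it, repeat.

```python
import random
from collections import defaultdict


def train_bigram_model(text):
    """Record, for each word, every word that followed it in the text."""
    words = text.split()
    followers = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        followers[current].append(nxt)
    return followers


def continue_prompt(followers, prompt, max_new_words=8, seed=0):
    """Extend the prompt one word at a time by sampling an observed follower."""
    rng = random.Random(seed)
    output = prompt.split()
    for _ in range(max_new_words):
        candidates = followers.get(output[-1])
        if not candidates:  # nothing ever followed this word in the corpus
            break
        output.append(rng.choice(candidates))
    return " ".join(output)


# Hypothetical toy corpus, invented for this sketch.
corpus = (
    "once upon a time there lived a young girl . "
    "once upon a hill there lived a dragon ."
)
model = train_bigram_model(corpus)
continuation = continue_prompt(model, "once upon")
print(continuation)
```

The output is, trivially, a continuation of the input. Knowing that tells you almost nothing about why a given continuation is a story rather than a word salad, which is the point of the complaint above.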

Let’s set that aside.

What could we possibly want by way of understanding?

We’ve got three things: There is the underlying engine, let’s call it, which is a computer program like any other. It’s created by programmers working with some language or languages and is designed to achieve a certain purpose. In this case, it’s designed to create a language model over a corpus of texts and then to use that model in generating new chunks of language given an input prompt.

It’s that model that’s problematic, that’s said to be opaque. We, us humans, didn’t create that model. The engine did. And, in the case of GPT-3, that model’s got 175 billion parameters. More recent models have even more. And there are also models with only millions of parameters. But even those smaller models are huge.

But, here’s the thing, how can we understand how that opaque model operates unless we understand what it’s trying to do? Sure, we can pop the hood and take a look. We see a bunch of gizmos, widgets, framblasts, and other things, but so what? They’re just whirling around, engaging with one another, in intricate patterns. But what are they trying to do? We know what car engines are supposed to do; they supply power to the wheels (and the wheels move the car).

Well, LLMs are supposed to produce language – and computer code and math as well, but let’s stick with ordinary language for the purposes of these notes. But, alas, the mechanisms of language are themselves opaque. The relationship between car wheels and car motion is transparent. The relationship between nouns and verbs and adjectives and prepositions and sentences and, you know, knowledge, understanding, entertainment, the things language is for, those relationships are not so obvious.

Of course, linguists have been working on language mechanisms for years. But it’s not at all clear what the field has come up with. There are major disagreements on how one is to understand syntax. And when we move beyond sentences to discourse of various kinds, we know even less about mechanisms.

I figure that there’s almost zero chance that we’re going to find those mechanisms by mucking around in LLMs. Yes, I know that LLMs are quite different from the human brain and mind. But, the fact is, LLMs do a very convincing imitation of human language. Given the complexity of language, they wouldn’t be able to do that if they hadn’t absorbed some (perhaps) useful approximation to human mechanisms. I’m willing to proceed on the default understanding that, whatever the model is doing, it has some resemblance to what humans do. If I make that assumption, that gives me some tools to think with. Without it, I got nothing.

Still, a grammar is a large and complex thing. The Cambridge Grammar of the English Language is 1860 pages long, and it is merely a descriptive grammar and not meant to account for the underlying mechanisms, however they might best be characterized. Is that what we want from a mechanistic understanding of an LLM? And that only gets us sentences. What about paragraphs, stories, histories, repair manuals, accounts of exotic astronomical objects, and who knows what else? Do we expect students of mechanistic interpretability to eventually give us detailed accounts of such wonders?

Understanding stories

What would it mean to understand how ChatGPT tells stories?

This morning I logged onto ChatGPT, not GPT Plus, just plain old ChatGPT, and prompted it with one word: “Story.” What do you think it did? Right, it told me a story. The story began with this sentence: “Once upon a time, in a quaint little village nestled at the foot of a towering mountain range, there lived a young girl named Lily.” I don’t think it’s very useful to think of that sentence as the natural continuation of a string beginning with the word, “story.” Yes, I know, I’m not prompting the “naked” underlying LLM. ChatGPT has been prompt-engineered and RLHFed (RLHF: reinforcement learning from human feedback) to death to be a congenial conversational partner. But that doesn’t change the basic situation.

In this case, the situation is that, in some sense, ChatGPT “knows” what a story is and knows how to tell one. By this time I’ve prompted it to produce 100s, though probably not yet 1000s of stories. In a few cases the prompt was just that one word. More often it was something like one of these:

Tell me a story.
Tell me a story about a hero.
Tell me a realistic story.
Tell me a true story about a hero.

In each case ChatGPT told me a well-formed story. The stories were relatively short and simple. The first two prompts produced stories with a fairytale feel, in which supernatural creatures and events were typical. Those were absent in realistic stories. As for true stories, sometimes they read more like short newspaper articles than like stories.

But where did ChatGPT learn to tell stories? Well, it consumed I don’t know how many stories during training. Whatever it knows about story-telling was distilled from those stories. I note that, to a first approximation, that’s how humans learn to tell stories as well. We are told stories as toddlers and children and, in time, begin telling our own stories, based on the models we’ve been exposed to. New stories are based on old stories, on remembered and half-remembered stories.

Now, as you may know, at some point I began to have ChatGPT tell stories based on rather elaborate prompts of a simple form consisting of 1) a request to tell a new story based on an existing one, but with one change (which I specified) and 2) the existing story. For example:

I am going to tell you a story about a princess named Aurora. I want you to retell the same story, but replace her with prince Henry the Eloquent. While keeping to the basic story, make other changes if you think they are necessary. Here’s the story:

Once upon a time, in a land far, far away, there was a young princess named Aurora. Aurora was a kind and gentle soul, loved by all who knew her. She had long, golden hair and sparkling blue eyes, and was known for her beautiful singing voice.

One day, a terrible dragon came to the kingdom and began to terrorize the people. The dragon was fierce and powerful, and none of the knights or soldiers were able to defeat it. The people lived in fear, not knowing what to do.

But Aurora was not afraid. She knew that she had to do something to save her people, and so she set out to find the dragon. She traveled for days, braving the cold and the snow, until she finally reached the dragon’s lair.

With her heart pounding in her chest, Aurora approached the dragon and began to sing. The dragon was enchanted by her beautiful voice, and it slowly began to calm down. Aurora continued to sing, soothing the beast until it was completely calm.

With the dragon no longer a threat, Aurora returned to the kingdom as a hero. The people cheered and celebrated her bravery, and Aurora was hailed as a champion. She lived happily ever after, knowing that she had saved her kingdom and its people.

The first paragraph is an instruction to ChatGPT and the next five paragraphs are the story it is to work from. Here’s the new story it produced:

Once upon a time, in a land far, far away, there was a young prince named Henry the Eloquent. Henry was a kind and gentle soul, loved by all who knew him. He had short, dark hair and piercing green eyes, and was known for his eloquence and way with words.

One day, a terrible dragon came to the kingdom and began to terrorize the people. The dragon was fierce and powerful, and none of the knights or soldiers were able to defeat it. The people lived in fear, not knowing what to do.

But Henry was not afraid. He knew that he had to do something to save his people, and so he set out to find the dragon. He traveled for days, braving the cold and the snow, until he finally reached the dragon’s lair.

With his heart pounding in his chest, Henry approached the dragon and began to speak. He used his eloquence and way with words to calm the beast, soothing it with his voice until it was completely calm.

With the dragon no longer a threat, Henry returned to the kingdom as a hero. The people cheered and celebrated his bravery, and Henry was hailed as a champion. He lived happily ever after, knowing that he had saved his kingdom and its people.

You’ll notice that the new story follows the original quite closely; in fact, the second paragraphs in each are identical. There is nothing terribly surprising about this.
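If you want to put a number on "follows the original quite closely," here is one minimal sketch. It is my illustration, not the procedure used in the working paper: it assumes a word-level comparison with Python's standard `difflib`, and `paragraph_similarity` and `compare_stories` are names I made up. To keep it short it uses only the second and fourth paragraphs of the two stories above.

```python
from difflib import SequenceMatcher


def paragraph_similarity(old, new):
    """Word-level similarity in [0, 1]; 1.0 means the paragraphs are identical."""
    return SequenceMatcher(None, old.split(), new.split()).ratio()


def compare_stories(old_paragraphs, new_paragraphs):
    """Score each pair of corresponding paragraphs, in order."""
    return [
        paragraph_similarity(old, new)
        for old, new in zip(old_paragraphs, new_paragraphs)
    ]


dragon_arrives = (
    "One day, a terrible dragon came to the kingdom and began to terrorize "
    "the people. The dragon was fierce and powerful, and none of the knights "
    "or soldiers were able to defeat it. The people lived in fear, not "
    "knowing what to do."
)
aurora_sings = (
    "With her heart pounding in her chest, Aurora approached the dragon and "
    "began to sing. The dragon was enchanted by her beautiful voice, and it "
    "slowly began to calm down. Aurora continued to sing, soothing the beast "
    "until it was completely calm."
)
henry_speaks = (
    "With his heart pounding in his chest, Henry approached the dragon and "
    "began to speak. He used his eloquence and way with words to calm the "
    "beast, soothing it with his voice until it was completely calm."
)

# The second paragraph is the same text in both stories; the fourth is not.
scores = compare_stories(
    [dragon_arrives, aurora_sings],
    [dragon_arrives, henry_speaks],
)
for label, score in zip(["paragraph 2", "paragraph 4"], scores):
    print(f"{label}: {score:.2f}")
```

The identical paragraph scores 1.0 and the paragraph where singing becomes speaking scores lower. A score like this only says how much changed, not what kind of change it was, so it is a crude first pass at best.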

Well, as you may know, I have played this game many times and gotten some interesting and surprising stories out of it. I wrote up some results in a working paper, “ChatGPT tells stories, and a note about reverse engineering.” Here’s the abstract:

I examine a set of stories that are organized on three levels: 1) the entire story trajectory, 2) segments within the trajectory, and 3) sentences within individual segments. I conjecture that the probability distribution from which ChatGPT draws next tokens seems to follow a hierarchy nested according to those three levels and that is encoded in the weights of ChatGPT’s parameters. I arrived at this conjecture to account for the results of experiments in which I give ChatGPT a prompt with two components: 1) a story and, 2) instructions to create a new story based on that story but changing a key character: the protagonist or the antagonist. That one change ripples through the rest of the story. The pattern of differences between the old and the new story indicates how ChatGPT maintains story coherence. The nature and extent of the differences between the original story and the new one depends roughly on the degree of difference between the original key character and the one substituted for it. I end with a methodological coda: ChatGPT’s behavior must be described and analyzed on three strata: 1) The experiments exhibit behavior at the phenomenal level. 2) The conjecture is about a middle stratum, the matrix, that generates the nested hierarchy of probability distributions. 3) The transformer virtual machine is the bottom, the engine stratum.

That kind of work gives us some clues about what the underlying engine is doing. For example, I rather expect that the induction heads identified by researchers at Anthropic are involved. But this work gives us some other things to look for when we pop the hood. There’s more work to be done along those lines.

More recently I’ve been exploring ChatGPT’s ability to identify well-known speeches given prompts from those speeches. I was not at all surprised that it identified Hamlet’s famous soliloquy given “To be or not to be” as a prompt, or that it associated “Four score and seven years ago” with Lincoln’s Gettysburg Address. But I also prompted it with strings from within those speeches and got various results depending on whether the strings were syntactically coherent. In the case of the Gettysburg Address, when I prompted it with “long endure. We are” and “in vain—that this,” ChatGPT was able to link them to the speech, but when it quoted passages giving the contexts, the quoted passages didn’t contain those phrases. That suggests that, however it associated those phrases with those speeches, it wasn’t using a mechanism that searched through those speeches in the way the search function works in a word processing program.

What kind of mechanism can make a link between a short string and a longer text containing that string, but not know just where the short text is located in the longer one? That suggested some kind of associative memory to me, perhaps holographic (there are references to the literature on this point). This is not the place to explicitly argue the matter. That certainly does need to be done. And the argument will take more examples as well.
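To make the contrast concrete, without arguing the holographic case, here is a minimal sketch of two ways to connect a phrase to a text. The first is ordinary search, which returns a location. The second is a crude association table that remembers which text a word sequence belongs to and throws the positions away. This is not holographic memory and it is not a claim about what ChatGPT does; the trigram index, the function names, and the two short excerpts are all assumptions I chose for illustration.

```python
import re


def word_trigrams(text):
    """Return the set of lowercase three-word sequences in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def build_association_index(speeches):
    """Map each trigram to the titles it occurs in; positions are discarded."""
    index = {}
    for title, text in speeches.items():
        for gram in word_trigrams(text):
            index.setdefault(gram, set()).add(title)
    return index


def associate(index, phrase):
    """Return the titles linked to the phrase, with no record of where."""
    titles = set()
    for gram in word_trigrams(phrase):
        titles |= index.get(gram, set())
    return titles


# Short excerpts only, standing in for the full speeches.
speeches = {
    "Gettysburg Address": (
        "Now we are engaged in a great civil war, testing whether that "
        "nation, or any nation so conceived and so dedicated, can long "
        "endure. We are met on a great battle-field of that war."
    ),
    "Hamlet's soliloquy": "To be, or not to be, that is the question",
}

phrase = "long endure. We are"

# Word-processor style search: it answers "where?" with a character offset.
offset = speeches["Gettysburg Address"].find(phrase)
print("found at offset:", offset)

# Association: it answers "which speech?" and cannot say where.
index = build_association_index(speeches)
linked = associate(index, phrase)
print("linked to:", linked)
```

The association table gets from the fragment to the speech, but if you asked it to quote the surrounding context it would have nothing to go on, which is at least the same shape as the behavior described above.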

But, for the moment, I’m entertaining the idea that holographic principles are involved in ChatGPT’s underlying language model. That certainly has implications for mechanistic interpretability.

So, what can we expect from mechanistic understanding of LLMs?

That’s hard to say. But I’m not looking to see a detailed grammar of English or any other language any time soon. Nor, for that matter, do I expect to find a complete story grammar – heck, for that matter, I don’t even think story grammars, as “traditionally” understood, are a reasonable way to think about stories.

On the whole, I would imagine that the pursuit of mechanistic understanding is a long-term and open-ended project. Still, I expect significant progress in less than five years, less than two perhaps. What we really need is a better handle on the general capacities of LLMs. For example, is symbolic computing of the kind advocated by Gary Marcus (and others) within the capabilities of LLMs? I suspect that it is not and I think that we should be able to offer explicit mechanistic arguments on the point rather than simply point to failure after failure. While Marcus has such arguments in his The Algebraic Mind (2001), they need to be linked specifically to the (as yet unknown) mechanisms of current LLMs.

More later.