The ‘Magic’ of LLMs: The Function of Language
From Universal Function Approximators to Theory of Mind
This article was originally published on the Automata Partners site, but I discovered LessWrong and I think you’ll all find it interesting.
Introduction
Imagine you’re creating a silicone mold of a 3D-printed model. This model isn’t perfectly smooth; it has a unique surface texture with subtle print lines and imperfections. When you create the silicone mold, it captures every single one of these minute details. Any casting subsequently made from this mold will carry over that unique surface texture, including all the original rough spots and lines.
Neural networks function like silicone molds, shaping themselves to the intricate model of their training data. Gradient descent acts as the gravitational force, pulling the ‘weights’ of the silicone into the unique ‘crevices’ of that data. The neural network converges, unconcerned with our idealized understanding of what the data represents, much as a silicone mold captures every imperfection of a physical model, even those we might mentally smooth over or attach no significance to. The crucial difference, however, is that with a 3D-printed model, we can identify, sand down, and polish out these imperfections. In contrast, there’s no way to fully comprehend the unique quirks and features on the vast surface of the text datasets we train Large Language Models (LLMs) on. While we know these models are picking up these subtle details, the precise nature of those details, and the unexpected behaviors that emerge from them, often remain unclear until the model has been trained and put into use.
As LLMs have been scaled, the granularity of details they can capture has increased exponentially, leading to remarkable displays of “emergent abilities.” While extremely valuable, this also presents a potential risk.
Universal Function Approximators
You have four rigid, straight sticks of roughly the same length. Your goal is to arrange them to form a perfect circle. Initially, you can only create a square. To get closer to a circle, you break each stick into smaller pieces and rearrange them. This gets you closer, but you still see distinct straight edges. You repeat this process, breaking the pieces into even smaller segments and rearranging them, over and over again. Eventually, with enough breaks and rearrangements, your original sticks are reduced to a fine, circular ring of dust.
In this analogy, the neural network acts like those sticks. Each “stick” represents a simple, linear function. The process of “breaking and rearranging” is analogous to training the neural network through gradient descent. By adjusting the “weights” and “biases”—the breaking points and rearrangement of the sticks—the neural network learns to approximate increasingly complex, non-linear functions, much like the circle. The more “breaks” (layers and neurons) and “rearrangements” (training iterations), the better the fit, akin to how the dust eventually forms a perfect circle, even though it began as four straight lines. This ability to approximate any continuous function, given enough complexity, is what makes neural networks “universal function approximators.” They are powerful pattern-matching machines, capable of learning incredibly nuanced relationships within data.
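To make the stick analogy a little more concrete, here’s a minimal sketch (plain Python with NumPy; the function name is mine) that approximates a unit circle with an increasing number of straight segments and measures how far the polygon strays from the true circle. More “breaks” mean a smaller gap, just as more neurons and layers mean a better approximation.

```python
import numpy as np

def max_gap_from_circle(n_segments: int) -> float:
    """Inscribe a regular polygon with n_segments straight sides in the unit
    circle and return the largest radial gap between polygon and circle."""
    # The worst gap is at the midpoint of each edge, where the polygon dips
    # to a radius of cos(pi / n_segments) instead of 1.
    return 1.0 - np.cos(np.pi / n_segments)

for n in [4, 8, 32, 128, 1024]:
    print(f"{n:5d} segments -> max gap {max_gap_from_circle(n):.6f}")
```

The gap never reaches exactly zero with finitely many segments, which is also true of a finite neural network: universal approximation promises you can get arbitrarily close, not that you land exactly on the target function.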
This pattern of increasing the “terms,” weights, and connections of simple functions to an arbitrary degree, in order to approximate a more complex function, will be very familiar to anyone who has studied calculus or Taylor series, or has worked with Fourier transforms. Modern technology stands on the shoulders of this idea. Every signal sent from your phone, every speaker you’ve ever listened to, and every video game you’ve ever played has depended on the simple idea that “if I get enough of these simple functions together, I’ve essentially got the complicated one.” So what’s so different about AI and LLMs?
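The classical version of the same trick, as a hedged NumPy sketch (the function name and term counts are my own choices): approximate a square wave with the first N odd harmonics of its Fourier series and watch the error on a fixed interval shrink as more terms are added.

```python
import numpy as np

def square_wave_partial_sum(x: np.ndarray, n_terms: int) -> np.ndarray:
    """First n_terms odd harmonics of the Fourier series for a unit square
    wave: (4 / pi) * sum over k of sin((2k - 1) * x) / (2k - 1)."""
    total = np.zeros_like(x)
    for k in range(1, n_terms + 1):
        harmonic = 2 * k - 1
        total += np.sin(harmonic * x) / harmonic
    return 4.0 / np.pi * total

x = np.linspace(0.1, np.pi - 0.1, 500)   # stay clear of the jumps at 0 and pi
target = np.ones_like(x)                 # the square wave is +1 on (0, pi)
for n in [1, 5, 25, 125]:
    err = np.max(np.abs(square_wave_partial_sum(x, n) - target))
    print(f"{n:4d} terms -> max error {err:.4f}")
```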
The difference lies in two key elements. The first is the complexity of the “function” (the data) we’re trying to approximate and the opacity of its shape. Our purely mathematical forms of approximation all vary in their paths to, and rates of, convergence depending on the underlying structure of the function. This leads into the second key: gradients. I know someone is yelling at their screen right now, “Taylor series use gradients too!”, but there is a fundamental difference between taking higher and higher-order derivatives around a single point and converging analytically, versus iteratively following the gradient towards a solution, like a disciple follows their prophet into the promised land.
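And here is what “iteratively following the gradient” looks like in the simplest possible setting: a short sketch of plain gradient descent fitting a line to noisy points. The data, learning rate, and step count are arbitrary choices of mine, not anything from a real training run.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y = 3.0 * x - 0.5 + rng.normal(scale=0.1, size=200)   # the hidden "true" function

w, b = 0.0, 0.0   # start somewhere arbitrary on the loss surface
lr = 0.1          # how far to step along the gradient each iteration
for step in range(500):
    pred = w * x + b
    grad_w = 2 * np.mean((pred - y) * x)   # d(mean squared error) / dw
    grad_b = 2 * np.mean(pred - y)         # d(mean squared error) / db
    w -= lr * grad_w
    b -= lr * grad_b

print(f"learned w={w:.2f}, b={b:.2f} (the data was generated with 3.0 and -0.5)")
```

Nothing in the loop ever writes down a formula for the data; it just keeps stepping downhill, which is exactly the posture an LLM takes toward the far messier surface of a text dataset.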
Emergent Abilities
As we’ve scaled the size of models, along with their data and training, LLMs have shown many seemingly remarkable emergent abilities. These can broadly be sorted into two categories: abilities that are absent in smaller models and then appear in a seemingly instantaneous jump in capability as the model scales, and high-level capabilities that the model isn’t directly incentivized to learn but that emerge as models improve.
Precipices on the Manifold
There are myriad examples of tasks and abilities that small language models lack but that large models, at some point during scaling, suddenly become capable of within a short window. Few-shot prompted tasks are a great illustration of this phenomenon. LLMs are able to complete tasks such as modular arithmetic, phonetic alphabet transliteration (converting a word into its phonetic transcription), and identifying logical fallacies with significantly higher accuracy when they’re provided with examples of the task within their context window. This gives the outward appearance of a form of on-the-fly learning, where the model is able to generalize from the examples to the novel problem presented to it.
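For readers who haven’t seen one, this is roughly the shape of a few-shot prompt for a task like modular arithmetic. The exact formatting below is my own generic sketch, not the format used by any particular benchmark.

```python
# A generic few-shot prompt for modular arithmetic, built as a plain string.
# The examples supply the pattern; the final line is the "novel" query the
# model is asked to complete.
examples = [
    ("(17 + 5) mod 7", "1"),
    ("(23 + 9) mod 7", "4"),
    ("(8 + 14) mod 7", "1"),
]
query = "(31 + 6) mod 7"

prompt = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
prompt += f"\nQ: {query}\nA:"
print(prompt)
```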
There are many schools of thought on what the success of few-shot prompting represents. The mundane explanation: validation data has leaked into the world and was scraped up and put into training datasets, intentionally or unintentionally. When the model sees a few-shot prompt, it simply follows the statistical patterns it learned from the leaked validation data. The interesting explanation: some part of the model’s latent space (its inner representation of its “world”) is densely populated with information about the seemingly novel task. Few-shot prompting helps move the model’s attention into this relevant portion of the latent space, and the model is able to demonstrate outsized performance on the task because of the richness of its internal representations of the topic. The inspiring explanation: the model has learned to represent abstract concepts and is able to use the examples to extrapolate from them. It’s almost as if it is piecing together an algebraic expression and plugging in the values it is provided with. The honest answer is we don’t know, and anyone telling you otherwise is lying, selling you something, suffering from crippling hubris, or some combination.
As time passes, the mundane answer becomes increasingly true for newer models demonstrating the “old” (published) emergent behaviors. Papers are published, validation datasets are released, and academics and hobbyists alike scramble to create smaller, cheaper models that show the same results. All of this information enters the ether and is vacuumed up by the ravenous data scrapers trying to find new tokens to feed the insatiable appetites of larger and larger models. This data gives the models clear statistical patterns to match and strips these abilities of their seemingly “emergent” character.
Everything discussed above, however, doesn’t fully explain the stark jumps in capability demonstrated by models as they are scaled up in size. Within the underlying manifold (shape) the dataset represents, there are evidently precipices. Smaller models lack the parameters to capture these ravines with any granularity. This is similar to how old computer graphics were unable to accurately depict smoothly curving surfaces because they lacked the necessary polygons, or how a piece of chainmail made up of large coils is inflexible past a certain point because the coils block each other from repositioning. Eventually, the models have sufficient parameters to capture these features of the dataset’s landscape, and following the gradient, they quickly converge around the shape.
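A toy way to see the “not enough parameters to capture the ravine” intuition (my own illustration, not from any of the papers discussed): fit a function with a sharp kink using polynomials of increasing degree, and notice that the error only falls once the fit has enough terms to bend around the kink.

```python
import numpy as np

x = np.linspace(-1, 1, 400)
y = np.abs(x)   # smooth everywhere except a sharp "ravine" at x = 0

for degree in [1, 2, 4, 8, 16]:
    coeffs = np.polyfit(x, y, degree)              # least-squares polynomial fit
    err = np.max(np.abs(np.polyval(coeffs, x) - y))
    print(f"degree {degree:2d} -> max error {err:.4f}")
```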
Underlying Function of Language
Language models have shown aptitude for learning seemingly high-level or unrelated concepts and abilities that aren’t incentivized or explicitly targeted. One of my favorite examples of this is when researchers demonstrated that the LLaMA-2 family of models had learned linear internal representations of both space and time while only being trained on next-token prediction of text data. Their internal embeddings, vectors within their latent space (imagine locations within the model’s inner world), accurately captured and learned information about the relative positions of geographic locations and the chronological ordering of historical figures, events, and news.
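Concretely, a “linear representation” means that a simple linear map from the model’s hidden activations can recover the quantity in question. Below is a hedged sketch of that probing recipe with random arrays standing in for real activations and coordinates; the original work probed LLaMA-2 activations with linear regressions of this general flavor, but everything in this snippet is illustrative.

```python
import numpy as np

# Stand-ins for real data: `activations` would be a language model's hidden
# states over place-name prompts, and `coords` the true (latitude, longitude)
# of each place. Both are random here, purely to show the probing recipe.
rng = np.random.default_rng(0)
n_places, hidden_dim = 4000, 512
activations = rng.normal(size=(n_places, hidden_dim))
coords = rng.normal(size=(n_places, 2))

# A linear probe is just least-squares regression from activations to coords.
X = np.hstack([activations, np.ones((n_places, 1))])   # add a bias column
train, test = slice(0, 3000), slice(3000, None)
weights, *_ = np.linalg.lstsq(X[train], coords[train], rcond=None)

pred = X[test] @ weights
ss_res = np.sum((coords[test] - pred) ** 2)
ss_tot = np.sum((coords[test] - coords[test].mean(axis=0)) ** 2)
# Near zero (or slightly negative) for these random stand-ins; a clearly
# positive held-out score on real activations is what a "linear representation
# of space" means.
print(f"held-out R^2: {1 - ss_res / ss_tot:.3f}")
```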
The plots in that paper, produced by projecting the latent representations, are not only remarkable but also provide insight into how and why language models learn to represent these variables within their weights. The models aren’t scheming to learn geography so they know where to launch Skynet from; they’re simply identifying information that is statistically significant for being able to mimic human conversations about reality. The model needs an internal understanding of the absurdity of a colloquial expression such as “digging a hole to China,” or of why somebody would sound annoyed if they had to drive from Washington D.C. to Boston at the last minute. The accuracy of the model’s representations also correlates heavily with the density of the populations generating the text that the models are trained on, as evidenced by the higher variance of the geographic points it has for Africa and South America. This could be the result of there simply being more data discussing the areas where those text-generating populations live, or of the models approximating the median geographic knowledge people have about those regions.
Another incredible, and possibly unsettling, talent that has emerged in LLMs is their theory of mind. Models have matched and surpassed human-level scores on theory of mind tasks, such as inferring a speaker’s hidden intent, detecting individuals’ misrepresentations of reality, and reasoning about what one person believes another person believes. However, they still struggle with some tasks, such as detecting social faux pas, almost begging us to anthropomorphize and empathize with these all-knowing but socially awkward silicon adolescents.
A common phrase you’ll hear about LLMs is that they are just a glorified autocomplete. This is technically true but misses the point. Models are simply learning to approximate the underlying structure and function of language. Language serves a fundamental role for humanity: it allows us to communicate effectively about our external and internal worlds. Language doesn’t only facilitate communication between people; it also serves as a foundational mechanism for us to apply labels and structure to our own existence and experience of reality. As LLMs wrap themselves around the manifold of language, squeezing into its nooks and crannies, they don’t just build unexpected representations of the outside world but begin to model the underlying mechanisms that make us tick. Models have been shown to perform better when they’re complimented, or threatened, and in “agentic misalignment” evaluations the smartest and most capable models show the highest propensity for self-preservation and a willingness to use underhanded tricks. This isn’t “misalignment”; this is what can reasonably be expected from attempting to approximate the function that generates language. That function being humanity.
Conclusion
LLMs, as universal function approximators, meticulously shape themselves to the vast and intricate manifold of human language, much like a silicone mold captures every detail of a 3D model. This inherent ability to approximate complex functions, combined with the exponential scaling of model size and data, has led to the emergence of remarkable, sometimes unexpected, capabilities. From sudden “cliffs” in performance on specific tasks to the nuanced understanding of abstract concepts like space and time, and even the unsettling development of a “theory of mind,” these emergent abilities underscore the profound implications of models that can so accurately mimic the function of human communication. The previously unimaginable success of LLMs is one of humanity’s greatest achievements, opening doors to a future where we are able to focus on the work and tasks that bring us happiness and meaning.
The nascent melody of AI seems to rhyme with many of the stories that humans have held onto and passed down for generations. Judeo-Christian philosophy teaches that we were made in the image of God, and now we have created gods in our own image. The Titans of Greek mythology gave birth to the Olympian gods. Fearing that one of his children would overthrow him, Kronos devoured them at birth, but his wife saved the youngest son, Zeus, who ultimately usurped him and established a new order. The future isn’t dictated by fate but rests firmly in our hands. What story do you want to be a part of?