How LLMs Learn: What We Know, What We Don't (Yet) Know, and What Comes Next

Humans are amazing.

And–let’s be honest–pretty weird.

I mean, why are so many of us all hyped up about Large Language Models (LLMs)? How did we collectively decide this kind of automated decision-making is “the next big thing”? It’s not like a talking thesaurus can change the world, right?*

The thing most people seem to miss is that LLMs don’t understand humans.

They can generate high-quality content, true, and some of them are already in the top 95th percentile when it comes to processing text, video, medical data etc. But they have no idea what a human “is”.

Don’t get me wrong, I think LLMs are an amazing technology–I’ve been working with language models since 2017–but I am also quite sceptical about the world-changing potential these models have.

So I thought it would be good to do a deep dive into how LLMs learn.

Let’s dive right in.

Part one: Training Large Language Models

To start, like any other machine learning model, LLMs learn from examples.

These examples are selected by humans based on their ability to teach the model something about the task or tasks that need to be automated.

For example, if a machine learning researcher is training a model that needs to generate text, he or she will feed the model text examples.

Researchers have worked on different combinations of inputs and outputs based on the success of early LLMs. As a result we now have models that

… can generate images from text. They are shown examples of text as input, and examples of images as output.
… can generate translations. They are shown examples of text in one language as an input, and a (human-) translated version of that same text as output.
… can decipher proteins. They are shown images of protein structures as input, and mapped-out components of these structures as output.

You get the picture.

The sum total of the examples shown to a model is called its “training data”.

People working on a model will tell it what to learn by configuring the prediction error rates that need to be reduced (in jargon: the “loss function”).

Let’s try to illustrate with an example.

Say you have a bakery you go to every day, and because you are a regular customer you know on a regular day croissants will run out by around 09:30 AM.

Then your “training data” is your earlier visits to the bakery, and the prediction problem is whether or not there will be croissants by the time you arrive at the bakery.

Through earlier visits you’ve established a baseline: your best bet of getting fresh croissants is by visiting the bakery before 09:30.

That doesn’t mean your predictions will always hold true. For example, it is very much possible a conference in town leads the bakery to sell out its entire stock of homemade pastries by 08:30 AM.

Machine learning models learn along similar lines. They are shown different examples of input (e.g. arrival times at the bakery) and outputs (e.g., fresh croissants, y/n?). They then use clever statistics tricks to find the configuration for its inner variables that is the best “fit” for the examples in the training data.

Prediction error rate reduction for a toy classifier. The model needs to find the best configuration to separate red dots from blue dots. The coloured areas are the variables learnt by the model after seeing the training data 0, 5, 10, 25, 50 and 100 times or “iterations” (code generated by Claude 3.5).

In the example above, after going over the training data 100 times, the toy model has learnt to make multiple cutoffs. The same as you learnt that it’s very likely there will be no more croissants after 09:30, the model has learnt several ranges of values in which dots are more likely to be either red or blue.

It has done so by minimising the prediction error, which can be seen from the increased accuracy of its predictions from iteration 0 to iteration 100.

LLMs are–in this sense–nothing different from other machine learning models.

Pre-training LLMs

They are however different in both the size of their training data and the number of model variables they can use to represent the data with. In the toy example above, there are 10∗10=10010∗10=100 different variables (or “model parameters”) that can be “learnt” by the model from the training data.

In 2024, LLMs have between 10 billion and one trillion model parameters (hence the moniker “large” language models–a “small” language model will have between 2 and 10 billion parameters).

Number of trainable model parameters for the models for which this data has been made publicly available. Commercial LLM providers (notably OpenAI and Anthropic) have stopped publishing this information when the generative AI hype started taking off (source: Epoch AI).

These large numbers of model parameters make sense when you look at the size of their training data. The most popular and widely used LLMs (e.g. GPT-4o, Claude 3.5, Llama 3 etc) are trained on what practically amounts to all the text on the internet.**

As an example, a model like Llama 3 with 70 billion parameters–the biggest model for which we currently have publicly available information–is trained on 1.5𝑒131.5e13 or fifteen trillion words!

The number of words in the training datasets of several well-known language models. 1e13 is ten trillion in English, according to ChatGPT (source: Epoch AI)

These models also cost a pretty penny to train. GPT-4 set OpenAI back USD 41M in compute alone. And this is just the compute-per-minute cost, which excludes the costs in personnel, research, engineering and dataset preparation that are also needed to train these beasts. Some internet sources estimate that developing Llama 3 set Meta back somewhere in the 1 to 2 billion USD range.

All these parameters and all this data are needed so that LLMs can learn the basics of human language***. When training LLMs, researchers have found that one approach that works well is to show the model the same sentence as both input and output while hiding one or more of the words in the output.

By learning to correctly “guess” the hidden words in the output sentence, LLMs are able predict the next word in a sequence to a very high degree of accuracy. This little trick is at the foundation of all recent advances in AI!

Let’s look at an example. I asked Claude 3.5 Sonnet to generate the code for a toy LLM, along with code to train it on Shakespeare’s corpus.

This is the code for the “transformer” neural network it generated. This neural network architecture is a lower-parameter and simplified version of the same architecture used in practically all state-of-the-art “GPT” LLMs:

class SimpleTransformer(nn.Module):
    def __init__(self, vocab_size, d_model=64, nhead=2, num_layers=2):
        super(SimpleTransformer, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoder = nn.Embedding(1000, d_model)
        encoder_layers = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=128, batch_first=True)
        self.transformer_encoder = nn.TransformerEncoder(encoder_layers, num_layers)
        self.fc_out = nn.Linear(d_model, vocab_size)

    def forward(self, src, return_activations=False):
        positions = torch.arange(0, src.size(1), device=src.device).unsqueeze(0).expand(src.size(0), -1)
        embedded = self.embedding(src) + self.pos_encoder(positions)
        encoder_output = self.transformer_encoder(embedded)
        output = self.fc_out(encoder_output)
        
        if return_activations:
            return output, embedded, encoder_output
        return output

The main neural network architecture used for LLMs today is the “transformer architecture”. 3blue1brown created a great video explaining how transformers work.

Before training the model, I asked it to generate text based on the following input:

To be, or not to be, that is the question:

This is what our SimpleTransformer model generated:

Qow,lLxRPJ'wQOImwAYOa-avDeI,a?x,xC
laBQU-,P,vFKWiH:KfJBqSgFQ&o
FhvOJKEsjBQPlDd&;nnn!twyjb!YMVjkzHJMnkcBOcmF$W'&jXacilcMFFTCk&Xwg;jHB'sw:aYYUjih'iJPiFUbBacs-FvyDv;$haMP!ZMx-HAzjdpfgK''Ak!bObmoj,3!xvLcw

Not very interesting, right? The fact that the model even generates anything at all is because its model parameters are initialised with random values before training. This is another trick that researchers have stumbled on that just works.

If we look at the neurons**** that were activated in the untrained model we see the same degree of randomness we saw in the output. They are all over the place:

Mapping the token inputs to the corresponding model parameters before training the model (the code to generate these visualisations was also generated with Claude 3.5 sonnet).

In this case, we’re training the model to complete sentences, so for an input like “To be, or not to be, that is the question:”, we’d show the model “Whether ’tis nobler in the mind to suffer” as output.

Now let’s look at the output after training the model for 10 iterations–after asking it to look at each sentence pair in Shakespeare’s corpus 10 times.

uqubtt ub, u ob,
nnbttobnnottinottototttin ntiaiatiiaiia unst ie ttonty,
osoiatoeoobibttiu,iril utttolybnottettttehootimt
intitoebuieiuiiioteiouiiinatiantoieisuianubeienltctirb'iiniitiuiuilt,ltiilbbii

Admittedly, Shakespeare probably said it better–but you can already see the model is starting to learn patterns from the English-language corpus it has been training on. There are no more random or uppercase characters, and it’s introduced spacing and commas in the generated text.

Let’s see what this looks like at the level of individual neurons–the parameters of the model that determine the output it will generate:

The token inputs now activate different neurons in the neural network.

It shows the emergence of the first patterns in the neurons, and a higher level of contrast in the activations than before–with the same input. If we were to run this for another forty iterations, we’d see the first syntactic patterns and English-language words emerge in the generated output.

The neat thing about the transformer architecture that researchers at OpenAI suggested back in 2020–which basically kickstarted this whole LLM craze–was that if you increase the number of parameters and the size of the input data enough, your LLM will actually start to generate high quality syntactically and semantically correct text!

Instruction tuning and supervised fine-tuning

But just having a working model of human languages isn’t enough. If you’ve ever interacted with an LLM that has only been “pre-trained”, you’ll know that its generations will often miss the point completely, the model won’t know when to stop generating text, and generations will very likely to devolve into complete gibberish at some point.

This is where instruction tuning and supervised fine-tuning (SFT) come in.

These is a set of techniques to teach LLMs how to respond to human input by showing them examples of text inputs and outputs in a conversational context.

Whereas during pre-training LLMs are shown raw text, instruction tuning data is often conversational in nature, since it needs to teach the LLM how to respond to human inputs. Think of data like question-answer pairs or movie scripts.

Similarly, SFT data is domain-or task-specific, since it needs to teach the LLM how to complete tasks in a certain context or domain (for example in a medical setting).

Training the model on this kind of data provides it with a baseline of human expectations–of the kind of responses humans expect, of how much text it should generate, as well as other domain- or context-specific information humans expect it to have access to for its generations.

A great example of an LLM that in my opinion has been fine-tuned very well is Claude-3.5 Sonnet. My guess is the Anthropic team spent a lot of time curating a high-quality instruction-tuning dataset. This has resulted in a model that produces much more useful generations than GPT-4o.

Since the type of data needed for instruction-tuning is much more rare and harder to come by than data for pre-training, the volumes of data used in this stage are also much smaller–in the tens of millions of examples, rather than the billions or trillions of examples of the internet-scale pre-training data.

Creating instruction and SFT datasets is also where a lot of the budget of LLM providers like Google, OpenAI, Anthropic and Meta is allocated. They often rely on people in low-income countries to manually curate these datasets for them.

Preference optimisation

A last step that has become a common practice is to teach LLMs our preferences for certain responses by using the feedback users provide. This can only be done after the model has been made available for public use, so data volumes here are often even lower than in the SFT or instruction tuning datasets. The one exception to this rule is OpenAI, because ChatGPT has hundreds of millions of active users (I’m ignoring Google since they have a bit of work to do getting their genAI teams sorted out).

The techniques used by LLMs to learn from human preferences rely on the fact that due to their stochastic nature, LLMs can generate multiple distinct outputs from the same human input.

However, in order to take advantage of this fact and teach the LLM which output users prefer, researchers have to first learn a model of these human preferences. As we have seen, LLMs by themselves are trained to predict words in sentences. They have no idea what individual humans might be interested in.

In fact, all the “knowledge” on what humans find interesting stored in their model parameters is a byproduct of them learning patterns in human language.

So in order to teach LLMs user preferences (“optimise” them, in jargon), we first need to be able to model user preferences. This is usually done with a technique called reinforcement learning, which learns what LLM generations among all the possible generations are preferred by users.

All their “knowledge” of us is a byproduct of LLMs learning patterns in human language.

Once a good model of human preferences has been learnt, it can be used to directly improve the LLM output by tweaking (“fine-tuning”) the layers of the LLM that determine the final output of the LLM.

The reward model learns to predict LLM outputs preferred by humans. It is then used to further improve (“fine-tune”) selected parameter layers of the LLM (image: HuggingFace).

… and beyond

Most LLMs used today are trained with one or more combinations of these three techniques. AI researchers are working on novel approaches such as self-play (where LLMs are learning by talking to each other or themselves), but the current generation of LLMs is trained using pre-training, supervised and / or instruction tuning, and preference optimisation methods.

These techniques map naturally to the datasets available–internet-scale raw text data for learning human languages, curated data for learning how to respond, and data generated from human interactions to learn which responses humans prefer.

The strange thing is that researchers today don’t really know how LLMs generate their outputs. There are two main issues. One is the size and complexity of these LLMs. That makes figuring out which of the tens of billions of parameters are reacting to inputs and shaping the outputs of LLMs a very hard task. Researchers at Anthropic have been making some interesting inroads using a technique called dictionary learning, which we’ll discuss in the next section.

LLM model training techniques map naturally to the datasets available–internet-scale raw text data for learning human languages, curated data for learning how to respond, and data generated by human interactions to learn which responses humans prefer.

The second issue is the empirical nature of AI research. A lot of the canonical techniques and tricks used to train LLMs have been discovered by researchers in AI labs around the world trying a bunch of different things and seeing which one would stick. In this sense, AI research is a lot closer to an engineering discipline than a lot of researchers and professors would have you believe. We’ll dive into the implications of this approach for the “AI revolution” in part three.

Part two: The Emergence… of Something?

One of the main questions AI researchers have been struggling with is how the neurons of LLMs–the learnt mathematical representations–map to semantic units in human language. In other words, how neurons in an artificial neural network map to concepts like “trees”, “birds”, and “polynomial equations”–concepts that neuroscientists have shown to have a biological basis in our neural substrates.

The main issue is that the same neuron in a neural network can activate for many different inputs–e.g. you’d see the same neuron fire whether the input is Burmese characters, math equations, or abstract Chinese nouns*****. This makes it pretty much impossible for us humans to interpret what is going on inside an LLM.

At Anthropic, they’ve tried to tackle this problem using a method called dictionary learning. The key idea driving this line of research is the hypothesis that the neural networks we end up with after training an LLM are actually compressed versions of higher-dimensional neural networks–that somewhere during training, neurons become “superimposed” onto each other.

A key feature of the “superposition hypothesis” is that neurons of LLMs will take on different semantic meaning depending on the input vector (image source: Anthropic, 2023).

This would mean that the neurons of LLMs are polysemantic–exactly the problem we were trying to solve! For the details of dictionary learning and the method they used to disentangle the semantic units in neural networks–its “features”–I highly recommend reading their well-written blogpost on this.

Just because it works doesn’t mean it’s understood (image: Anthropic, 2023)

I’m not a computer science major, so when I think of compression I think of something like g-zip. Ignoring for a moment that this (compression, not g-zip) is the foundation of all modern information theory, it’s very hard to see how a simple step like compressing a neural network can lead to the reasoning abilities we see in top-of-the-line LLMs.

The thing that is most astounding to me–which is mentioned in a side-note in the Anthropic write-up–is that this type of compression is known to occur only when neural networks are trained with a specific function to reduce prediction errors called “cross-entropy loss”:

where:

𝑁 is the number of sequences in the batch.
𝑇 is the length of the target sequence.
𝐶 is the number of classes (vocabulary size).
𝑦𝑖,𝑡,𝑐 is a binary indicator (0 or 1) if the target token at position 𝑡 in sequence i is class c.
𝑦^𝑖,𝑡,𝑐 is the predicted probability that the token at position 𝑡 in sequence i is class 𝑐.

This formula is used to quantify the prediction error rate of LLMs by providing a numerical value for the generated sequence to sequence mappings (the input and output examples used when training the LLM).

And somehow along the way we end up with technological artefacts that are able to reason through and solve problems at practically the same level as humans!

Nothing in the way humans use language suggests this has happened before. I’ve done a lot of research on the evolution of languages over time, and on how languages relate to knowledge systems, and can’t think of any historical process that would generate the same kind of cultural compression that training a neural network does. Even time itself doesn’t result in anything like this.

A projection of the scaling laws for transformer models(image: Leopold Aschenbrenner, Situational Awareness)

Part of my bewilderment stems from the fact that language as a means of communication has many flaws. It’s not a pure, exact or even particularly successful representation of human thoughts–of our internal states. States that also happen to be embodied in a central nervous system and biomolecular process that have taken 4 billion years to refine.

Somehow along the way we end up with technological artefacts that are able to reason through and solve problems at practically the same level as humans!

But somehow, LLMs trained on text–on technological artefacts produced in the technology that is language–are able to pick up on enough patterns to mimic human reasoning and problem-solving skills.

It’s still quite astounding to see LLMs reason through problems when I am building AI applications that leverage their reasoning capabilities.

One possible explanation I’ve read for the massive jumps in reasoning capabilities from GPT-2 to GPT-3.5 and beyond is that researchers started including source code data in the training datasets of LLMs. While this seems plausible, I haven’t come across any clear evidence that this is really what is happening.

I guess you could look at evolution as a form of compression, of iterating over traits in the same way an LLM iterates over the “features” found by Anthropic researchers. The main difference–and where the analogy breaks down–is that the traits that have been most successful in natural selection combine effectiveness to cope with a specific environment with adaptability to new environments.******

It is unclear at this point how well LLMs will work in agentic systems that need to do a lot of context switching, since this is an ongoing area of research in both industry and academia. My personal experience is that LLMs require a lot of guardrails to ensure they perform even reasonably well in any given context.

In this sense, compression is definitely not producing the same results as natural selection–LLMs miss the kind of information-seeking drive all living beings have.

Part three: Building World Models

What does all this mean for the future of AI? For one, to me at least it is very clear we haven’t yet “solved” AI, AGI, superintelligence, or whatever else you want to call it with our current set of machine learning methods.

Even though people like Leopold Aschenbrenner make a very convincing case the path towards superintelligence is scaling compute, I don’t think the only thing holding back vLLMs (very Large Language Models) from taking over the world is the sandbox in which they are deployed.

In other words, I don’t think it’s down to an engineering problem just yet.

People using vLLMs the right way are a different thing altogether, obviously.

I think we need some major innovations in algorithms and representation learning before we will have truly autonomous agents–”AI” in the sci-fi sense of the word.

In LLMs, as I hope has become clear from reading this blogpost, the information-seeking behaviour is an after-thought, bolted on by humans during preference optimisation like the guardrails that make GPT-4o refrain from generating racist, sexist and other reputationally damaging outputs.

In fact, most of the successful neural network solutions in the domain of computer games–where neural networks are allowed to act on their environments–have been combinations of neural networks and reinforcement learning. Large neural networks (like LLMs) learn to process and compress environmental data, and the reinforcement learning model then learns how to act on the environment using this compressed representation of the environment.

In all of these applications, it is the reinforcement learning agent that is driving the exploration, information seeking, and acting–and they are horribly inefficient.

I don’t think it’s down to an engineering problem just yet.

So how should we look at the rise of LLMs? Is this a moonshot like the Apollo program, as Leopold Aschenbrenner and many others in Silicon Valley would have us believe? Or is it something closer to the dot.com bubble–where there are real use- and business cases for the technology, but they will take a lot longer to realise and be a lot less transformative than AI marketing gurus would have us believe?

I think–but I could well be wrong–that a more fruitful way to look at LLMs is to view them through the lens of the technological breakthrough of a different era–that of the industrial revolution.

The main driving force of social, technological, and economic change in that period was the steam engine. The switch from biological to carbon-based energy sources enabled us to concentrate much more kinetic energy into much smaller containers, culminating in the automobiles, airplanes and spaceships breaking down physical distances for humanity today.

In the same way, LLMs could be seen as the steam engines of the information age, allowing us to switch our cultural evolution from one technology–language–to another–computing. The issue that we then run into is one voiced by numerous smart people around the world, namely what problem do they solve?

What is the modern-day equivalent of the kinetic energy the steam engine allowed us to leverage and control to a much bigger degree?

In my opinion, there is only one valid answer –human knowledge. And I think the place where LLMs will have the most leverage is in memory-intensive fields like scientific research, medicine, R&D and education. Replacing human memory with machine memory there will let us reach much further as a species, given the amount of information and knowledge we are producing on a daily basis. This is where these kind of technologies can truly become a force multiplier.

While there are some applications for LLMs in creative professions, I think those will be limited to the same role search engines play today. I do expect LVMs and VGMs to have a more significant impact, but more in the role lowering the barriers of entry for documenting and communicating human thoughts. After all, who wants to hear the machine version of a human experience?

This image has been making rounds on social media recently. Seems like a valid point to me. I’ve also written about this in a previous blogpost.

There is also a good case to be made for LLMs to automate or augment a lot of the knowledge work that is currently driving the information economy, allowing us to spend more time away from our devices–working on things that have more direct impact on our social, cultural and economic wellbeing. This would, in my mind at least, be a very positive outcome given that I believe none of us were brought into this world to stare at a computer screen 8+ hours a day.

Such a change would of course also result in a massive period of disruption–the biggest humanity has ever seen given the number of people currently roaming the earth (England had around 6 million inhabitants at the start of the industrial revolution in 1750, 16.7 in 1851, and 56 million today).

Either way, we’re not there yet from how I’ve seen LLMs perform in the day-to-day. I think we need further innovations in AI before computers can be trusted to act correctly and competently on your input.

Maybe the distant past is not that far away? (photo taken at TNW Amsterdam 2024)

Notes

*) Unless you live in Oz, that is. But they play loose and fast with their bipeds in more ways than one. In case you’re interested, here’s a complete rundown of the demographics and economics of Oz generated by Perplexity.ai.

**) This is done by scraping the most-visited websites of the internet. “Scraping” is the process of automatically downloading the contents of websites and using that data for your own purposes. For example, to train a large language model. If you’re interested, have a look at https://commoncrawl.org/–one of the most widely used datasets of this kind.

***) Or human languages, since most contemporary LLMs are multilingual.

****) ChatGPT (GPT-4o)’s definition of a neural network neuron is:

In the context of artificial neural networks, a neuron (often referred to as a node or unit) is a fundamental component that processes input data to produce an output. The concept is inspired by biological neurons, but it operates in a mathematically simplified and abstract manner.

*****) Hypothetical examples for the purposes of illustration.

******) Several AI research labs are working from this evolutionary angle to “breed” new LLMs by combining traits from existing LLMs. The most prominent of these is Sakana.ai in Japan.