“Mode collapse” is a few things. First it was an observation about how early image-generating AIs often collapsed to producing just the modal output from their training distribution (something very common, like a house with a white picket fence and a tree in the garden). Then it was the observation that this effect seemed to occur extremely quickly when AIs were trained on AI-generated inputs. After that, it became the copium du jour of the AI-is-hitting-a-wall folks for a while, who thought that the AI industry would ouroboros itself out of existence (and that there was, therefore, no need to confront any of the issues that smarter-than-human AIs might bring up). And then it was forgotten, because it turns out you can train on AI-generated inputs just fine, if you know what you’re doing.
It’s also the reason why grant-making organisations have such strong inertia, why all of your favourite band’s songs sound the same after the third album, and why you should specialise even if there are no gains from trade.
The Image Generator
Imagine an image-generating AI, which gets something like this as input:
Original image: https://commons.wikimedia.org/wiki/File:Dog_Breeds.jpg
And suppose it’s being trained to fill in the blank section in the middle. Suppose it’s trained on a 50:50 mixture of golden retrievers and tabby cats. For any given image, it first needs to decide whether to try and draw a retriever or a cat, and second, how exactly the animal should look. The model has a limited amount of parameter space to spend, and it has to split that across three tasks:
1. Guess whether the missing animal is a dog or a cat
2. Produce an image of a dog
3. Produce an image of a cat
We’ll ignore task (1) for now, and think about how the model might split its capacity between tasks (2) and (3). If cats and dogs are equally easy to draw, and if the model gets diminishing returns on capacity in both categories, and if the categories are equally common, then we should expect it to spend an equal amount of capacity on both tasks. Again, since the classes are symmetric, we should expect the predictions coming from task (1) to be 50:50 dogs:cats.
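To make that concrete, here’s a minimal version of the allocation problem (the quality curve $f$ and the capacities $c_{\text{dog}}$, $c_{\text{cat}}$ are my notation, not anything from the post): if each class appears with weight $w$ and drawing quality is a concave function $f$ of allocated capacity, the model solves

$$\max_{c_{\text{dog}} + c_{\text{cat}} = C} \; w_{\text{dog}}\, f(c_{\text{dog}}) + w_{\text{cat}}\, f(c_{\text{cat}}) \quad\Longrightarrow\quad w_{\text{dog}}\, f'(c_{\text{dog}}) = w_{\text{cat}}\, f'(c_{\text{cat}}).$$

With equal weights and strictly diminishing returns, the only solution is an equal split, as claimed. (With 70:30 weights and, say, $f(c) = \sqrt{c}$, the same condition gives $c_{\text{dog}} : c_{\text{cat}} = 49 : 9$.)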
But what happens if we have a 70:30 ratio of dogs:cats? The first-order effect is that the model will allocate more resources to task (2) than task (3). That’s obvious, but there’s a second-order effect as well:
Imagine our model is in a situation where it thinks there’s a 50% chance the missing animal is a dog, and 50% chance it’s a cat (even after taking into account the fact that dogs are more common). It has the following options available to it: generate an excellent picture of a dog, or generate a mediocre picture of a cat. What should it do?
If it guesses the animal wrong, it’s screwed either way, but if it guesses right, it’s better off drawing a dog. Therefore the model should err on the side of drawing a dog! The distribution has been pushed towards the modal target.
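To put made-up numbers on that bet: suppose the model’s dog drawings would score $q_d = 0.9$ and its cat drawings $q_c = 0.6$ (it spent more capacity on dogs), and it only gets credit when it picks the right species. Committing to a dog is then worth $0.5 \times 0.9 = 0.45$ in expectation, versus $0.5 \times 0.6 = 0.30$ for a cat. In fact the break-even belief is $q_c/(q_d + q_c) = 0.4$, so any credence in “dog” above 40% tips the choice.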
If you train another generator on the outputs of the first, it will generate dogs even more often! Mode collapse.
This is a two-part process. Each individual step shifts the distribution slightly towards the mode, and each successive step locks in the shift of the previous step, and compounds it.
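Here’s a toy simulation of that two-part process. The quality curve, the belief noise, and all the numbers are invented for illustration, not taken from any real training setup:

```python
import random

def skill(capacity):
    # Made-up quality curve with diminishing returns.
    return capacity ** 0.5

def next_generation(p_dog, n=100_000):
    # Step 1: capacity follows the training mix, so the commoner
    # class gets drawn better.
    q_dog, q_cat = skill(p_dog), skill(1 - p_dog)
    # Step 2: the model commits to "dog" whenever its belief b
    # satisfies b * q_dog >= (1 - b) * q_cat, i.e. b >= threshold,
    # and that threshold sits below 50% when q_dog > q_cat.
    threshold = q_cat / (q_dog + q_cat)
    dogs = 0
    for _ in range(n):
        true_is_dog = random.random() < p_dog
        # Noisy per-image belief: right on average, but uncertain.
        b = (0.8 if true_is_dog else 0.2) + random.gauss(0, 0.2)
        dogs += b >= threshold
    return dogs / n

p = 0.7  # initial training mix: 70% dogs
for gen in range(6):
    print(f"generation {gen}: {p:.3f} dogs")
    p = next_generation(p)  # the next model trains on these outputs
```

Each generation’s output mix becomes the next generation’s training mix, so the drift towards dogs compounds instead of stabilising.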
This generalises to some other situations as well, if you want to read the footnotes.
Grantmakers
Now suppose you’re a philanthropist hiring a grant-maker to distribute your money. You get a 70:30 mix of global health to animal welfare proposals.
You, naturally, hire someone who is slightly better at evaluating global health projects than they are at evaluating animal welfare projects. Your employee then awards grant money in a 75:25 ratio, since she can more easily pick the best global health proposals than she can the animal welfare ones.
In your next grant-making call, you decide to expand. Your employee brings on two new hires, and takes them through her process of evaluating grants, based on last year’s grants. They are trained on data in a 75:25 ratio, and become significantly better at evaluating global health proposals than animal welfare ones.
Your Favourite Band
By now you see the pattern, right? The band writes its first album: 7 dreamy pop songs and 5 proggy rock songs. Since they’re better at writing the former, they go 9:3 on the next album and give up on rock entirely by the third.
But by the fifth album they’re bored of pop, and they decide to make something called “intelligent dance music” (which you’re not actually supposed to dance to, apparently)! The millions they made from touring the fourth album gave them enough time to take a break, and learn how to use [whatever weird gizmos electronic musicians use to make their music]. Slack to the rescue!
Division of Labour
People normally attribute the division of labour to the gains from trade. Trade is a part of it, but not all of it.
The ability to trade goods and services for money puts us in the same situation as the image generator. Remember when our model thought there was an equal chance (50%) of “dog” or “cat” being the correct image to generate? In a modern economy, you are equally likely to be rewarded for doing a day’s work as an accountant or as a personal trainer; it’s just that the likelihood is ~100% instead of 50%.
Your skill at being an accountant can be thought of as a generative model of “a good day’s accounting work”. Likewise for your skill at being a personal trainer. Once you’ve done a few days’ work as an accountant, you’re feeding off your own generations (kinda) and your model of “a good day’s accounting work” gets better and better, leaving your model of “a good day’s personal training work” in the dust.
Note that this still works without trade: suppose you’re on a deserted island. If it’s sunny, you go hunting, if it’s cloudy, you go fishing. Each time you do an activity, you get better and better at it. If it’s sunny 3⁄4 of the time, you’ll get much better at hunting than fishing. Before long, you’ll be hunting when it’s just a bit clouded over, then when it’s kinda cloudy with patches of sun, and eventually you’ll be hunting every day. At least until you kill all the local fauna. Then you’re screwed.
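If you want to watch the island version happen, here’s a toy loop (the payoffs and the 1%-per-day practice effect are invented for illustration):

```python
import random

# Deserted-island toy model: you pick whichever activity pays more
# today, and practice compounds your skill at it.
hunt_skill, fish_skill = 1.0, 1.0
hunt_days = 0
for day in range(1000):
    sunny = random.random() < 0.75            # sunny 3/4 of the time
    hunt_payoff = hunt_skill * (1.0 if sunny else 0.5)
    fish_payoff = fish_skill * (0.5 if sunny else 1.0)
    if hunt_payoff >= fish_payoff:
        hunt_skill *= 1.01                    # practice compounds
        hunt_days += 1
    else:
        fish_skill *= 1.01

print(f"hunted {hunt_days}/1000 days; "
      f"skills: hunt={hunt_skill:.1f}, fish={fish_skill:.1f}")
```

After a hundred-odd days the skill gap is wide enough that you hunt even on cloudy days, and your fishing skill stops growing entirely: the lock-in step from the image generator, wearing a loincloth.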
Slack
I think mode collapse, generalised, is a force which acts on basically all intelligent systems over time. But I don’t think it’s inevitable. If an external force changes the returns to investment (like the fauna population crash on the deserted island), you might be forced to change. More likely, the returns to investment get lower over time, producing some homeostatic effect (quarry gradually becomes rarer, you go fishing more often).
You can also fight mode collapse if you just don’t like it. You can just do (different) things! The band example showed this.
Unfortunately, you can only do this if you have access to slack. If you need to go hunting every day, because you’re now awful at fishing and there’s barely any quarry left, then you can’t take a day off hunting to learn to fish, and you’ll starve. If the record label demands a hit album in the next six months, you don’t have time to learn to make IDM.
Even evolution isn’t immune to this! Highly specialised animals (and plants) are common everywhere, like the caterpillar that can only mature inside the nests of red ants, or those orchids that are pollinated exclusively by a single species of solitary bee, and these specialists tend to go extinct at the slightest ecological sneeze.
So beware mode collapse: for you are not immune!
Suppose you have three classes: cats, dogs, and frogs, in a 1:1:1 ratio. Now suppose that some capacity can be shared between generating images of cats and dogs (e.g. generating fur textures) but not between either of those classes and frogs. The optimal strategy is to make use of that efficiency gain, and be slightly better at drawing cats and dogs than frogs, and then secondarily to err slightly on the side of the furry creatures, over the frogs.
Or, suppose cats are just intrinsically easier to draw than dogs. The optimal policy would then be to allocate slightly more resources towards dogs than cats, but not so much that you’re as good at drawing dogs as you are at drawing cats. Then you should err towards cats.
The overall pattern is: things which are dense in thing-space, or things which are simple and easy to generate, are favoured.
Hmm, isn’t the issue that you’re enthalpymaxxing instead of free-energymaxxing?
That’s one way of putting it, yeah: the band who want to explore a new sound, the hunter who gets sick of eating deer every day, and the LLM with an entropy term in its reward function are all of the same ilk.
Say more? I only know about enthalpy in the physical sense. What does it mean here, and how would switching to free energy change things?
I think J Bostock has a good explanation (see the other reply to my comment). I put some more context at the bottom of this comment.
In physics, systems tend to minimize the free energy $G = H - TS$, not the enthalpy $H$. In all other branches of mathematics (e.g. game theory, RL), they use the right sign convention (where energy is not negative), so you would say systems tend to maximize the free energy $G = H + TS$, not the enthalpy $H$.
If you are purely maximizing enthalpy, everything will go to the highest-enthalpy state. This is the mode collapse issue you see. But why is the system purely optimizing for enthalpy? Either the temperature is very low, or more likely there are hidden constraints and the enthalpies you see are not actually the enthalpies you get. For example: bias in your loss function, committees judging based on historical convention instead of merit, or fans pushing away fans of other genres.
If your issue is a low temperature, you can anneal to find better outcomes (basically, increase the temperature so entropy is more important leading to more exploration, then decrease the temperature; when you anneal a steel sword, the atoms are doing exactly this—finding better alignments with each other, which makes a more uniform crystal lattice, leading to a less brittle sword).
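A toy version of the temperature knob, assuming Boltzmann/softmax sampling over a few options (the “rewards” here are made up):

```python
import math

rewards = {"pop": 1.0, "prog": 0.8, "idm": 0.6}  # made-up enthalpies

def boltzmann(rewards, T):
    # p_i proportional to exp(r_i / T): low T collapses onto the
    # mode, high T spreads probability out for exploration.
    weights = {k: math.exp(r / T) for k, r in rewards.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

for T in (0.05, 0.5, 5.0):
    dist = boltzmann(rewards, T)
    print(f"T={T}: " + ", ".join(f"{k}={p:.2f}" for k, p in dist.items()))
```

Annealing is then just running this with a schedule: start hot to explore, cool down to lock in whatever good state you found.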
Once entropy is a consideration, you will still get exponentially more high-enthalpy states, but you get a bigger spread into other states, which prevents mode collapse.
Background
I wrote this a while ago, but never published it. I think it’s a good primer.
Definitions
Enthalpy (H): The kinetic and potential energy (ability to do work).
Entropy (S): The (logarithm of the) number of possible states.
Temperature (T): (One over) imaginary time.
Gibbs free energy (G): $TS - H$, the log-likelihood (logit) of encountering a particular kind of system (up to a factor of $1/T$).
Suppose we have a bunch of possible states with energies $E_1, E_2, \ldots$, and an atom (or molecule, or something bigger) is in state $i$ with probability $p_i$. If we have $N$ atoms, the number of possible states is

$$\Omega = \frac{N!}{\prod_i (p_i N)!} \approx \prod_i p_i^{-p_i N}$$

from Stirling’s approximation, up to a multiplicative constant. Taking a logarithm, we get

$$\log \Omega \approx O(\log N) - N \sum_i p_i \log p_i.$$

In real life, a kilogram of stuff has on the order of $10^{25}$ atoms, so the second term is going to be much, much larger. Then the entropy,

$$S = \frac{\log \Omega}{N} = -\sum_i p_i \log p_i,$$

is pretty much the log-number of states for each atom. If an atom is equally likely to be in any state, then we would expect atoms to exist in systems where many more states are available. This is pretty much where the second law of thermodynamics comes from: systems tend to end up in places with many more states, i.e. higher entropy. Of course, not every state is equally likely, since some take more energy to get into than others. Suppose we have an isolated system, so there is a fixed supply of energy to go around. If we want to maximize the entropy, under the condition

$$\sum_i p_i E_i = \bar{E}$$

for some fixed $\bar{E}$, then Lagrange multipliers gives

$$p_i \propto e^{-E_i/T}.$$
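(Spelling out that last step, with multipliers $\alpha$ and $\beta$ that are my notation, not the comment’s:

$$\mathcal{L} = -\sum_i p_i \log p_i - \alpha\Big(\sum_i p_i - 1\Big) - \beta\Big(\sum_i p_i E_i - \bar{E}\Big), \qquad \frac{\partial \mathcal{L}}{\partial p_i} = -\log p_i - 1 - \alpha - \beta E_i = 0,$$

so $p_i \propto e^{-\beta E_i}$, and identifying $\beta = 1/T$ gives the Boltzmann distribution above.)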
That exponential form is very reminiscent of Schrödinger’s equation,

$$i\hbar \frac{d\psi}{dt} = \hat{H}\psi, \qquad \text{so that} \qquad \psi_i(t) \propto e^{-iE_i t/\hbar},$$

where $\hat{H}$ (the Hamiltonian) is a matrix where $\hat{H}_{ii} = E_i$ and $\hat{H}_{ij}$ is the coupling (complex-valued transition rate) between states $i$ and $j$. For this reason, temperature is best thought of as inverse imaginary time:

$$\frac{1}{T} \longleftrightarrow \frac{it}{\hbar}.$$
However, we won’t always end up in the highest-entropy systems; we’re just more likely to, because there are more available states. How much more likely? It should be proportional to the number of available states, $e^{S}$. So, the probability of encountering any given system is proportional to

$$e^{S} e^{-H/T} = e^{(TS - H)/T}.$$

That term in the numerator, $TS - H$, is known as the Gibbs free energy. We tend to end up with systems that maximize this free energy, not enthalpy or entropy. If a system does not, there are three possibilities:
1. We need to let it run a little longer.
2. It moves through time with a phase shift, $e^{i\theta}$, rather than real-valued time (ETA: e.g. in macroeconomics, what this unfinished post was going to be about).
3. We’re missing a constraint. Perhaps we’re shining a laser at the atoms so the transition to a higher energy state is subsidized, or perhaps there is a filter that blocks larger molecules from one half of the experiment.
If there are rewards associated with actions, and you choose a distribution over the action space, then the enthalpy is your expected reward and your entropy is the entropy of the distribution. The free energy depends on a choice of a temperature $T$, and is (up to a constant shift up or down) $\langle r \rangle + TS$. Maximizing the free energy ensures you do some exploration of suboptimal choices; the equation is that if $r_i$ is the reward for action $i$, then your distribution should be $p_i \propto e^{r_i/T}$, which can be derived using Lagrange multipliers.
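As a quick numerical sanity check on that claim (the rewards and temperature below are made up), no distribution sampled from the simplex should beat the softmax on free energy:

```python
import math
import random

rewards = [1.0, 0.3, -0.5]  # made-up action rewards
T = 0.7

def free_energy(p):
    # Expected reward plus T times the entropy of the distribution.
    avg_r = sum(pi * ri for pi, ri in zip(p, rewards))
    entropy = -sum(pi * math.log(pi) for pi in p if pi > 0)
    return avg_r + T * entropy

# The claimed maximiser: p_i proportional to exp(r_i / T).
z = sum(math.exp(r / T) for r in rewards)
p_star = [math.exp(r / T) / z for r in rewards]

# Compare against 10,000 random distributions on the simplex.
best = -math.inf
for _ in range(10_000):
    ws = [random.expovariate(1.0) for _ in rewards]
    p = [w / sum(ws) for w in ws]
    best = max(best, free_energy(p))

print(f"softmax: {free_energy(p_star):.4f}  best random: {best:.4f}")
```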
I don’t think this is sufficient to avoid mode collapse, however. LLMs experience mode collapse when they are trained on their own output, even though they are using this very same equation for the distribution of next-tokens.
Maybe there is some way to reward exploration explicitly, so that less-frequent tokens have a boost in visibility, but that sounds like a research question, not established fact.
Edit: I had a sign error in the original version of this comment; the signs above are the corrected ones.
Maybe upstream of attribute substitution? Get better and better at answering certain types of queries, so you round off other queries to the ones you are already good at.
And that rhymes with professional deformation.
(Of course this conflict among housemates is a systems engineering problem; everything is a systems engineering problem. Now let’s consider the flow of Amazon boxes through the house, because there’s a bottleneck somewhere causing a space leak...)
The logic here about specialization is very similar to some of the logic that showed up in a subsection of the Internal Information Cascade article and I think it might be the case that mode collapse is simply another name for “that process” scoped specifically to the field of generative AI where it is deemed a bug… Maybe? For this reason I added the Information Cascade tag to this article <3
Practically, mode collapse seems like a bad thing by itself if (a) the underlying reality shifts, or (b) your beliefs were incorrect in the first place. An example of (a) would be animal welfare rising to 80% of the proposals after your mode has collapsed. An example of (b) would be an image model that “didn’t know” that diversity of outputs is in itself a value.
(b) doesn’t seem as bad for humans, because if we are investigating our beliefs, and find out some of our previously held convictions were wrong, we can try to trace back what decisions those informed, and break out of harmful mode collapse.
(a) is worse, because mode collapse deprives us of the signal on the distribution shift itself, making it hard to detect if it happened.
The good news is, solving (a) doesn’t require periodically taking random walks to balance out this exploration/exploitation dilemma. I would wager that in most situations taking explicit action to check for distribution shift is cheaper and more efficient. Coming back to the grantmaker example, periodically checking the true market distribution of grant proposals between global health and animal welfare is presumably cheaper than randomly trying out hiring people who are really good at evaluating animal welfare.
PS: This is ignoring effects that your mode collapse has on the market of grant proposals itself, which is unrealistic. That is why I start with “Practically, mode collapse seems like a bad thing by itself”.
I think tying mode collapse to work is misleading, and slack is just a proxy for exploration and diversity.
Can you say more about why you think this?
It is possible to construct examples of people who are mode-collapsed with maximal slack, and people who aren’t mode-collapsed despite always being strapped for time.
Are you only talking about time slack, or are you including other things? I would be curious if you have an example that isn’t immediately vulnerable to “the slack deprivation centers on a different taut constraint” reframings.
It is only vulnerable if, by your definition, “slack deprivation” can be caused by giving people more freedom, which undermines the implicit goal of the ending.