Language models can generate text superior to their input

There’s a frequent misconception that a large language model can never achieve superhuman text generation ability because such models try to produce text that is maximally unsurprising. This article explains why that assumption is wrong.

In 1906, Sir Francis Galton ran an experiment at a country fair, where fair-goers entered a weight-judging competition by guessing the weight of an ox. The median of the 787 guesses was 1,207 pounds; the actual weight of the ox was 1,198 pounds. In general, the error of such a guess is a combination of systematic bias and random noise. The fair-goers were knowledgeable about oxen, so their guesses carried essentially no systematic bias, and the error was almost entirely random noise. By aggregating the 787 guesses, Galton averaged out the random noise of the individual guesses.
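
To make the averaging effect concrete, here is a minimal simulation sketch in Python. The Gaussian noise and its 75-pound spread are assumptions for illustration, not a model of Galton's actual data:

```python
import random
import statistics

random.seed(0)

TRUE_WEIGHT = 1198   # actual weight of the ox in pounds
NOISE_SIGMA = 75     # assumed spread of individual errors (illustrative)

# Unbiased guesses: the true weight plus random noise, no systematic bias.
guesses = [TRUE_WEIGHT + random.gauss(0, NOISE_SIGMA) for _ in range(787)]

crowd_estimate = statistics.median(guesses)
avg_individual_error = statistics.mean(abs(g - TRUE_WEIGHT) for g in guesses)

print(f"crowd estimate:       {crowd_estimate:.0f} lb")
print(f"crowd error:          {abs(crowd_estimate - TRUE_WEIGHT):.0f} lb")
print(f"avg individual error: {avg_individual_error:.0f} lb")
```

With unbiased guesses, the crowd's error comes out at a small fraction of the typical individual error, and it shrinks further as the crowd grows.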

This phenomenon came to be known as the wisdom of the crowd: in areas where reasoning errors are mostly random noise, the crowd is smarter than its individual members. By training on large data sets, large language models gain access to the wisdom of the crowd. The ceiling of a large language model's ability is therefore the wisdom of the crowd, not the wisdom of individual members of the crowd.

The fact that each word of a text is highly unsurprising given the preceding words does not imply that the text as a whole is unsurprising. For any text, you can calculate for every word the likelihood (Ltext) that it follows the preceding words of the text. You can also calculate the likelihood (Lideal) of the most likely word that could have followed the same preceding text.
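
A toy bigram model makes the two quantities concrete. This is only a sketch: the corpus and the sample text are made up, and a real large language model conditions on far more context than the previous word:

```python
from collections import Counter, defaultdict

# Stand-in for a real model's training data.
corpus = "the cat sat on the mat and the cat sat on the rug".split()

# Bigram model: counts[prev][word] approximates P(word | prev).
counts = defaultdict(Counter)
for prev, word in zip(corpus, corpus[1:]):
    counts[prev][word] += 1

text = "the cat sat on the rug".split()
for prev, word in zip(text, text[1:]):
    total = sum(counts[prev].values())
    l_text = counts[prev][word] / total                  # likelihood of the actual word
    l_ideal = counts[prev].most_common(1)[0][1] / total  # likelihood of the likeliest word
    print(f"{prev} -> {word}: Ltext={l_text:.2f}, Lideal={l_ideal:.2f}")
```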

The difference Lideal − Ltext is noise. For a given text, you can calculate the average of this noise over all of its words. A well-trained large language model can produce texts with far less noise than the average text in its training corpus.
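
Continuing the bigram sketch above (it reuses counts from the previous snippet), the average noise of a human-written text can be compared with that of a greedily decoded text. Greedy decoding picks the likeliest next word at every step, so for its output Ltext equals Lideal and the noise is zero by construction:

```python
def avg_noise(text):
    """Average of Lideal - Ltext over the words of a text."""
    diffs = []
    for prev, word in zip(text, text[1:]):
        total = sum(counts[prev].values())
        l_text = counts[prev][word] / total
        l_ideal = counts[prev].most_common(1)[0][1] / total
        diffs.append(l_ideal - l_text)
    return sum(diffs) / len(diffs)

def greedy(start, length):
    """Generate by always emitting the likeliest next word.

    Assumes every word in this toy corpus has at least one follower.
    """
    out = [start]
    for _ in range(length):
        out.append(counts[out[-1]].most_common(1)[0][0])
    return out

human_text = "the cat sat on the rug".split()
model_text = greedy("the", 5)

print(avg_noise(human_text))  # 0.05: the human-written text carries some noise
print(avg_noise(model_text))  # 0.0: greedy output matches the likeliest word every time
```

Greedy decoding is an idealization; real models sample, so their output has nonzero noise, but a well-trained model still keeps it below the corpus average, which is the point of the argument.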

For further reading, Daniel Kahneman, Olivier Sibony, and Cass Sunstein's Noise: A Flaw in Human Judgment goes into more detail on how a machine learning model can eliminate noise and thus make better decisions than the average of its training data.