Quick Thoughts on Language Models

Epistemic status: Somewhere between Exploratory and My Best Guess. Plenty of other people have written about similar or related ideas, and about alternative or conflicting views.

Epistemic effort: Ideas developed slowly over about 3 years, via learning ML, playing with language models, reading work by others, and discussing language model capabilities with various people. I’ve had a few recent discussions particularly related to the ideas below. About 1 hour of writing and editing this before posting.

Thoughtful feedback on this post is very welcome!

  • Language models trained on next-token prediction learn to model text-generating processes. The data that they are trained on—mainly massive internet text corpora—encode a lot of information about the causal structure of the world. (A toy sketch of the next-token prediction objective appears below this bullet.)

    • For instance, there is a lot of text online about objects falling when people let go of them, which allows language models to learn the basics of how gravity works. Further, there is text about what happens when things are dropped on the Moon or the ISS, and about Newtonian mechanics, and about general relativity, etc., so language models can learn how gravity works in a much deeper and more general way than that too.

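For concreteness, here is a toy sketch of what the next-token prediction objective looks like. Everything in it is illustrative: the uniform “model” and the one-sentence corpus are stand-ins, not any real system. The point is only that training minimizes the average negative log-probability (cross-entropy) that the model assigns to each token that actually comes next.

```python
import math

def toy_model_prob(context, next_token, vocab):
    """Stand-in for a real language model: assigns a uniform probability
    to every token in the vocabulary, regardless of context."""
    return 1.0 / len(vocab)

def next_token_loss(tokens, vocab):
    """Average cross-entropy (negative log-probability) of each true next
    token given the tokens before it: the quantity that next-token
    prediction training pushes down."""
    losses = [
        -math.log(toy_model_prob(tokens[:i], tokens[i], vocab))
        for i in range(1, len(tokens))
    ]
    return sum(losses) / len(losses)

corpus = "when you let go of an object it falls".split()
print(next_token_loss(corpus, vocab=set(corpus)))
```

A real language model conditions heavily on the context instead of ignoring it; doing that well over internet-sized corpora is what forces it to absorb regularities like objects falling when released.
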
  • Since the processes that generated the text on the Internet are so diverse, language models have the potential to model a large amount of the causal structure of the world.

  • The limit of performance at next-token prediction (that is, at modeling text-generating processes) is far superhuman. The argument for this that I found most convincing is the following passage from GPTs are Predictors, not Imitators by Eliezer Yudkowsky:

Imagine somebody telling you to make up random words, and you say, “Morvelkainen bloombla ringa mongo.”

Imagine a mind of a level—where, to be clear, I’m not saying GPTs are at this level yet—

Imagine a Mind of a level where it can hear you say ‘morvelkainen blaambla ringa’, and maybe also read your entire social media history, and then manage to assign 20% probability that your next utterance is ‘mongo’.

The fact that this Mind could double as a really good actor playing your character, does not mean They are only exactly as smart as you.

When you’re trying to be human-equivalent at writing text, you can just make up whatever output, and it’s now a human output because you’re human and you chose to output that.

GPT-4 is being asked to predict all that stuff you’re making up. It doesn’t get to make up whatever. It is being asked to model what you were thinking—the thoughts in your mind whose shadow is your text output—so as to assign as much probability as possible to your true next word.

  • The above example only shows that the limit of performance on next-token prediction is far superhuman; it is not clear that scaling current methods will achieve anything close to this limit behavior.

  • There is a spectrum of things that could be going on under the hood in language models. You could consider the most capable end of this spectrum to be “modeling the entire causal structure of the world.” On the lower end, you might find highly reductive views like “LLMs are just stochastic parrots that repeat back things they’ve seen humans say on the internet.” (In fact, there are probably many dimensions along which we are unsure what language models are doing; I am just highlighting that I find this dimension of “amount of causal structure learned” useful.)

  • Then we should ask ourselves: where on this spectrum are current systems? Where do we expect near-future systems to be?

    • Language models definitely sometimes implement algorithms internally to perform tasks, rather than memorizing answers. A concrete example is the algorithm discovered in this paper for how GPT-2 small performs indirect object identification (e.g., completing “When Mary and John went to the store, John gave a drink to” with “Mary”).

    • I’ve been unable to get GPT-4 to play tic-tac-toe well, and I know at least some other people have also tried and failed. This was surprising to me: it feels like the “causal structure” of tic-tac-toe is very simple, and GPT-4 can talk about strategies for tic-tac-toe, yet when it actually plays it fails to keep track of what is going on and almost always loses, even when it is told to play carefully and reason out loud. (A minimal sketch of this kind of test appears after this group of sub-bullets.)

      • Of course, it could easily be finetuned to play tic-tac-toe well, but I think this can be interpreted as one data point for where the model lies on the “learned causal structure” spectrum, and it points in the direction of “not a lot.”

    • That tic-tac-toe data point should not be taken as the whole picture by any means. GPT-4’s ability on coding tasks that require generalization from the code that is on the internet is very impressive (at least in my opinion and the opinion of many others).

    • I’d like to have a larger collection of important examples that help round out this picture of how much causal structure language models have learned. I have intuitions built up from playing around with LLMs (mainly ChatGPT 3.5 and 4) and from reading about behaviors other people have elicited, but I would like to distill them into a deeper understanding based on a clear list of examples that I feel are safe to interpret in specific ways.

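As a concrete illustration of the tic-tac-toe probe mentioned above, here is a minimal sketch of the kind of harness I have in mind. The ask_model function is a placeholder for whatever chat API you use (here it just returns a random cell so the script runs offline); everything else only tracks the board, checks that the model’s moves are legal, and plays a deliberately weak opponent.

```python
import random

def ask_model(prompt: str) -> str:
    """Placeholder for a real chat-model call. Replace the body with a request
    to your API of choice; here it returns a random cell so the script runs
    without any API access."""
    return str(random.randint(0, 8))

def winner(board):
    """Return 'X' or 'O' if either player has three in a row, else None."""
    lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]
    for a, b, c in lines:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def play_game():
    """The model plays X against a trivially weak opponent that always takes
    the first empty cell. Any illegal or unparsable move ends the game."""
    board = [" "] * 9
    for turn in range(9):
        mark = "X" if turn % 2 == 0 else "O"
        if mark == "X":
            prompt = (
                "We are playing tic-tac-toe on cells numbered 0-8. "
                f"Board (index = cell): {board}. You are X. "
                "Reply with only the number of an empty cell."
            )
            reply = ask_model(prompt).strip()
            if not reply.isdigit() or not (0 <= int(reply) <= 8) or board[int(reply)] != " ":
                return "model made an illegal or unparsable move"
            move = int(reply)
        else:
            move = board.index(" ")
        board[move] = mark
        if winner(board):
            return f"{mark} wins"
    return "draw"

print(play_game())
```

The interesting measurements are how often the model proposes illegal moves and how often it loses games it could easily have drawn.
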
  • Overall, I’m still quite uncertain about how much causal structure is being learned, but I tentatively think that it is a small enough amount that further algorithmic improvements and paradigm shifts are required before we achieve AGI. (This view of mine fluctuates rather often though.)

  • Even imitating humans can be very powerful. The question of how influential LLM-based systems can be if they are roughly human imitators is also one I find interesting. But I think this is probably quite different from AGI and superintelligence, and comes with considerably less X-risk.

  • Even if language models don’t get us AGI, we could have AGI pretty soon. I fully expect people to continue working on improving algorithms, and to continue to make progress.

  • Even if AGI is pretty far away, we should be working on safety and alignment now.

    • This does affect the kind of work that seems most useful, though: things that are too heavily based on current systems look worse, and things that are robust to paradigm shifts look better.
