Large language models learn to represent the world

There’s a nice recent paper whose authors did the following:

  1. train a small GPT model on lists of moves from Othello games;

  2. verify that it seems to have learned (in some sense) to play Othello, at least to the extent of almost always making legal moves;

  3. use “probes” (regressors whose inputs are internal activations in the network, trained to output things you want to know whether the network “knows”) to see that the board state is represented inside the network activations (a minimal code sketch of such a probe appears just after this list);

  4. use interventions to verify that this board state is actually being used to decide moves: take a position in which certain moves are legal, use gradient descent to find changes to the internal activations that make the probes’ output look like a slightly different position, and then verify that when the network is run with those tweaked activations it predicts moves that are legal in the modified position (also sketched in code below).
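
To make item 3 a bit more concrete, here is a minimal sketch of what a 2-layer MLP probe might look like, written in PyTorch (which I believe is what the authors use, though the shapes, names, and training details here are my own guesses rather than theirs): it takes a hidden activation vector from one layer and predicts, for each of the 64 squares, whether it is empty, black, or white. The GPT itself stays frozen; only the probe is trained. Swapping the MLP for a single nn.Linear gives the linear-probe variant mentioned further down.

```python
import torch
import torch.nn as nn

class BoardProbe(nn.Module):
    """2-layer MLP probe: hidden activation -> per-square board state.

    Shapes are assumptions: 512-dim activations (as reported for their model),
    64 squares, 3 possible states per square (empty / black / white).
    """
    def __init__(self, d_model=512, n_squares=64, n_states=3, d_hidden=256):
        super().__init__()
        self.n_squares, self.n_states = n_squares, n_states
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, n_squares * n_states),
        )

    def forward(self, acts):                      # acts: (batch, d_model)
        logits = self.mlp(acts)                   # (batch, 64 * 3)
        return logits.view(-1, self.n_squares, self.n_states)

def train_probe(probe, acts, boards, epochs=10, lr=1e-3):
    """acts: cached activations (N, d_model); boards: ground-truth states (N, 64) in {0, 1, 2}."""
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        logits = probe(acts)                                        # (N, 64, 3)
        loss = loss_fn(logits.reshape(-1, 3), boards.reshape(-1))   # per-square classification
        opt.zero_grad()
        loss.backward()
        opt.step()
    return probe
```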

In other words, it seems that their token-predicting model has built itself what amounts to an internal model of the Othello board’s state, which it is using to decide what moves to predict.
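
And to make the intervention step (item 4) concrete: roughly, you freeze the model and the probe, treat one layer’s activation as a free variable, and run gradient descent on it until the probe reports the modified board, then splice the result back into the forward pass. Here is a sketch under the same assumptions as above; the real procedure has more moving parts (I believe they intervene at several layers, not just one).

```python
import torch

def intervene(activation, probe, target_board, steps=100, lr=1e-2):
    """Nudge one layer's activation so that the probe reads off `target_board`.

    activation: (d_model,) hidden state at the chosen layer and position.
    target_board: (64,) desired square states in {0, 1, 2}, dtype long.
    Returns a modified activation to splice back into the forward pass.
    """
    x = activation.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)      # optimize the activation, not the probe weights
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(steps):
        logits = probe(x.unsqueeze(0)).squeeze(0)   # (64, 3)
        loss = loss_fn(logits, target_board)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return x.detach()
```

You would then run the model forward, substitute this tensor for the original activation at that layer (e.g. via a forward hook), let the later layers run as usual, and check whether the predicted moves are legal in the modified position rather than the original one.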

The paper is “Emergent world representations: Exploring a sequence model trained on a synthetic task” by Kenneth Li, Aspen Hopkins, David Bau, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg; you can find it at https://arxiv.org/abs/2210.13382.

There is a nice expository blog post by Kenneth Li at https://thegradient.pub/othello/.

Some details that seem possibly-relevant:

  • Their network has a 60-word input vocabulary (four of the 64 squares are filled when the game starts and can never be played in), 8 layers, an 8-head attention mechanism, and a 512-dimensional hidden space. (I don’t know enough about transformers to know whether this in fact tells you everything important about the structure; a generic sketch of how these numbers slot into a standard transformer appears after this list.)

  • They tried training on two datasets, one of real high-level Othello games (about 140k games) and one of synthetic games where all moves are random (about 20M games). Their model trained on synthetic games predicted legal moves 99.99% of the time, but the one trained on real well-played games only predicted legal moves about 95% of the time. (This suggests that their network isn’t really big enough to capture legality and good strategy at the same time, I guess?)

  • They got some evidence that their network isn’t just memorizing game transcripts by training it on a 20M-game synthetic dataset where one of the four possible initial moves is never played. It still predicted legal moves 99.98% of the time when tested on the full range of legal positions. (I don’t know what fraction of legal positions are reachable with the first move not having been C4; it will be more than 3/4 since there are transpositions. I doubt it’s close to 99.98%, though, so it seems like the model is doing pretty well at finding legal moves in positions it hasn’t seen.)

  • Using probes whose output is a linear function of the network activations doesn’t do a good job of reconstructing the board state (error rate is ~25%, barely better than attempting the same thing from a randomly initialized network), but training 2-layer MLPs to do it gets the error rate down to ~5% for the network trained on synthetic games and ~12% for the one trained on championship games, whereas it doesn’t help at all for the randomly initialized network. (This suggests that whatever “world representation” the thing has learned isn’t simply a matter of having an “E3 neuron” or whatever.)
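
(For concreteness, here is how the numbers in the first bullet above slot into a generic decoder-only transformer, written with stock PyTorch modules. This is an illustrative stand-in of my own, not the authors’ code, and anything not listed in that bullet, such as the context length and feed-forward width, is a guess.)

```python
import torch
import torch.nn as nn

# Numbers from the first bullet above; everything else is a guessed default.
VOCAB_SIZE = 60     # 64 squares minus the 4 occupied from the start
N_LAYER    = 8
N_HEAD     = 8
D_MODEL    = 512
BLOCK_SIZE = 59     # guess: a full game has at most 60 moves, so up to 59 tokens of context

class TinyOthelloGPT(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.pos_emb = nn.Parameter(torch.zeros(1, BLOCK_SIZE, D_MODEL))
        layer = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=N_HEAD, dim_feedforward=4 * D_MODEL,
            batch_first=True, norm_first=True,
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=N_LAYER)
        self.head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, tokens):                    # tokens: (batch, seq) of move indices
        _, t = tokens.shape
        x = self.tok_emb(tokens) + self.pos_emb[:, :t]
        # Causal mask so each position only attends to earlier moves.
        causal = torch.triu(
            torch.full((t, t), float("-inf"), device=tokens.device), diagonal=1
        )
        x = self.blocks(x, mask=causal)
        return self.head(x)                       # logits over the next move, per position
```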

I am not at all an expert on neural network interpretability, and I don’t know to what extent their findings really justify calling what they’ve found a “world model” and saying that it’s used to make move predictions. In particular, I can’t refute the following argument:

“In most positions, just knowing what moves are legal is enough to give you a good idea of most of the board state. Anything capable of determining which moves are legal will therefore have a state from which the board state is somewhat reconstructible. This work really doesn’t tell us much beyond what the model’s ability to play legal moves already told us. If the probes are doing something close to ‘reconstruct board state from legal moves’, then the interventions amount to ‘change the legal moves in a way that matches those available in the modified position’, which of course will make the model predict the moves that are available in the modified position.”

(It would be interesting to know whether their probes are more effective at reconstructing the board state in positions where the board state is closer to being determined by the legal moves. Though that seems like it would be hard to distinguish from “the model just works better earlier in the game”, which I suspect it does.)