LLMs linearly represent the accuracy of their next-token predictions
A quick experiment I did on Llama-3.2-1B:
Choose a layer, and let $\vec{x}_t$ be that layer's residual activation at token position $t$.
Let $l^{(\text{correct})}_t$ be the logit at position $t$ that was assigned to whatever turned out to be the correct next token (i.e. the correct token at position $t+1$).
Use least-squares to fit a linear map that takes $\vec{x}_t$ and predicts $l^{(\text{correct})}_{t-1}$.
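For concreteness, here is a minimal sketch of how such a probe could be run with `transformers` and `torch`. This is my reconstruction, not the exact code behind the numbers above: the layer index, the corpus placeholder, and the absence of a train/test split are all assumptions of mine.

```python
# Sketch: linear probe from residual activations x_t to the correct-token
# logit l^(correct)_{t-1}, for one layer of Llama-3.2-1B.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.2-1B"
LAYER = 8  # hypothetical choice; hidden_states[0] is the embeddings, so index L is the residual after block L

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float32)
model.eval()

texts = ["..."]  # placeholder corpus

X_list, y_list = [], []
with torch.no_grad():
    for text in texts:
        ids = tok(text, return_tensors="pt").input_ids
        out = model(ids, output_hidden_states=True)
        hidden = out.hidden_states[LAYER][0]  # (T, d): residual activations x_t
        logits = out.logits[0]                # (T, vocab)
        # l^(correct)_t: logit at position t for the token that actually appears at t+1
        correct_logit = logits[:-1].gather(-1, ids[0, 1:].unsqueeze(-1)).squeeze(-1)  # (T-1,)
        # pair x_t with l^(correct)_{t-1}, i.e. hidden[t] with correct_logit[t-1]
        X_list.append(hidden[1:])
        y_list.append(correct_logit)

X = torch.cat(X_list)  # (N, d)
y = torch.cat(y_list)  # (N,)

# least-squares linear probe with a bias term
X_aug = torch.cat([X, torch.ones(len(X), 1)], dim=1)
w = torch.linalg.lstsq(X_aug, y.unsqueeze(-1)).solution
pred = (X_aug @ w).squeeze(-1)
r2 = 1 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print(f"layer {LAYER}: R^2 = {r2.item():.3f}")
```

To reproduce the per-layer plot, you would loop this fit over every entry of `hidden_states` (and, ideally, report $R^2$ on held-out tokens rather than on the fitting set).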
Results for each layer are below. Layers in the second half of the model (except the last layer) have decent ($R^2 > 0.6$) linear representations of the correct logit.
I don’t find this result very surprising. I checked it because I thought it might explain what’s going on in this very cool recent paper by @Dmitrii Krasheninnikov, Richard Turner, and @David Scott Krueger (formerly: capybaralet). They finetune Llama-3.2-1B and show that the model linearly represents the order of appearance of finetuning data. If more recent data has lower loss, then perhaps my probing results explain theirs.
What don’t LLMs linearly represent?
I really thought I’d updated all the way towards “models linearly represent everything”. But somehow I was still surprised by Dmitrii’s training order result.