LLMs linearly represent the accuracy of their next-token predictions
A quick experiment I did on Llama-3.2-1B:
Choose a layer, and let $\vec{x}_t$ be that layer's residual activation at token position $t$.
Let $l^{(\text{correct})}_t$ be the logit at position $t$ that was assigned to whatever turned out to be the correct next token (i.e. the correct token at position $t+1$).
Use least-squares to fit a linear map that takes $\vec{x}_t$ and predicts $l^{(\text{correct})}_{t-1}$.
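For concreteness, here is a minimal sketch of how such a probe could be run with `transformers` and `torch`. This is my reconstruction, not the exact code behind the numbers above: the layer index, the corpus placeholder, and the absence of a train/test split are all assumptions of mine.

```python
# Sketch: linear probe from residual activations x_t to the correct-token
# logit l^(correct)_{t-1}, for one layer of Llama-3.2-1B.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.2-1B"
LAYER = 8  # hypothetical choice; hidden_states[0] is the embeddings, so index L is the residual after block L

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float32)
model.eval()

texts = ["..."]  # placeholder corpus

X_list, y_list = [], []
with torch.no_grad():
    for text in texts:
        ids = tok(text, return_tensors="pt").input_ids
        out = model(ids, output_hidden_states=True)
        hidden = out.hidden_states[LAYER][0]  # (T, d): residual activations x_t
        logits = out.logits[0]                # (T, vocab)
        # l^(correct)_t: logit at position t for the token that actually appears at t+1
        correct_logit = logits[:-1].gather(-1, ids[0, 1:].unsqueeze(-1)).squeeze(-1)  # (T-1,)
        # pair x_t with l^(correct)_{t-1}, i.e. hidden[t] with correct_logit[t-1]
        X_list.append(hidden[1:])
        y_list.append(correct_logit)

X = torch.cat(X_list)  # (N, d)
y = torch.cat(y_list)  # (N,)

# least-squares linear probe with a bias term
X_aug = torch.cat([X, torch.ones(len(X), 1)], dim=1)
w = torch.linalg.lstsq(X_aug, y.unsqueeze(-1)).solution
pred = (X_aug @ w).squeeze(-1)
r2 = 1 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()
print(f"layer {LAYER}: R^2 = {r2.item():.3f}")
```

To reproduce the per-layer plot, you would loop this fit over every entry of `hidden_states` (and, ideally, report $R^2$ on held-out tokens rather than on the fitting set).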
Results for each layer are below. Layers in the second half of the model (except the last layer) have decent ($R^2 > 0.6$) linear representations of the correct logit.
I don’t find this result very surprising. I checked it because I thought it might explain what’s going on in this very cool recent paper by @Dmitrii Krasheninnikov, Richard Turner, and @David Scott Krueger (formerly: capybaralet). They finetune Llama-3.2-1B and show that the model linearly represents the order of appearance of finetuning data. If more recent data has lower loss, then perhaps my probing results explain theirs.
What don’t LLMs linearly represent?
I really thought I’d updated all the way towards “models linearly represent everything”. But somehow I was still surprised by Dmitrii’s training order result.