To be clear, this is done independently for each head and each layer: for a given attention head in a given layer, we compute the vertical attention score for each sentence. These sentence-by-sentence attention scores form a vector for that head.
We then compute the kurtosis of that vector, and this kurtosis is our measure of the head’s “receiver-headness”. We use kurtosis because it is the standard measure of tailedness. From Wikipedia:
In this context, high tailedness means that attention is narrowed to a few sentences. For example, if you had 100 sentences and 99 of them received zero attention while one received lots of attention, the kurtosis would be very high. This is exactly what we want to measure: how much a given attention head narrows attention to particular sentences.
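To make this concrete, here is a small sketch (our own illustration, not code from the paper) showing that a spiky per-sentence attention vector has a far higher excess kurtosis than a diffuse one:

```python
import numpy as np

def excess_kurtosis(x):
    """Fisher (excess) kurtosis: fourth standardized moment minus 3."""
    x = np.asarray(x, dtype=float)
    m = x.mean()
    m2 = ((x - m) ** 2).mean()  # variance (biased estimator)
    m4 = ((x - m) ** 4).mean()  # fourth central moment
    return m4 / m2**2 - 3.0

# Hypothetical vertical attention scores over 100 sentences for one head.
spiky = np.zeros(100)
spiky[0] = 1.0                         # all attention on a single sentence

rng = np.random.default_rng(0)
diffuse = rng.dirichlet(np.ones(100))  # attention spread across sentences

print(excess_kurtosis(spiky))    # very high (~95): strong narrowing
print(excess_kurtosis(diffuse))  # much lower: little narrowing
```

The spiky vector matches the 99-zeros-one-spike example above; its excess kurtosis equals that of a Bernoulli(0.01) distribution, about 95, while a roughly uniform spread of attention yields a much smaller value.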
In Figure 4 of the paper, see the distribution for head 6 of layer 36, which is spiky; that distribution has a high kurtosis, whereas the non-spiky distributions have lower kurtoses.
How did you learn that vertical attention corresponded to sentences?
This was an assumption baked into the analysis, which specifically defined vertical attention scores as attention toward a sentence. We had some results showing that token-level vertical attention tended to be more similar to other vertical attention scores within a sentence than between sentences, which supports this assumption, but we don’t have more formal results to report. Even without such results, working at the sentence level lets us run analyses contrasting categories, which wouldn’t be possible with tokens.