To be clear, this is done independently for each head and each layer: for a given attention head in a given layer, we compute the vertical attention score for each sentence. These sentence-by-sentence attention scores form a vector for that head.
We then compute the kurtosis of that vector, and this kurtosis is our measure of the head’s “receiver-headness”. We use kurtosis because it is the standard measure of tailedness. From Wikipedia:
In this context, high tailedness means that attention is narrowed to a few sentences. For example, if you had 100 sentences and 99 of them received zero attention while one received lots of attention, the kurtosis would be very high. This is exactly what we want to measure: how much a given attention head narrows attention to particular sentences.
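To make this concrete, here is a small sketch (our own illustration, not code from the paper) showing that a spiky per-sentence attention vector has a far higher excess kurtosis than a diffuse one:

```python
import numpy as np

def excess_kurtosis(x):
    """Fisher (excess) kurtosis: fourth standardized moment minus 3."""
    x = np.asarray(x, dtype=float)
    m = x.mean()
    m2 = ((x - m) ** 2).mean()  # variance (biased estimator)
    m4 = ((x - m) ** 4).mean()  # fourth central moment
    return m4 / m2**2 - 3.0

# Hypothetical vertical attention scores over 100 sentences for one head.
spiky = np.zeros(100)
spiky[0] = 1.0                         # all attention on a single sentence

rng = np.random.default_rng(0)
diffuse = rng.dirichlet(np.ones(100))  # attention spread across sentences

print(excess_kurtosis(spiky))    # very high (~95): strong narrowing
print(excess_kurtosis(diffuse))  # much lower: little narrowing
```

The spiky vector matches the 99-zeros-one-spike example above; its excess kurtosis equals that of a Bernoulli(0.01) distribution, about 95, while a roughly uniform spread of attention yields a much smaller value.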
In Figure 4 of the paper, see the distribution for head 6 of layer 36, which is spiky; that distribution has a high kurtosis, whereas the non-spiky distributions have lower kurtoses.
How did you learn that vertical attention corresponded to sentences?
This was an assumption baked into the analysis, which specifically defined vertical attention scores as attention toward a sentence. We had some results showing that token-level vertical attention tended to be more similar to other vertical attention scores within a sentence than between sentences, which supports this assumption, but we don’t have more formal results to report. Even without such results, working at the sentence level lets us run analyses contrasting categories, which wouldn’t be possible with tokens.