I don’t think this plot shows what you claim it shows. To me it looks like there is no specialisation between long-range and short-range attention.
My main argument for this interpretation is that the green and red lines move in almost perfect synchronisation. This suggests that attending to tokens 5-10 tokens away happens in the same layers as attending to tokens 0-5 tokens away. The fact that the blue line drops more sharply only shows that close context is very important, not that it is processed first, given that all three lines start dropping right away.
What it looks like to me:
1. In layers 0-10, the model is gradually taking in more and more context. Short-range context is generally more important in determining the residual activations (i.e. more truncation → lower cosine similarity), but there is no particular layer specialisation. The blue line bottoms out earlier, but that looks like a ceiling effect (floor effect?) to me.
2. In layers 10-14, the network does some in-place processing.
3. In layers 15-29, the network reads from other tokens again.
4. In the last two layers, the network finalises its next-token prediction.
(Very low confidence on 2-4, since the effects of an earlier lack of context can be amplified by later in-place processing, which would confound any interpretation of the graph in later layers.)
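For concreteness, here is the quantity I'm assuming each line of the plot shows (a minimal numpy sketch; the function names and exact setup are my guesses, not your actual code): for each layer, the cosine similarity between the residual-stream activation at a token position computed with the full context and the same activation computed with the context truncated to the last k tokens.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two activation vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def truncation_curve(resid_full, resid_truncated):
    """One line of the plot, as I understand it.

    resid_full, resid_truncated: arrays of shape (n_layers, d_model),
    the residual-stream activation at a single token position, computed
    with the full context vs. with the context truncated to the last
    k tokens. Returns the per-layer cosine similarity.
    """
    return np.array([cosine_similarity(f, t)
                     for f, t in zip(resid_full, resid_truncated)])
```

On this reading, the curve drops at whichever layer reads in information that the truncation removed, which is why I'm treating "where the drop happens" as a proxy for "where that range of attention happens", modulo the amplification confound I flagged.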
Sorry, I don’t quite understand your arguments here. Five tokens back and 10 tokens back are both within what I would consider short-range context; e.g. they are typically within the same sentence. I’m not saying there are layers that look at the most recent five tokens and then other layers that look at the most recent 10 tokens, etc. It’s more that early layers aren’t looking 100 tokens back as much. The lines being parallel-ish is consistent with our hypothesis.
Thanks for responding. I was imagining “local” to mean fewer than 5 or 10 tokens away, partly anchored on the example of detokenisation from the previous posts in the sequence, but also because that’s what you’re looking at. If your definition of “local” is longer than 10 tokens, then I’m confused why you didn’t show the results for longer truncations. I thought the point of the plot was to show what happens if you include the local context but cut the rest.
Even if there is specialisation going on between local and long range, I don’t expect a sharp cutoff between what is local vs. non-local (and I assume neither do you). If some such soft boundary exists and it were in the 5-10 range, then I’d expect the 5- and 10-token context lines not to be so correlated. But if you think the soft boundary is further away, then I agree that this correlation doesn’t say much.
Attempting to re-state what I read from the graph: looking at the green line, the fact that most of the drop in cosine similarity happens in the early layers suggests that longer-range attention (more than 10 tokens away) is mostly located in the early layers. The fact that the blue and red lines have their largest drops in the same regions suggests that short-ish (5-10) and very short (0-5) attention is also mostly located there. I.e. the graph does not give evidence of range specialisation across different attention layers.
Did you also look at the statistics of attention distance for the attention patterns of various attention heads? I think that would be an easier way to settle this. Although maybe there is some technical difficulty in ruling out irrelevant attention that is just an artifact of attention weights needing to sum to one?
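To sketch what I mean (a hypothetical numpy example, not anything from your experiments): given a head's attention pattern, you can compute the attention-weighted average query-key distance, optionally zeroing out small weights as a crude way to discount the near-uniform "leftover" attention I mentioned.

```python
import numpy as np

def mean_attention_distance(attn, min_weight=0.0):
    """Attention-weighted average query-key distance per head.

    attn: array of shape (n_heads, seq_len, seq_len); rows are query
    positions, columns are key positions. Entries below min_weight are
    zeroed and each row renormalised, to crudely ignore attention that
    only exists because each row must sum to one.
    """
    n_heads, q_len, k_len = attn.shape
    dist = np.abs(np.arange(q_len)[:, None] - np.arange(k_len)[None, :])
    a = np.where(attn >= min_weight, attn, 0.0)
    row_sums = a.sum(axis=-1, keepdims=True)
    a = a / np.where(row_sums == 0, 1.0, row_sums)
    # expected distance per (head, query), then averaged over queries
    return (a * dist[None, :, :]).sum(axis=-1).mean(axis=-1)
```

A previous-token head would come out with an average distance near 1, while a head attending 100 tokens back would stand out immediately; plotting this statistic per layer seems like a more direct test of range specialisation than the truncation curves.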