The first diagram really helped me understand the concept of K-composition. In fact, composition of heads seems to be responsible for induction heads, rather than the two layers themselves, as many articles seem to claim.
Goutham Nalagatla
Karma: 12
Good point.
At the data level, yes: because the sequence is periodic, an ideal algorithm could infer the block size and phase, then predict the rest from the first block. What I meant is that, for the transformer mechanisms I am testing, the useful operation still has to be implemented from context: the model must identify the earlier matching position or phase and use information from that earlier occurrence.
So the distinction I care about is not “copying” versus “period inference” as abstract algorithms, but which circuit the model actually uses. A pure positional/period shortcut should look different from an induction-style QK matching circuit, and that is why I also look at attention patterns, induction score, and ablations.
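For readers unfamiliar with the metric, here is a minimal sketch of one common way to define an induction score on a periodic sequence. It is not necessarily the exact definition used in these experiments. It assumes a single head's causal attention pattern is available as a square NumPy array whose rows are query positions, and that the period is known. The function names (`induction_score`, `ideal_induction_pattern`, `uniform_causal_pattern`) are hypothetical and exist only for illustration.

```python
import numpy as np


def induction_score(attn: np.ndarray, period: int) -> float:
    """Mean attention from each query to the token that followed the
    previous occurrence of the current token (offset: period - 1 back).

    attn: (seq_len, seq_len) causal attention pattern, rows = queries.
    period: assumed block length of the repeating sequence (>= 1).
    """
    seq_len = attn.shape[0]
    queries = np.arange(period, seq_len)   # positions that have a previous block
    targets = queries - period + 1         # token after the earlier match
    return float(attn[queries, targets].mean())


def ideal_induction_pattern(seq_len: int, period: int) -> np.ndarray:
    """Synthetic pattern that puts all attention on the induction target."""
    attn = np.zeros((seq_len, seq_len))
    for q in range(seq_len):
        # Before the first repeat there is no target; attend to position 0.
        key = q - period + 1 if q >= period else 0
        attn[q, key] = 1.0
    return attn


def uniform_causal_pattern(seq_len: int) -> np.ndarray:
    """Baseline: each query spreads attention evenly over earlier positions."""
    mask = np.tril(np.ones((seq_len, seq_len)))
    return mask / mask.sum(axis=1, keepdims=True)


if __name__ == "__main__":
    seq_len, period = 16, 4
    print(induction_score(ideal_induction_pattern(seq_len, period), period))
    print(induction_score(uniform_causal_pattern(seq_len), period))
```

On a strictly periodic input, a head that attends at a fixed positional offset would score just as highly here as one that matches content through QK, so the score alone does not separate the two. That is why it is read together with the attention patterns and ablations.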