Pre-registering a prediction for experiment results testing why, when, and how attention heads share information. We (simplex) will train transformers on data generated from the Cartesian product of sequences generated by two independent Mess3 processes. If the degenerate eigenvalue of each process is the same and positive (e.g. lambda = 0.7 for both), then a transformer with a single attention head will learn it, and the attention pattern will be 0.7^(d-s), where d - s is the number of context positions between the source and destination. If instead one process has lambda = 0.7 and the other lambda = 0.3, then the information will separate between two heads, one with attention pattern 0.7^(d-s) and the other with 0.3^(d-s). If instead one process has lambda = 0.7 and the other lambda = -0.3, then this will require three heads: 0.7^(d-s) on one head, and the 0.3^(d-s) pattern split between two heads to account for the fact that attention weights must be non-negative. BUT if one process has lambda = 0.7 and the other lambda = -0.7, then this will require only two heads! The 0.7^(d-s) pattern can be shared on one head, serving the lambda = 0.7 process at all offsets and the lambda = -0.7 process when d - s is even (where (-0.7)^(d-s) = 0.7^(d-s)), while the odd d - s cases for the -0.7 process go on the other head.
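To make the head-count predictions concrete, here is a minimal Python sketch of the counting rule implied above. The function name and the exact sharing rule are my own illustrative framing of the prediction, not a claim about what the trained models will actually do: each non-negative lambda contributes one full-range pattern |lambda|^(d-s), each negative lambda splits into an even-offset and an odd-offset part (since attention weights must be non-negative), and the even-offset part can ride on an existing full-range head with the same decay rate.

```python
def predicted_head_count(lambdas):
    """Tally the predicted number of attention heads for a transformer
    trained on the Cartesian product of independent Mess3 processes,
    given each process's degenerate eigenvalue lambda.

    Sharing rule (illustrative reading of the prediction above):
      * lambda >= 0: one full-range pattern |lambda|^(d-s); processes
        with identical positive lambda share a single head.
      * lambda < 0: the sign-alternating pattern lambda^(d-s) cannot be
        realized by a single non-negative attention pattern, so it splits
        into an even-offset part and an odd-offset part of |lambda|^(d-s).
        The even-offset part can share an existing full-range head with
        the same decay rate; the odd-offset part needs its own head.
    """
    full_rates = set()   # decay rates of full-range (lambda >= 0) heads
    extra_heads = 0      # additional heads forced by negative lambdas

    for lam in lambdas:
        if lam >= 0:
            full_rates.add(abs(lam))      # equal rates merge into one head

    for lam in lambdas:
        if lam < 0:
            extra_heads += 1              # odd-offset part: always its own head
            if abs(lam) not in full_rates:
                extra_heads += 1          # even-offset part: no head to share with

    return len(full_rates) + extra_heads

# The four cases discussed above and their predicted head counts.
for lams in [(0.7, 0.7), (0.7, 0.3), (0.7, -0.3), (0.7, -0.7)]:
    print(lams, "->", predicted_head_count(lams))
# expected: 1, 2, 3, 2
```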