In general, computing a boolean expression with terms without the signal being drowned out by the noise will require if the noise is correlated, and if the noise is uncorrelated.
Shouldn’t the second one be ?
the past token can use information that is not accessible to the token generating the key (as it is in its “future” – this is captured e.g. by the attention mask)
Is this meant to say “last token” instead of “past token”?
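For reference on the quoted point: the "attention mask" mentioned there is presumably a standard causal mask, which is what prevents a token from attending to keys at later positions. A minimal NumPy sketch (an illustration, not code from the original; the function names here are my own):

```python
import numpy as np

def causal_mask(seq_len):
    # mask[q, k] is True when the query at position q may attend to the
    # key at position k. Only same-or-earlier positions are allowed, so
    # the token generating a key never "sees" information from its future,
    # while a later (query) token can use information the key token lacked.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_attention_scores(scores):
    # Set disallowed (future-key) entries to -inf before the softmax,
    # so they receive zero attention weight.
    seq_len = scores.shape[-1]
    return np.where(causal_mask(seq_len), scores, -np.inf)
```

This is the asymmetry the quote points at: position 2's query can read key 0, but position 0's query can never read key 2.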
I sympathize somewhat with this complexity point, but I’m worried that training will be extremely non-Bayesian in a way that makes complexity arguments not really work. So I feel like the point about optimization power at best cuts the worry about hyperstition by a factor of about 2. Perhaps there should be research on how “sticky” biases from early in training are in the face of later optimization pressure.