I’m a bit confused by the paper’s claim that its empirical results support innovations being compute-dependent when they only test MQA (and, IMO, show unclear results even in that case). It almost seems like they forgot to include results, or didn’t realize they had only tested MQA. For example, they talk about what their empirical results generally found for compute-dependent advancements (a class of 1), and the section on Sparse Attention links to the empirical results section even though Sparse Attention doesn’t actually appear there, as far as I can tell!