Note that unlike GPT-2, GPT-3 does use some sparse attention. The GPT-3 paper says the model uses “alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer”.
That’s true, but in the long-context limit, the more expensive dense attention layers should still dominate the total compute, I think.
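Here’s a minimal sketch of why, under a simplified cost model of my own (not from the GPT-3 paper): dense attention costs roughly n²·d FLOPs per layer, while a locally banded pattern with window width w costs roughly n·w·d, so their ratio grows linearly with context length n. The values of d and w below are illustrative, not GPT-3’s actual hyperparameters.

```python
def dense_attention_flops(n: int, d: int) -> int:
    """Rough FLOPs for full (dense) self-attention over n tokens:
    QK^T plus the attention-weighted sum over values."""
    return 2 * n * n * d


def banded_attention_flops(n: int, d: int, w: int) -> int:
    """Rough FLOPs when each token attends only to a local band of width w."""
    return 2 * n * w * d


if __name__ == "__main__":
    d, w = 128, 256  # head dimension and band width (illustrative values)
    for n in (1024, 2048, 4096, 8192):
        dense = dense_attention_flops(n, d)
        banded = banded_attention_flops(n, d, w)
        # The dense/banded ratio is n/w, so in an alternating stack the
        # dense layers dominate total cost as the context grows.
        print(f"n={n:5d}  dense/banded ≈ {dense / banded:.1f}x")
```

With these numbers the dense layers are already 4x the cost of the banded ones at n = 1024 and 32x at n = 8192, which is the sense in which the dense layers dominate in the long run.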