Note that unlike GPT-2, GPT-3 does use some sparse attention. The GPT-3 paper says the model uses “alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer”.
That’s true, but in the long-context limit, the more expensive dense attention layers should still dominate the total compute, I think.
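Here’s a minimal sketch of why, under a simplified cost model of my own (not from the GPT-3 paper): dense attention costs roughly n²·d FLOPs per layer, while a locally banded pattern with window width w costs roughly n·w·d, so their ratio grows linearly with context length n. The values of d and w below are illustrative, not GPT-3’s actual hyperparameters.

```python
def dense_attention_flops(n: int, d: int) -> int:
    """Rough FLOPs for full (dense) self-attention over n tokens:
    QK^T plus the attention-weighted sum over values."""
    return 2 * n * n * d


def banded_attention_flops(n: int, d: int, w: int) -> int:
    """Rough FLOPs when each token attends only to a local band of width w."""
    return 2 * n * w * d


if __name__ == "__main__":
    d, w = 128, 256  # head dimension and band width (illustrative values)
    for n in (1024, 2048, 4096, 8192):
        dense = dense_attention_flops(n, d)
        banded = banded_attention_flops(n, d, w)
        # The dense/banded ratio is n/w, so in an alternating stack the
        # dense layers dominate total cost as the context grows.
        print(f"n={n:5d}  dense/banded ≈ {dense / banded:.1f}x")
```

With these numbers the dense layers are already 4x the cost of the banded ones at n = 1024 and 32x at n = 8192, which is the sense in which the dense layers dominate in the long run.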