Not for my purposes. For starters, I do a lot of image and video generation, and even there you have U-Nets and DiTs, so I need something more general. Also, if I’m not mistaken, what you’ve described only applies to decoder-only autoregressive transformers like the ones behind ChatGPT. Compare that with, say, T5, which is an encoder-decoder model pretrained on span corruption rather than plain next-word prediction.
I think when explaining it to non-technical people, just saying something like “it’s a big next-word predictor” is close enough to the truth to work.