What are Transformers? Like, what is a concrete but accurate-enough conversational way of describing them that doesn’t force me to stop the conversation dead in its tracks to explain jargon like “Convolutional Neural Network” or “Multi-Head Attention”?
It’s weird that I can tell you roughly how the Transformers in a text encoder-decoder like T5 are different from the autoregressive Transformers that generate the text in ChatGPT (T5 is parallel, ChatGPT sequential), or how I can even talk about ViT and DiT transformers in image synthesis (ViTs, like Stable Diffusion, downsample and upsample the image, performing operations on the entire latent; DiTs work on patches). But I don’t actually have a clear definition of what a transformer is.
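For what it’s worth, the parallel/sequential contrast mostly comes down to the attention mask, which I can sketch in a few lines of toy numpy (made-up sequence length, not any real model’s code):

```python
import numpy as np

# Toy illustration of the parallel-vs-sequential contrast:
# an encoder like T5's lets every position see every other position,
# while a decoder like ChatGPT's masks out the future so each token
# can only attend to itself and what came before it.
n = 4  # made-up sequence length

encoder_mask = np.ones((n, n), dtype=bool)           # all positions visible
decoder_mask = np.tril(np.ones((n, n), dtype=bool))  # past + self only

print(decoder_mask.astype(int))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```

Everything else about the layer is basically the same; the mask is what forces the "one word at a time" behavior.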
And if I was in a conversation with someone who doesn’t know much about compsci (i.e. me, especially 5 months ago), how would I explain it:
“well for text models it is a mechanism that after a stream of words has been tokenized (i.e. blue might be “bl” and “ue” which each have a special id number) and the embeddings retrieved based on those token id numbers which are then used to compute the Query, Key, and Value vectors which often use a similarity measure like cosine similarity to compare the embedding of this key vector to the Qu—HEY, WHERE ARE YOU GOING! I DIDN’T EVEN GET TO THE NORMALIZATION!”
Obviously this isn’t a definition, this is a “how it works” explanation, and what I’ve just written as an example is heavily biased towards the decoder. But if someone asks me “what is a transformer?”, what is a simple way of saying it in conversation?
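Funnily enough, that whole run-on fits in a few lines of toy numpy (random weights, a made-up two-token vocabulary, and scaled dot product rather than cosine similarity, so this is a sketch of the idea, not any real model):

```python
import numpy as np

np.random.seed(0)

# Toy setup: a made-up 2-token vocabulary and tiny dimensions,
# just to trace the pipeline from the run-on above.
vocab = {"bl": 0, "ue": 1}           # token -> id number
d = 4                                # embedding size
embeddings = np.random.randn(len(vocab), d)

ids = [vocab["bl"], vocab["ue"]]     # "blue" tokenized
x = embeddings[ids]                  # retrieve embeddings, shape (2, d)

# Learned projections (random here) turn each embedding into Q, K, V
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
Q, K, V = x @ Wq, x @ Wk, x @ Wv

# Compare every query to every key (a scaled dot product in the
# original Transformer), then normalize the scores with a softmax...
scores = Q @ K.T / np.sqrt(d)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

# ...and mix the value vectors by those weights: one updated,
# context-aware vector per token.
out = weights @ V
```

Which is exactly why it kills a conversation: the mechanism is short, but every line needs its own vocabulary lesson.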
I think when explaining it to non-technical people, just saying something like “it’s a big next word predictor” is close enough to the truth to work.
Not for my purposes. For starters, I use a lot of image and video generation, and even there you have U-Nets and DiTs, so I need something more generalized. Also, if I’m not mistaken, what you’ve described only applies to autoregressive transformers like ChatGPT. Compare that to, say, T5, which is not autoregressive.