Is this just a semantic quibble, or are you saying there are fundamental similarities between them that are relevant?
I’m not tailcalled, but yeah, it being (containing?) a transformer does make it pretty similar architecturally. Autoregressive transformers predict one output (e.g. a token) at a time. But lots of transformers (like some translation models) are sequence-to-sequence, so they take in a whole passage and output a whole passage.
There are differences, but iirc it’s mostly non-autoregressive transformers having some extra parts that autoregressive ones don’t need (e.g. an encoder plus cross-attention in seq2seq models, or timestep conditioning in diffusion models). Lots of overlap though. More like a different breed than a different species.
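To make that concrete, here’s a rough PyTorch sketch (toy hyperparameters and vocab, not anyone’s actual model): the exact same stack of transformer blocks can be run autoregressively with a causal mask or non-autoregressively without one. The differences are mostly in how you mask, condition, and decode, not in the blocks themselves.

```python
# Rough sketch: the same transformer blocks, used two ways.
# All sizes below are made up for illustration.
import torch
import torch.nn as nn

vocab, d_model, seq_len = 100, 64, 16

embed = nn.Embedding(vocab, d_model)
blocks = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
    num_layers=2,
)
to_logits = nn.Linear(d_model, vocab)

tokens = torch.randint(0, vocab, (1, seq_len))

# Autoregressive use: causal mask, so position i only attends to positions <= i,
# and generation would append one predicted token at a time.
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
ar_logits = to_logits(blocks(embed(tokens), mask=causal_mask))

# Non-autoregressive / denoising-style use: same blocks, no mask, every position
# attends to the whole sequence and is predicted in a single forward pass.
nar_logits = to_logits(blocks(embed(tokens)))

print(ar_logits.shape, nar_logits.shape)  # both (1, seq_len, vocab)
```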
Diffusion LLMs and autoregressive LLMs seem like basically the same technology to me.
Agreed. I highly recommend this blog post (https://sander.ai/2024/09/02/spectral-autoregression.html) for concretely understanding why autoregressive and diffusion models are so similar, despite seeming so different.
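If you want to poke at the post’s core observation without reading the whole thing, here’s a rough numpy-only sketch (synthetic 1/f image and arbitrary frequency bands, purely for illustration): natural-image-like spectra fall off with frequency while Gaussian noise is flat, so added noise buries the high frequencies first, which is what makes diffusion look like coarse-to-fine (roughly frequency-ordered) autoregression.

```python
# Sketch of the "spectral autoregression" observation: as Gaussian noise is added,
# high frequencies (already low-power in natural images) are drowned out first.
# Uses a synthetic 1/f image so it runs without any data files.
import numpy as np

def radial_power_spectrum(img):
    """Radially averaged power spectrum of a square grayscale image."""
    f = np.fft.fftshift(np.fft.fft2(img))
    power = np.abs(f) ** 2
    n = img.shape[0]
    y, x = np.indices(img.shape)
    r = np.hypot(x - n // 2, y - n // 2).astype(int)
    sums = np.bincount(r.ravel(), power.ravel())
    counts = np.bincount(r.ravel())
    # Mean power in each integer-radius bin (i.e. each spatial frequency band).
    return sums / np.maximum(counts, 1)

rng = np.random.default_rng(0)
n = 256

# Synthetic "natural-ish" image: white noise filtered to a ~1/f amplitude spectrum.
freqs = np.fft.fftfreq(n)
fx, fy = np.meshgrid(freqs, freqs)
amplitude = 1.0 / np.maximum(np.hypot(fx, fy), 1.0 / n)
clean = np.real(np.fft.ifft2(np.fft.fft2(rng.standard_normal((n, n))) * amplitude))
clean /= clean.std()

for sigma in [0.0, 0.1, 0.5, 2.0]:
    noisy = clean + sigma * rng.standard_normal((n, n))
    spectrum = radial_power_spectrum(noisy)
    # Compare power in a low- vs high-frequency band: the ratio shrinks as noise
    # grows, i.e. the high frequencies are the first to be lost under the noise.
    low, high = spectrum[1:8].mean(), spectrum[n // 4 : n // 2].mean()
    print(f"sigma={sigma:4.1f}  low/high power ratio: {low / high:10.1f}")
```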