I’m not tailcalled, but yeah, it being (containing?) a transformer does make it pretty similar architecturally. Autoregressive transformers predict one output (e.g. a token) at a time. But lots of transformers (like some translation models) are sequence-to-sequence, so they take in a whole passage and output a whole passage.
There are differences, but iirc it’s mostly that non-autoregressive transformers have some extra parts (e.g. an encoder stack and cross-attention in the decoder) that autoregressive, decoder-only ones don’t need. Lots of overlap though. More like a different breed than a different species.
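If a concrete picture helps, here’s a rough PyTorch sketch of the contrast. The toy dimensions and toy “models” are my own assumptions for illustration, not any particular real system: a decoder-only autoregressive model generates one token at a time behind a causal mask, while an encoder-decoder seq2seq model adds an encoder stack and cross-attention so the decoder can see the whole source passage.

```python
# Purely illustrative sketch (PyTorch); toy dimensions and "models" are
# assumptions for the sake of contrast, not any specific real architecture.
import torch
import torch.nn as nn

vocab, d_model = 1000, 64
embed = nn.Embedding(vocab, d_model)

# Decoder-only / autoregressive: predict the next token, one step at a time.
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
decoder_only = nn.TransformerEncoder(layer, num_layers=2)  # self-attention only
to_logits = nn.Linear(d_model, vocab)

tokens = torch.tensor([[1, 5, 7]])                 # the passage so far
for _ in range(3):                                 # generate 3 more tokens
    L = tokens.size(1)
    causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
    h = decoder_only(embed(tokens), mask=causal)   # each position sees only the past
    next_tok = to_logits(h[:, -1]).argmax(-1, keepdim=True)
    tokens = torch.cat([tokens, next_tok], dim=1)

# Encoder-decoder / seq2seq: whole source passage in, target side out.
# The "extra parts": an encoder stack plus cross-attention inside the decoder.
seq2seq = nn.Transformer(d_model=d_model, nhead=4,
                         num_encoder_layers=2, num_decoder_layers=2,
                         batch_first=True)
src = embed(torch.randint(0, vocab, (1, 10)))      # whole source passage
tgt = embed(torch.randint(0, vocab, (1, 8)))       # target-side inputs
out = seq2seq(src, tgt)                            # decoder attends to encoder output
```

(And even for the seq2seq model, the decoder side is usually still run token by token at inference, which is part of why there’s so much overlap.)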