Text diffusion models are still LLMs, just not autoregressive.
Is this just a semantic quibble, or are you saying there are fundamental similarities between them that are relevant?
I'm not tailcalled, but yeah, the fact that it is (contains?) a transformer does make it pretty similar architecturally. Autoregressive transformers predict one output (e.g. a token) at a time, whereas lots of transformers (like some translation models) are sequence-to-sequence: they take in a whole passage and output a whole passage.
There are differences, but IIRC it's mostly that non-autoregressive transformers have some extra parts autoregressive ones don't need. Lots of overlap, though. More like a different breed than a different species.
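To make the distinction concrete, here's a minimal sketch of the two decoding styles; `model`, `encoder`, and `decoder` are hypothetical placeholders, not any particular library's API:

```python
# Minimal sketch; `model`, `encoder`, and `decoder` are hypothetical
# placeholders, not any particular library's API.

def autoregressive_generate(model, prompt_ids, max_new_tokens):
    """One output per step: each new token conditions on everything so far."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(ids)  # forward pass over the sequence so far
        next_id = max(range(len(logits)), key=logits.__getitem__)  # greedy argmax
        ids.append(next_id)  # feed the pick back in and repeat
    return ids

def seq2seq_generate(encoder, decoder, source_ids):
    """Whole passage in, whole passage out (schematically)."""
    memory = encoder(source_ids)  # read the entire input at once
    return decoder(memory)        # emit the entire output
```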
Diffusion LLMs and autoregressive LLMs seem like basically the same technology to me.
Agreed. I highly recommend this blog post (https://sander.ai/2024/09/02/spectral-autoregression.html) for concretely understanding why autoregressive and diffusion models are so similar, despite seeming so different.
I disagree. In practice, diffusion models are run autoregressively when generating non-trivial amounts of text. A better way to think about them is as a generalization of multi-token prediction (similar to how DeepSeek does it), where the number of tokens you get to predict in one shot is controllable and steerable. If you use a diffusion model over a longer generation, you end up running it autoregressively over blocks; in the limits, you could make it work like a normal one-token-at-a-time LLM, or go all the way up to one big batch of N tokens in a single shot.
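Roughly, block-wise decoding looks like this; `denoise` is a made-up stand-in for the model's iterative unmasking step, not anyone's real API. The knob is `block_size`: 1 behaves like a normal one-token-at-a-time LLM, N gives you one big batch in a single shot.

```python
# Sketch of block-wise (semi-autoregressive) decoding with a text
# diffusion model. `denoise` and MASK are hypothetical placeholders.

MASK = -1  # hypothetical mask-token id

def diffusion_generate(denoise, prompt_ids, n_tokens, block_size, steps=8):
    out = list(prompt_ids)
    remaining = n_tokens
    while remaining > 0:
        block = [MASK] * min(block_size, remaining)  # fresh, fully-masked block
        for _ in range(steps):
            block = denoise(out, block)  # fill in / revise, conditioned on `out`
        out.extend(block)        # commit the block: autoregressive over blocks,
        remaining -= len(block)  # parallel within a block
    return out
```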
Point is they’re still LLMs.
Yeah, the thing they aren’t is transformers.
EDIT: I stand corrected. I tended to think of diffusion models as necessarily a classic stack of convolution/neural-network layers, but obviously that's just the image case, and there's no reason not to use a transformer instead. So I realise the two things are decoupled: what makes a diffusion model is its training objective, not its architecture.
You can train transformers as diffusion models (example paper), and that’s presumably what Gemini diffusion is.
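For intuition, a training step under a masked-diffusion objective might look roughly like this. It's a PyTorch-style sketch following the discrete "absorbing state" recipe from the masked-diffusion literature; the exact loss weighting varies by paper, and no claim is made that this is Gemini Diffusion's actual setup:

```python
# Sketch of one training step for a transformer under a masked-diffusion
# objective. Edge cases (e.g. a batch with zero masked positions) are
# ignored for brevity.

import torch
import torch.nn.functional as F

def diffusion_train_step(transformer, tokens, mask_id, optimizer):
    # Sample a noise level t, then mask roughly that fraction of positions.
    t = torch.rand(())                                  # t ~ U(0, 1)
    corrupt = torch.rand(tokens.shape) < t              # which positions to mask
    noisy = torch.where(corrupt, torch.full_like(tokens, mask_id), tokens)

    # Same transformer architecture as an LLM, but used bidirectionally:
    # it sees the whole (partially masked) sequence at once.
    logits = transformer(noisy)                         # (batch, seq, vocab)

    # Cross-entropy only on the masked positions: recover the clean tokens.
    loss = F.cross_entropy(logits[corrupt], tokens[corrupt])
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```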
Fair, you can use the same architecture just fine instead of simple NNs. It's really a distinction between your choice of universal function approximator and the objective you optimise it against, I guess.
The thing they aren't is one-step cross-entropy. That's it; everything else is presumably sampled from the same distribution as existing LLMs. (This is like if someone finally upgraded BERT to be a primary model.)
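For contrast with the diffusion loss sketched above, the "one-step cross-entropy" objective being replaced is just next-token prediction with a causal mask (same PyTorch-style sketch conventions; not any specific codebase):

```python
# Standard autoregressive objective: one-step cross-entropy on the next
# token, where position i only attends to positions <= i.

import torch.nn.functional as F

def next_token_loss(transformer, tokens):
    logits = transformer(tokens[:, :-1])   # causal: predict token i+1 from <= i
    targets = tokens[:, 1:]                # shift targets by one position
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```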