This seems like it’s only a big deal if we expect diffusion language models to scale at a pace comparable or better to more traditional autoregressive language transformers, which seems non-obvious to me.
Right now my distribution over possible scaling behaviors is pretty wide, I’m interested to hear more from people.
My understanding was that diffusion refers to a training objective, and isn’t tied to a specific architecture. For example, OpenAI’s Sora is described as a diffusion transformer. Do you mean you expect diffusion transformers to scale worse than autoregressive transformers? Or do you mean you don’t think this model is a transformer in terms of architecture?
Oops, I wrote that without fully thinking about diffusion models. I meant to contrast diffusion LMs to more traditional autoregressive language transformers, yes. Thanks for the correction, I’ll clarify my original comment.
If it’s trained from scratch, and they release details, then it’s one data point for diffusion LLM scaling. But if it’s distilled, then it’s zero points of scaling data.
Because we’re not interested in scaling that is distilled from a larger parent model: it doesn’t push the frontier, since it doesn’t help you get the next, larger parent model.
Apple also has LLM diffusion papers, with code. It seems like diffusion might be helpful for alignment and interp, because it would have a more interpretable and manipulable latent space.
Why would we expect that to be the case? (If the answer is in the Apple paper, just point me there)
Oh, it’s not explicitly in the paper, but in Apple’s version they have an encoder/decoder with an explicit latent space. That space should be much easier to work with and to steer than the hidden states we have in transformers.
With an explicit and nicely behaved latent space, we would have a much better chance of finding a predictive “truth” neuron, where intervening on it reveals deception 99% of the time, even out of sample. Right now, mechinterp research achieves much less, partly because transformers have quite confusing activation spaces (attention sinks, suppressed neurons, etc.).
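To make the hoped-for payoff concrete, here is a minimal sketch of fitting a linear “truth” probe (the direction-valued generalization of a single truth neuron) and intervening along it, assuming we really had a well-behaved per-paragraph latent. Everything here is an assumption for illustration: `encode`/`decode`, the training latents, and the honesty labels are hypothetical, not anything from the Apple paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_truth_direction(latents, labels):
    """Fit a linear probe on paragraph latents labeled honest (1) vs. deceptive (0)."""
    probe = LogisticRegression(max_iter=1000).fit(latents, labels)
    direction = probe.coef_[0]
    return direction / np.linalg.norm(direction), probe

def steer_toward_truth(z, direction, alpha=3.0):
    """Nudge a single latent vector along the 'honest' direction before decoding."""
    return z + alpha * direction

# Hypothetical usage (encode/decode stand in for the diffusion LM's encoder/decoder):
# direction, probe = fit_truth_direction(train_latents, train_labels)
# z = encode(paragraph)
# p_honest = probe.predict_proba(z.reshape(1, -1))[0, 1]
# steered_paragraph = decode(steer_toward_truth(z, direction))
```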
I think what you’re saying is that because the output of the encoder is a semantic embedding vector per paragraph, that results in a coherent latent space that probably has nice algebraic properties (in the same sense that eg the Word2Vec embedding space does). Is that a good representation?
That does seem intuitively plausible, although I could also imagine that there might have to be some messy subspaces for meta-level information, maybe eg ‘I’m answering in language X, with tone Y, to a user with inferred properties Z’. I’m looking forward to seeing some concrete interpretability work on these models.
Yes, that’s exactly what I mean! If we have Word2Vec-like properties, steering and interpretability would be much easier and more reliable. And I do think it’s a promising research direction, though not a certain one.
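As a toy illustration of the kind of algebraic structure we’re hoping carries over from word vectors to paragraph latents (the vectors and vocabulary here are placeholders, not from any real model):

```python
import numpy as np

def nearest(query, vocab, exclude=()):
    """Return the key in `vocab` whose vector is most cosine-similar to `query`."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((k for k in vocab if k not in exclude), key=lambda k: cos(vocab[k], query))

# The classic Word2Vec-style analogy test, which we'd love paragraph latents to pass too:
# vocab = {"king": v1, "queen": v2, "man": v3, "woman": v4, ...}   # embedding vectors
# target = vocab["king"] - vocab["man"] + vocab["woman"]
# nearest(target, vocab, exclude={"king", "man", "woman"})         # ideally "queen"
```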
Facebook also built an interesting tokenizer that lets LLMs operate in a much richer embedding space: https://github.com/facebookresearch/blt. They embed patches of bytes, split by entropy/surprise. So it might be another way to test the hypothesis that a better embedding space would provide nice Word2Vec-like properties.
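My rough mental model of the BLT splitting rule, as a simplified sketch; the threshold value and the small byte-LM supplying `next_byte_dists` are stand-ins, not their actual implementation:

```python
import math

def entropy(probs):
    """Shannon entropy (bits) of a next-byte distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def split_into_patches(byte_seq, next_byte_dists, threshold=2.5):
    """Cut a byte stream into patches wherever the predicted entropy spikes.

    `next_byte_dists[i]` is a small byte-LM's distribution over byte i given the prefix;
    both it and `threshold` are stand-ins for whatever BLT actually uses.
    """
    patches, current = [], []
    for b, dist in zip(byte_seq, next_byte_dists):
        if current and entropy(dist) > threshold:
            patches.append(bytes(current))   # surprise spike: close the patch, start a new one
            current = []
        current.append(b)
    if current:
        patches.append(bytes(current))
    return patches
```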
If Gemini Diffusion is distilled from a bigger LLM, then it’s still useful, because a similar result is obtained with less compute. Consider o3 and o4-mini, where the latter is only a little less powerful and far cheaper. And that’s ignoring the possibility of amplifying Gemini Diffusion and then re-distilling it, obtaining GemDiff^2, etc. If this IDA process turns out to be far cheaper than the one for autoregressive LLMs, then we get a severe increase in capabilities per unit of compute...
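For reference, the distill half of that amplify-and-distill loop usually amounts to matching the student’s output distribution to the amplified teacher’s. A generic sketch, nothing Gemini-specific:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Match the student's next-token distribution to the (amplified) teacher's."""
    t = temperature
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    teacher_p = F.softmax(teacher_logits / t, dim=-1)
    # KL(teacher || student), scaled by t^2 as in standard distillation.
    return F.kl_div(student_logp, teacher_p, reduction="batchmean") * t * t

# One IDA-style round (pseudocode names, not a real API):
# amplified = amplify(student)                 # e.g. more denoising steps, or search
# loss = distillation_loss(student(batch), amplified(batch).detach())
```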
Good point! And it’s plausible because diffusion seems to provide more supervision and get better results in generative vision models, so it’s a candidate for scaling.
There are some use-cases where quick and precise inference is vital: for example, many agentic tasks (like playing most MOBAs or solving a physical Rubik’s cube; debatably most non-trivial physical tasks) require quick, effective, and multi-step reasoning.
Current LLMs can’t do many of these tasks for a multitude of reasons; one of those reasons is the time it takes to generate responses, especially with chain-of-thought reasoning. A diffusion-based LLM could actually respond to novel events quickly, using a superbly detailed chain of thought, on only ‘commodity’ and therefore cheaper hardware (no wafer-scale chips or other exotic hardware, just GPUs).
If non-trivial physical tasks (like automatically collecting and doing laundry) require detailed CoTs (somewhat probable, 60%), and these tasks are very economically relevant (this seems highly probable to me, 80%), then the economic case for training diffusion LLMs only requires them to have near-comparable scaling to traditional autoregressive LLMs; the economic use cases for fast inference would more than justify the higher training requirements (~48% for the conjunction).
Yeah, diffusion LLMs could be important not for being better at predicting what action to take, but for hitting real-time latency constraints, because they intrinsically amortize their computation more cleanly over steps. This is part of why people were exploring diffusion models in RL: a regular bidirectional or unidirectional LLM tends to be all-or-nothing, in terms of the forward pass, so even if you are doing the usual optimization tricks, it’s heavyweight. A diffusion model lets you stop in the middle of the diffusing, or use that diffusion step to improve other parts, or pivot to a new output entirely.
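A minimal sketch of that “stop in the middle of the diffusing” property, with `denoise_step` as a hypothetical stand-in for one refinement pass of a diffusion LM:

```python
import time

def anytime_decode(x_noisy, denoise_step, deadline_s, max_steps=64):
    """Run denoising steps until a wall-clock deadline and return whatever we have.

    `denoise_step(x, step)` is a stand-in for one refinement pass of a diffusion LM.
    An autoregressive model has no comparable way to hand back a partial-but-usable
    answer if you cut it off mid-generation.
    """
    start = time.monotonic()
    x = x_noisy
    for step in range(max_steps):
        if time.monotonic() - start > deadline_s:
            break                      # out of time: the current draft is still usable
        x = denoise_step(x, step)      # each step sharpens the whole sequence a bit
    return x
```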
A diffusion LLM in theory can do something like plan a sequence of future actions+states in addition to the token about to be executed, and so each token can be the result of a bunch of diffusion steps from a long time ago. This allows a small fast model to make good use of ‘easy’ timesteps to refine its next action: it just spends the compute to keep refining its model of the future and what it ought to do next, so at the next timestep, the action is ‘already predicted’ (if things were going according to plan). If something goes wrong, then the existing sequence may still be an efficient starting point compared to a blank slate, and quickly update to compensate. And this is quite natural compared to trying to bolt on something to do with MoEs or speculative decoding or something.
So your robot diffusion LLM can be diffusing a big context of thousands of tokens, which represents its plan and predicted environment observations over the next couple of seconds. At each timestep it does a little more thinking to tweak each token a little bit, and despite this being only a few milliseconds of thinking each time by a small model, it eventually turns into a highly capable robot model’s output, with each action-token ready by the time it’s needed (and even if it’s not fully done, at least it is there to be executed: a low-quality action choice is often better than blowing the deadline and doing some default action like a no-op). You could do the same thing with a big classic GPT-style LLM, but the equivalent-quality forward pass might take 100ms, and now it’s not fast enough for good robotics (without spending a lot of time on expensive hardware or optimization).
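Putting that picture together, the control loop might look roughly like the sketch below; every callable in it is a hypothetical stand-in, so read it as a diagram of the scheme rather than how any real robot stack is wired up.

```python
def receding_horizon_loop(refine, observe, execute, plan, pad, ticks=1000, budget_ms=5.0):
    """Rolling-plan control with a diffusion LM; every callable is a hypothetical stand-in.

    refine(plan, obs, budget_ms) -> plan : a few milliseconds of denoising over the window
    observe() -> obs                     : latest sensor reading
    execute(action)                      : act in the world
    pad() -> token                       : blank/noise token appended as the window slides
    """
    for _ in range(ticks):
        obs = observe()
        # Even if reality diverged from the plan, the stale plan is a warm start
        # for refinement rather than a blank slate.
        plan = refine(plan, obs, budget_ms)
        action, plan = plan[0], plan[1:] + [pad()]   # slide the window forward one step
        execute(action)   # a rough-but-ready action beats blowing the real-time deadline
    return plan
```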
For sure. It might be nothing, or it might be everything.