Transformers do not natively operate on sequences.
This was a big misconception I had, because so much of the discussion around transformers centers on predicting sequences. It's more accurate to think of a general transformer as operating on an unordered set of tokens. Any notion of sequence order only comes in if you add a positional embedding to tell the transformer how the tokens are ordered, and possibly a causal mask to force attention to flow in only one direction.
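You can check this directly: plain self-attention with no positional information is permutation-equivariant, so shuffling the input tokens just shuffles the outputs the same way. A minimal NumPy sketch (single head, no batching, all names mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Plain single-head self-attention: no positional embedding, no mask.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

d = 8
X = rng.normal(size=(5, d))                      # 5 "tokens"
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(5)
out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(X[perm], Wq, Wk, Wv)

# Permuting the tokens permutes the outputs identically -- the layer
# has no idea which order the tokens "came in".
assert np.allclose(out[perm], out_perm)
```

The assertion holds for any permutation: the layer sees a set, not a sequence.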
In some sense I agree, but I think it's more nuanced than that in practice. Once you add cross-entropy loss on next-token prediction alongside causal masking, you really do get a strong sense of "operating on sequences": next-token prediction is fundamentally sequential, since the entire task is to exploit the correlational structure of sequential data in order to predict what comes next.
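The causal mask is what breaks the permutation symmetry: token i can only attend to tokens 0..i, so position suddenly matters. A small NumPy sketch of what the mask does to the attention weights (zero logits used for illustration, so the surviving weights come out uniform):

```python
import numpy as np

T = 4
scores = np.zeros((T, T))  # stand-in attention logits

# Causal mask: disallow attention to future positions (j > i)
# by setting those logits to -inf before the softmax.
mask = np.triu(np.ones((T, T), dtype=bool), k=1)
scores[mask] = -np.inf

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

weights = softmax(scores)
# Lower-triangular: row i spreads its weight over positions 0..i only,
# so information can flow only from past to future.
print(np.round(weights, 2))
```

Each row sums to 1, but row i has nonzero weight only on positions 0 through i, which is exactly the "attention flows in one direction" constraint mentioned above.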
Has anyone ever trained a transformer that doesn’t suck without any positional information (no positional embedding, no causal mask)?
There’s the atom transformer in AlphaFold-style architectures, although the embeddings it operates on already encode 3D positions from earlier parts of the model, so maybe that doesn’t count.