In some sense I agree but I do think it’s more nuanced than that in practice. Once you add in cross-entropy loss on next token prediction, alongside causal masking, you really do get a strong sense of “operating on sequences”. This is because next token prediction is fundamentally sequential in nature in that the entire task is to make use of the correlational structure of sequences of data in order to predict future sequences.
In some sense I agree but I do think it’s more nuanced than that in practice. Once you add in cross-entropy loss on next token prediction, alongside causal masking, you really do get a strong sense of “operating on sequences”. This is because next token prediction is fundamentally sequential in nature in that the entire task is to make use of the correlational structure of sequences of data in order to predict future sequences.