At the risk of being too vague to be understood… you can always factorise a probability distribution as P(X1)P(X2|X1) etc., so plain next-token prediction should be able to do the job. But maybe there’s a more natural “causal” factorisation that goes like P(subject)P(verb|subject) etc., which isn’t ordered the same way as the tokens but from which we can derive the token probabilities, and maybe that factorisation is easier to learn than the raw next-token one.
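The chain-rule point can be checked numerically. Here's a toy sketch (the joint distribution is made up, and the brute-force marginalisation is just for illustration): any ordering of the variables in the chain rule reproduces the same joint probabilities, so the "token order" and a hypothetical "causal order" factorisation describe the same distribution.

```python
import itertools

# A made-up joint distribution over three binary "tokens" (X0, X1, X2).
joint = {
    (0, 0, 0): 0.05, (0, 0, 1): 0.10,
    (0, 1, 0): 0.15, (0, 1, 1): 0.20,
    (1, 0, 0): 0.05, (1, 0, 1): 0.15,
    (1, 1, 0): 0.10, (1, 1, 1): 0.20,
}

def marginal(joint, idxs, vals):
    """P(X_i = v for each (i, v) in zip(idxs, vals)), by summing the joint."""
    return sum(p for x, p in joint.items()
               if all(x[i] == v for i, v in zip(idxs, vals)))

def factorised(joint, order, x):
    """Chain-rule product P(X_o1) P(X_o2 | X_o1) ... for a given variable order."""
    p = 1.0
    seen = []
    for i in order:
        num = marginal(joint, seen + [i], [x[j] for j in seen] + [x[i]])
        den = marginal(joint, seen, [x[j] for j in seen]) if seen else 1.0
        p *= num / den
        seen.append(i)
    return p

# Both orderings recover the original joint exactly.
for x in itertools.product([0, 1], repeat=3):
    assert abs(factorised(joint, [0, 1, 2], x) - joint[x]) < 1e-12  # token order
    assert abs(factorised(joint, [2, 0, 1], x) - joint[x]) < 1e-12  # some other "causal" order
```

So mathematically the two factorisations carry the same information; the conjecture is only about which one is easier for a model to *learn*.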
I’ve no idea if this is what gwern meant.