Much is made of the fact that LLMs are ‘just’ doing next-token prediction. But there’s an important sense in which that’s all we’re doing too: through a predictive processing lens, the core thing our brains are doing is predicting the next bit of input from current and past input. In our case the input is multimodal; for LLMs it’s tokens. One important distinction is that LLMs (during training) can’t affect their stream of input, so they’re myopic in a way that we’re not. But as far as the prediction itself goes, I’m not sure there’s a strong difference in kind.
Would you disagree? If so, why?