That context-less intermediate tokens improve LLM performance isn’t surprising, given the theoretical analysis showing that generating intermediate tokens allows constant-size transformers to solve more complex problems (see Denny Zhou’s presentation).
It is indeed curious that recent LLMs show significantly improved performance here compared to older ones. But I wonder whether there’s a different, better explanation than “meta-cognition”.
Note that the linked video is talking about tokens the LLM picks itself, rather than fixed tokens.
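To make that distinction concrete, here’s a minimal Python sketch of the two prompt shapes (the `complete` stub and the example question are hypothetical, just for illustration):

```python
# Minimal sketch contrasting the two setups discussed above.
# The `complete` stub and the question are placeholders, not a real API.

def complete(prompt: str) -> str:
    """Stand-in for an LLM completion call; wire up a real client here."""
    raise NotImplementedError

QUESTION = "If a train travels 60 mph for 2.5 hours, how far does it go?"

# (1) Fixed, context-less intermediate tokens: a run of meaningless
# filler tokens before the answer. The extra positions buy the
# transformer extra serial computation even though they carry no content.
filler_prompt = f"Q: {QUESTION}\nA: {'.' * 50} The answer is"

# (2) Model-chosen intermediate tokens (the chain-of-thought setting the
# linked video discusses): the model generates its own reasoning tokens
# before committing to an answer.
cot_prompt = f"Q: {QUESTION}\nA: Let's think step by step."
```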