I agree with this, and I think LLMs already do this for non-filler tokens. A sufficiently deep LLM could wait and do all of its processing right as it generates a token, but in practice they start thinking about a problem as they read it.
For example, if I ask an LLM “I have a list of [Paris, Washington DC, London, …] — what country is the 2nd item in the list in?”, the LLM has likely already parallelized the country lookups while it was reading the city names. If you prevent the LLM from parallelizing (“List the top N capitals by country size in your head and output the 23rd item in the list”), it does much worse, even though the task’s serial depth is small.