Filler tokens don’t allow for serially deeper cognition than architectural limits allow (n layers of processing), but they could well allow for solving a higher fraction of “heavily serial” reasoning tasks[[1]] insofar as the LLM can still benefit from more parallel processing. For instance, an AI might by default be unable to do some serial step within 3 layers, but be able to do that step within those same 3 layers if it can parallelize the work across a bunch of filler tokens. This could functionally allow for more serial depth, unless the AI is strongly bottlenecked on serial depth with no way for more parallelism to help (e.g., the shallowest viable computation graph has depth K, K is greater than the number of layers, and the LLM can’t do multiple nodes of the graph in a single layer[[2]]).
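A toy sketch of the width-for-depth tradeoff (plain Python, illustrative names only, not a claim about any real architecture): with a fixed number of serial rounds standing in for layers, more parallel slots per round standing in for filler-token positions lets a fixed-depth reduction handle larger inputs.

```python
# Toy model: extra parallel work per round substitutes for serial depth.
# `fan_in` plays the role of parallel width (filler-token positions);
# the number of rounds plays the role of serial depth (layers).
def tournament_max(xs, fan_in):
    rounds = 0
    while len(xs) > 1:
        # One "round": reduce each group of `fan_in` values in parallel.
        xs = [max(xs[i:i + fan_in]) for i in range(0, len(xs), fan_in)]
        rounds += 1
    return xs[0], rounds

# With 8 inputs: fan_in=2 needs 3 serial rounds, fan_in=8 needs only 1.
print(tournament_max(list(range(8)), 2))  # (7, 3)
print(tournament_max(list(range(8)), 8))  # (7, 1)
```

The point of the sketch: if the task fits a fixed-fan-in reduction, adding width shrinks the required depth; if the task’s shallowest computation graph is deeper than the available rounds regardless of fan-in, width doesn’t help.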
Filler tokens don’t allow for serially deeper cognition than what architectural limits allow
This depends on your definition of serial cognition; under the definitions I like most, serial depth scales logarithmically with the number of tokens. This is because increasing parallelism (in the sense you use above) also increases serial depth logarithmically.
The basic intuitions for this are:
If you imagine mechanistically how you would add N numbers together, it seems like you would need logarithmic depth (recursively split the list in half, compute the sum of each half, then add the two results together). Note that attention heads compute sums over a number of values that scales with the number of tokens.
There are physical limits to the total amount of local computation that can be done in a given amount of time, due to speed-of-light constraints. So inasmuch as “serial depth” is supposed to capture intuitively what computation you can do with “time”, it seems like serial depth should increase as total computation goes up (reflecting the fact that you have to break up the computation into local pieces, as in the addition example above).
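The recursive-halving intuition above can be sketched directly (plain Python, illustrative only): summing N numbers by pairwise combination takes about log2(N) serial layers, where each layer does all of its additions in parallel.

```python
def tree_sum(xs):
    """Sum N numbers with O(log N) serial depth via pairwise combination."""
    depth = 0
    while len(xs) > 1:
        # One serial "layer": all pairwise additions happen in parallel.
        xs = [xs[i] + xs[i + 1] if i + 1 < len(xs) else xs[i]
              for i in range(0, len(xs), 2)]
        depth += 1
    return xs[0], depth

total, depth = tree_sum(list(range(16)))
print(total, depth)  # 120 4  (depth == ceil(log2(16)))
```

So doubling the number of values being combined adds only one serial layer, which is the sense in which achievable serial depth grows logarithmically with parallel width.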
I agree with this, and I think LLMs already do this for non-filler tokens. A sufficiently deep LLM could wait and do all of its processing right as it generates a token, but in practice LLMs start thinking about a problem as they read it.
For example, if I ask an LLM “I have a list of [Paris, Washington DC, London, …], what country is the 2nd item in the list in?”, the LLM has likely already parallelized the country lookup while it was reading the city names. If you prevent the LLM from parallelizing (“List the top N capitals by country size in your head and output the 23rd item in the list”), it will do much worse, even though the serial depth is small.
[[1]] Or at least reasoning tasks that seem heavily serial to humans.
[[2]] With a wide enough layer, you can technically do anything, but this can quickly get extremely impractical.