Filler tokens don’t allow for serially deeper cognition than what architectural limits allow
This depends on your definition of serial cognition, under the definitions I like most the serial depth scales logarithmically with the number of tokens. This is because as you increase parallelism (in the sense you use above), that also increases serial depth logarithmically.
The basic intuitions for this are:
If you imagine mechanistically how you would add N numbers together, it seems like you would need logarithmic depth (where you recursively split it in half, compute the sums of each half, and then add the results together). Note that attention heads compute sums of numbers that scale with the number of tokens.
There are physical limits to the total amount of local computation that can be done in a given amount of time, due to speed-of-light constraints. So inasmuch as “serial depth” is supposed to capture intuitively what computation you can do with “time”, it seems like serial depth should increase as total computation goes up (reflecting the fact that you have to break up the computation into local pieces, as in the addition example above).
This depends on your definition of serial cognition, under the definitions I like most the serial depth scales logarithmically with the number of tokens. This is because as you increase parallelism (in the sense you use above), that also increases serial depth logarithmically.
The basic intuitions for this are:
If you imagine mechanistically how you would add N numbers together, it seems like you would need logarithmic depth (where you recursively split it in half, compute the sums of each half, and then add the results together). Note that attention heads compute sums of numbers that scale with the number of tokens.
There are physical limits to the total amount of local computation that can be done in a given amount of time, due to speed-of-light constraints. So inasmuch as “serial depth” is supposed to capture intuitively what computation you can do with “time”, it seems like serial depth should increase as total computation goes up (reflecting the fact that you have to break up the computation into local pieces, as in the addition example above).