I’m not sure if this is helpful (you might already know), but in Let’s Think Dot By Dot, they found that LLMs could use filler tokens to improve computation, but they had to be specially trained for it to work. By default, the extra tokens didn’t help.
Actually, I hadn’t seen this article! Thank you very much; it looks very interesting, as do the references cited therein. However, I suspect the distribution from which the “filler tokens” (or extra tokens) are drawn matters, as does their ordering (that is, not just “…”, “abcd”, or “<pause>”, but something more sophisticated might be more useful to a model). It would be very interesting to determine which “filler sequences” are most suitable for hiding computations on specific tasks (this is one of the directions we are working on) and which circuits, if any, are responsible for it.
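For what it’s worth, here is a rough sketch of the kind of comparison I have in mind: score the same answer under different filler sequences and see which one helps. Everything here is a placeholder (the model name, the toy task, the candidate fillers, the log-probability scoring), not the setup from the paper or from our experiments.

```python
# Illustrative sketch only: compare how different filler sequences affect
# a causal LM's log-probability of the correct answer on a toy task.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; any causal LM would do
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

# Candidate "filler sequences" inserted between the question and the answer cue.
FILLERS = {
    "dots": " ." * 20,
    "letters": " abcd" * 5,
    "none": "",
}

def answer_logprob(question: str, answer: str, filler: str) -> float:
    """Log-probability the model assigns to `answer` given question + filler."""
    prompt = f"{question}{filler} Answer:"
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    answer_ids = tokenizer(" " + answer, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Score only the answer tokens (shift by one for next-token prediction).
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    start = prompt_ids.shape[1] - 1
    answer_positions = range(start, start + answer_ids.shape[1])
    return sum(
        log_probs[pos, input_ids[0, pos + 1]].item() for pos in answer_positions
    )

if __name__ == "__main__":
    q = "Q: What is 17 + 25?"
    for name, filler in FILLERS.items():
        print(name, answer_logprob(q, "42", filler))
```

Of course, per the paper’s finding, a model that hasn’t been trained to exploit filler tokens may show no difference between these conditions; the point of the sketch is just the shape of the comparison.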