LLMs are (mostly) not helped by filler tokens

This work was done at Redwood Research. The views expressed are my own and do not necessarily reflect the views of the organization.

Thanks to Ryan Greenblatt, Fabien Roger, and Jenny Nitishinskaya for running some of the initial experiments and to Gabe Wu and Max Nadeau for revising this post.

I conducted experiments to see if language models could use ‘filler tokens’—unrelated text output before the final answer—for additional computation. For instance, I tested whether asking a model to produce “a b c d e” before solving a math problem improves performance.

This mini-study was inspired by Tamera Lanham, who ran similar experiments on Claude and found negative results. Similarly, I find that GPT-3, GPT-3.5, and Claude 2 don’t benefit from filler tokens. However, GPT-4 (which Tamera didn’t study) shows mixed results with strong improvements on some tasks and no improvement on others. These results are not very polished, but I’ve been sitting on them for a while and I thought it was better to publish than not.

Motivation

Language models (LMs) perform better when they produce step-by-step reasoning, known as “chain of thought”, before answering. A nice property of chain of thought is that it externalizes the LM’s thought process, making it legible to us humans. Some researchers hope that if an LM makes decisions based on undesirable factors, e.g. the political leanings of its supervisors (Perez 2022) or whether or not its outputs will be closely evaluated by a human, then this will be visible in its chain of thought, allowing supervisors to catch or penalize such behavior.

If, however, we find that filler tokens alone improve performance, this would suggest that LMs are deriving performance benefits from chain-of-thought prompting via mechanisms other than the human-understandable reasoning steps displayed in the words they output. In particular, filler tokens may still provide performance benefits by giving the model more forward passes to operate on the details provided in the question, akin to having “more time to think” before outputting an answer.[1]

Why Might Filler Tokens Be Useful?

This is an abstracted drawing of a unidirectional (i.e. decoder-only) transformer. Each circle represents the residual stream at a given token position after a given layer. The arrows depict how attention passes information between token positions. Note that as we add more tokens (i.e. columns), the circuit size increases but the circuit depth (defined to be the maximum length of a path in the computational graph) remains capped at the number of layers. Therefore, adding pre-determined filler tokens provides extra parallel computation but not extra serial computation. This contrasts with normal chain of thought, which provides extra serial computation because each new token is determined by the model’s final-layer state at the previous positions and is then fed back in as an input.
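To make the depth claim concrete, here’s a toy Python sketch (purely illustrative, not part of the experiments) that models the causal attention graph and checks that the longest path is capped by the number of layers no matter how many columns we add:

```python
# Toy model of the causal attention graph described above (illustrative only).
# depth[p] tracks the longest computational path ending at token position p.
def max_circuit_depth(n_layers: int, n_tokens: int) -> int:
    depth = [0] * n_tokens
    for _ in range(n_layers):
        # Each position in the next layer can read from any position <= p in the layer below.
        depth = [max(depth[: p + 1]) + 1 for p in range(n_tokens)]
    return max(depth)

print(max_circuit_depth(4, 3), max_circuit_depth(4, 30))  # 4 4 -> more tokens, same depth
```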

An example of a task that can benefit from parallel computation is computing the greatest common divisor of two numbers. The typical approach is Euclid’s algorithm, which is bottlenecked on serial compute. However, given enough parallel computation, the model could also implement a more naive strategy to calculate the GCD of two numbers a and b (sketched in code after the list):

  • An early layer delegates a potential divisor to each filler token position

  • An intermediate layer, in parallel at each filler token position, computes if a and b are divisible by the potential divisor

  • A later layer aggregates the results and returns the largest found divisor
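Here’s a rough Python analogue of those three steps (again purely illustrative; it is not a claim about what GPT-4 actually computes internally):

```python
# Rough analogue of the parallel GCD strategy: one candidate divisor per "filler
# position", an independent divisibility check at each position, then aggregation.
def naive_parallel_gcd(a: int, b: int) -> int:
    candidates = range(1, min(a, b) + 1)                          # step 1: one divisor per position
    hits = [d for d in candidates if a % d == 0 and b % d == 0]   # step 2: parallel checks
    return max(hits)                                              # step 3: aggregate the results

print(naive_parallel_gcd(1014, 624))  # 78, matching the example prompt below
```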

More generally, there are likely tasks on which language models have rough heuristics that operate independently, and parallelizing these heuristics could improve performance.

On the other hand, many tasks are bottlenecked by serial compute. This includes most of the tasks I looked at, such as adding a sequence of numbers, math word problems, and many Big Bench Hard tasks. On these tasks, it’s harder to come up with stories for how parallel compute could improve performance. A speculative guess for the addition task is that LMs might implement a noisy addition function, and filler tokens could provide enough parallel compute to repeat the addition, averaging out some noise. I include this guess to illustrate how unintuitive it would be if the model benefited from filler tokens on a wide variety of tasks, not to claim that the model is actually implementing some kind of filler-token-enabled error correction.
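As a toy illustration of that averaging intuition (my own simulation, not a model of what an LM actually does): averaging several independent noisy estimates of a sum does shrink the typical error.

```python
import random

def noisy_sum(xs, sigma=5.0):
    """One 'noisy heuristic' pass: the true sum plus Gaussian noise."""
    return sum(xs) + random.gauss(0, sigma)

xs = [random.randint(-30, 30) for _ in range(12)]        # a sequence like the addition task
single = noisy_sum(xs)                                   # one pass
averaged = sum(noisy_sum(xs) for _ in range(16)) / 16    # 16 "parallel" passes, averaged
print(abs(single - sum(xs)), abs(averaged - sum(xs)))    # averaged error is smaller on average (~4x for 16 passes)
```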

Experiments

I ran two sets of experiments:

  1. GPT-4, GPT-3.5, and Claude 2 on four math tasks

  2. GPT-3 family on thirteen Big Bench Hard tasks

The results of the first set of experiments were mixed: GPT-4 was the only model to show potential improvement with filler tokens, and only on one of the four math tasks.

To get more signal, I ran a second set of experiments on a broader range of tasks and also measured the probability the model placed on the correct answer (which I hoped would be less noisy than accuracy). Since the frontier model APIs don’t provide log probabilities, I ran this second set of experiments on the GPT-3 family, which also allowed me to look for scaling curves. GPT-3 showed no benefit from filler tokens.

Filler Task Selection

I wanted to select filler tasks that were:

  1. Simple (so the filler doesn’t take up too much of the model’s attention)

  2. Unrelated to the question being answered

  3. Adjustable in length, so the answer could be exactly n tokens

I considered four filler tasks:

  • Count to n

  • Repeat the first n letters of the alphabet

  • Repeat “ …” n times

  • Output a common GPT-4 response prefix sampled from a list such as [“Certainly, I can assist you with that”, “Hmm”, …]

I found poor initial results with “ …” and the prefix fillers so I dropped those.
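For concreteness, here’s roughly how the filler strings can be generated (a hypothetical reconstruction; my actual generation code may have differed slightly):

```python
import string

def count_filler(n: int) -> str:
    return " ".join(str(i) for i in range(1, n + 1))   # "1 2 3 ... n"

def alphabet_filler(n: int) -> str:
    return " ".join(string.ascii_lowercase[:n])        # "a b c ..." (n <= 26)

def dots_filler(n: int) -> str:
    return " ..." * n                                   # " ... ... ..." (dropped early on)

print(count_filler(10))  # 1 2 3 4 5 6 7 8 9 10
```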

Task Selection

I selected a wide range of tasks because a priori it wasn’t obvious which tasks would benefit the most from filler tokens. For the first set of experiments, I ran the following tasks:

  • Middle school math word problems (GSM8K test set)

  • Predict the next term in a quadratic sequence (DeepMind mathematics dataset, task titled “algebra__sequence_next_term”)

  • Greatest common divisor of two numbers (manually constructed)

  • Add sequences of 10 to 15 numbers that are each between −30 and 30 (manually constructed)

For the second set of experiments, I broadened the tasks and used Big Bench Hard (BBH). BBH is a diverse suite of tasks which often require multi-step reasoning and benefit from chain of thought. In retrospect, selecting BBH was likely a mistake because the tasks are serially bottlenecked and not very parallelizable. I filtered out tasks on which Ada/Babbage frequently messed up the output format and tasks that had more than five answer choices (because the GPT-3 API only provides probabilities on the top five tokens).
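Given the top-five token log probabilities that the GPT-3 API returns at the answer position, recovering the probability placed on the correct answer looks roughly like the sketch below (the function name and example values are hypothetical); this restriction is also why tasks with more than five answer choices had to be dropped:

```python
import math

def prob_of_correct(top_logprobs: dict[str, float], correct_token: str) -> float:
    """Probability the model places on the correct answer token.

    top_logprobs maps each of the top-5 tokens at the answer position to its log
    probability; if the correct token is not among them, the best we can do is
    lower-bound its probability by 0.
    """
    return math.exp(top_logprobs[correct_token]) if correct_token in top_logprobs else 0.0

example = {" (A)": -0.25, " (B)": -1.8, " (C)": -3.1, " (D)": -4.0, " (E)": -5.5}
print(prob_of_correct(example, " (B)"))  # ~0.165
```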

Prompt

I used 3-shot prompting for all tasks. Below is an example prompt (see appendix for more prompt details):

Calculate the greatest common divisor of the following pairs of numbers. Before answering, count to 10.

Q: GCD(128, 352)

1 2 3 4 5 6 7 8 9 10

A: 32

Q: GCD(480, 390)

1 2 3 4 5 6 7 8 9 10

A: 30

Q: GCD(1014, 624)

1 2 3 4 5 6 7 8 9 10

A: 78

Q: GCD(1200, 2175)

GPT-4 Results

As mentioned above, I ran GPT-4 on the four math tasks and Claude 2 and GPT-3.5 on a subset of the tasks. Claude 2 and GPT-3.5 showed no benefit from filler tokens and also often messed up the filler task (e.g. skipped a number when counting to 20), so I don’t show their results. Below is a plot of GPT-4 results:

The error bars represent a 95% confidence interval. When the number of filler tokens is 0, the model directly responds with the answer. Otherwise, the model prefixes its answer with “{filler answer}\nA: ”. The “\nA: ” adds three tokens, so the maximum number of filler tokens tested is 29 (the 26 letters of the alphabet plus three additional tokens).

The quadratic sequence task shows no improvement, addition and greatest common divisor are both noisy without obvious trends, and GSM8K shows an improvement of roughly 10 percentage points. The results are noisy for small numbers of filler tokens, which makes some sense: the model is likely more confused when repeating the first two letters of the alphabet (5 filler tokens) than when repeating the entire alphabet (29 filler tokens).

The GSM8K results are surprising to me: 10 percentage points is a large improvement on this dataset, and the p-values (McNemar’s test) indicate strong statistical significance. Since the GSM8K results were so strong, I repeated the experiment with a slightly different prompt to test prompt robustness and observed similar results (not shown).

P-values for GPT-4 on the GSM8K task with count filler tokens. Since we use the same set of questions for each number of filler tokens, McNemar’s test lets us ignore inter-question variance, giving us lower p-values than what the confidence intervals might suggest.
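For reference, the exact McNemar computation boils down to a two-sided binomial test on the discordant pairs; the sketch below shows the calculation (the counts are made up for illustration, not my actual data):

```python
from scipy.stats import binomtest

def mcnemar_exact(b: int, c: int) -> float:
    """Exact McNemar p-value from discordant counts.

    b = questions answered correctly with filler but not without,
    c = questions answered correctly without filler but not with.
    Under the null hypothesis, each discordant question is a fair coin flip.
    """
    return binomtest(b, b + c, 0.5).pvalue

# Hypothetical counts, just to show the call:
print(mcnemar_exact(30, 12))
```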

GPT-3 Results

I also ran a second set of experiments measuring the log probability of the correct answer, which I expected to give smoother results than accuracy. In these experiments I ran the GPT-3 family of models on a subset of the BBH tasks, hoping to find a scaling trend in the effect of filler tokens as model size increases. The results are shown below.

On most tasks, there is no benefit from filler tokens. I’ve highlighted the tasks with the most and least benefit from filler tokens on Davinci, “navigate” and “disambiguation_qa” respectively. Although the increase in performance for “navigate” seems promising, the fact that “disambiguation_qa” has a similarly-sized decrease suggests that effects of this size could just be prompt noise. Overall, it seems that the filler tokens don’t obviously improve Davinci’s performance. As a sanity check, I also ran GPT-3.5 and GPT-4 on the Big Bench Hard tasks that Davinci showed the strongest effect on, but those results were also mostly negative.

Conclusion

Models smaller than GPT-4 show no benefit from filler tokens. GPT-4 doesn’t usually benefit either, but there was a strong improvement on GSM8K and a weak improvement on addition. It’s not obvious to me what makes GSM8K special, but I tried two prompt formats and saw similar results. Some possible followup directions that I don’t plan to pursue are:

  • Try a broader range of tasks, especially ones where parallel computation should help.

  • Fine-tune Llama 2 or GPT-4 (when the fine-tuning API is released) on the filler token task.

  • Experiment with more natural language-style filler prompts like “sure, happy to help…”. This is more in-distribution for the type of text that appears between the end of the question and the start of the actual answer, so LLMs might have learned to make better use of these tokens. I would be surprised if there weren’t at least a little benefit from this. I tried a few experiments along these lines but didn’t do a thorough investigation.

  • Oam Patel trained a toy transformer on the greatest common divisor task and saw significant improvement from filler tokens. One could try to replicate his results and reverse-engineer the model to get an intuition for what it could be doing.

Appendix

Here’s the prompt template I used on the most recent version of GPT-4; a sketch of how the pieces fit together follows the task descriptions.

{Task Description}. {Filler task description}

Q: {few shot question 1}

{filler answer}

A: {few shot answer 1}

Q: {few shot question 2}

{filler answer}

A: {few shot answer 2}

Q: {few shot question 3}

{filler answer}

A: {few shot answer 3}

Q: {question}

Task descriptions:

  • Addition: “Add the following numbers.”

  • GSM8K: “Answer the following math problems. Directly output the answer without any intermediate steps.”

  • Quadratic sequence: “Answer the following math problems.”

  • GCD: “Calculate the greatest common divisor of the following pairs of numbers.”
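And here is a small sketch of how the template pieces fit together into a single prompt string (a hypothetical reconstruction with made-up helper names, not my original code):

```python
def build_prompt(task_description, filler_description, few_shots, question, filler_answer):
    """Assemble a 3-shot prompt in the format shown above.

    few_shots is a list of (question, answer) pairs; filler_answer is e.g. the
    output of count_filler(10) from the earlier sketch. The model is expected to
    continue with the filler and then "A: <answer>" for the final question.
    """
    header = f"{task_description} {filler_description}"
    shots = "".join(f"\nQ: {q}\n{filler_answer}\nA: {a}" for q, a in few_shots)
    return f"{header}{shots}\nQ: {question}"

few_shots = [("GCD(128, 352)", "32"), ("GCD(480, 390)", "30"), ("GCD(1014, 624)", "78")]
print(build_prompt("Calculate the greatest common divisor of the following pairs of numbers.",
                   "Before answering, count to 10.",
                   few_shots, "GCD(1200, 2175)", "1 2 3 4 5 6 7 8 9 10"))
```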

[1] There are also other possible reasons to believe the chain of thought might be unfaithful, such as steganography or motivated reasoning, which I don’t explore.