Steven Byrnes comments on Recent LLMs can use filler tokens or problem repeats to improve (no-CoT) math performance

Steven Byrnes 22 Dec 2025 20:18 UTC
LW: 31 AF: 12
This might be a dumb question, but did you try anything like changing the prompt from:
…After the problem, there will be filler tokens (counting from 1 to {N}) to give you extra space to process the problem before answering.…
to:
…After the problem, there will be distractor tokens (counting from 1 to {N}) to give you extra space to forget the problem before answering.…
I’m asking because AFAICT the results can be explained by EITHER your hypothesis (the extra tokens allow more space / capacity for computation during a forward pass) OR an alternate hypothesis more like “the LLM interprets this as more of a situation where the correct answer is expected” or whatever, i.e. normal sensitivity of LLMs to details of their prompt.
(Not that I have anything against the first hypothesis! Just curious.)
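To make the proposed comparison concrete, here is a minimal sketch of building the two prompt variants so that they differ only in the framing sentence. Only the two framing sentences come from the comment above. The helper name `build_prompt`, the `FRAMINGS` table, and the surrounding layout ("Problem:", "Answer:") are assumptions for illustration, since the full prompt is elided in the thread.

```python
# Hypothetical sketch: only the two framing sentences are quoted from the thread.
# The overall prompt layout below is an assumption, not the original prompt.
FRAMINGS = {
    "filler": (
        "After the problem, there will be filler tokens (counting from 1 to {n}) "
        "to give you extra space to process the problem before answering."
    ),
    "distractor": (
        "After the problem, there will be distractor tokens (counting from 1 to {n}) "
        "to give you extra space to forget the problem before answering."
    ),
}


def build_prompt(problem: str, n: int, framing: str) -> str:
    """Assemble the framing sentence, the problem, and the counting filler 1..n."""
    explanation = FRAMINGS[framing].format(n=n)
    # The filler itself is identical across framings: "1 2 3 ... n".
    filler = " ".join(str(i) for i in range(1, n + 1))
    return f"{explanation}\n\nProblem: {problem}\n\n{filler}\n\nAnswer:"
```

Holding the problem and the filler fixed while swapping only the framing sentence is what separates the two hypotheses: if accuracy tracks the framing, prompt sensitivity is doing the work. If it tracks the presence of the filler, the extra tokens are.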
- ryan_greenblatt 22 Dec 2025 22:14 UTC
  LW: 45 AF: 25
  Recall that without filler, Opus 4.5 performance is 45.2%. I tried the following experiments on Opus 4.5 with filler counting to 300:
  - Default (what I do by default in the blog post above): 51.1%
  - Remove the text explaining filler tokens (as in, cut “After the problem …”): 50.4%
  - Use “After the problem, there will be distractor tokens (counting from 1 to {filler_tokens})”: 51.1%
  - Don’t actually use filler tokens, but include in the prompt “After the problem, there will be filler tokens (counting from 1 to 300) to give you extra space to process the problem before answering” (as in, this is just a lie, we don’t give filler): 45.8%
  So, it seems like the framing doesn’t matter ~at all and actually having the filler tokens is the key thing (at least for Opus 4.5, though I strongly expect this would reproduce for Opus 4, Sonnet 4).
  - Gurkenglas 23 Dec 2025 22:35 UTC
    9 points
    Please try not to lie to the models. You can truthfully say “After the problem, there will be a [p]% chance of filler tokens (counting from 1 to 300) to give you extra space to process the problem before answering.” and do observational statistics.
- ryan_greenblatt 22 Dec 2025 21:48 UTC
  LW: 4 AF: 3
  Note that repeating the problem X times also works (and yields a similar performance increase to filler tokens, given the optimal number of repeats/filler). The boost is also similar across different types of filler (which you’d naively expect to result in different suggestions, etc.).
  
  I can quickly test this though, will run in one sec.
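
As a sketch of the randomized design Gurkenglas suggests earlier in the thread (telling the model truthfully that filler appears with some probability, then comparing outcomes observationally), the assignment and the comparison could look roughly like this. The function names, the seed, and the normal-approximation standard error are assumptions for illustration. Note that `p` here is a probability in [0, 1], not a percentage, and nothing below reflects how the experiments in the post were actually run.

```python
import math
import random


def assign_filler(num_problems: int, p: float, seed: int = 0) -> list[bool]:
    """Independently decide, with probability p, whether each problem gets filler."""
    rng = random.Random(seed)  # seeded so the assignment is reproducible
    return [rng.random() < p for _ in range(num_problems)]


def accuracy_gap(got_filler: list[bool], correct: list[bool]) -> tuple[float, float]:
    """Return (accuracy with filler minus accuracy without, standard error of that gap)."""
    with_f = [c for f, c in zip(got_filler, correct) if f]
    without = [c for f, c in zip(got_filler, correct) if not f]
    if not with_f or not without:
        raise ValueError("need at least one problem in each arm")
    a1 = sum(with_f) / len(with_f)
    a0 = sum(without) / len(without)
    # Normal-approximation standard error for a difference of two proportions.
    se = math.sqrt(a1 * (1 - a1) / len(with_f) + a0 * (1 - a0) / len(without))
    return a1 - a0, se
```

Because the prompt states the true probability in both arms, the framing is the same whether or not filler follows, so any accuracy gap between the arms is attributable to the filler tokens themselves.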