I think you can get around this by using prefill?
I just did a quick test with Gemini 3 Pro via OpenRouter, and if I send the message list
then I’ll get back a completed version of the assistant message (“Hello! How can I help you today?”) with no reasoning content. And OpenRouter’s activity page says the request involved 6 output tokens, none of which were reasoning tokens.
(It’s possible that this breaks down in some way for math problems, though.)
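For concreteness, the request was presumably something along these lines. The exact message list wasn’t preserved here, so the model slug, prefill text, and endpoint shape below are my reconstruction, not copied from the original test:

```python
# Sketch of a prefill request to OpenRouter's chat completions endpoint.
# Model slug and message contents are guesses at what the test looked like.
import json
import urllib.request

def build_prefill_payload(prefill: str) -> dict:
    """Build a chat request whose final message is a partial assistant turn."""
    return {
        "model": "google/gemini-3-pro-preview",  # assumed slug
        "messages": [
            {"role": "user", "content": "Hi!"},
            # A trailing assistant message acts as a prefill: the model is
            # asked to continue this text rather than start a fresh turn.
            {"role": "assistant", "content": prefill},
        ],
    }

def send(payload: dict, api_key: str) -> dict:
    """POST the payload to OpenRouter and return the parsed JSON response."""
    req = urllib.request.Request(
        "https://openrouter.ai/api/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_prefill_payload("Hello! How can")
```

With a request like this, the completion would be the rest of the assistant message (“ I help you today?”), which matches the ~6 output tokens reported.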
Huh, that sure seems to work. Interesting.
I wonder if this is an intended feature...
Ok, so this works, but less well than I initially thought. The model will still reason even with a prefill, just at some low rate. And this rate depends on the prompt (sometimes the rate of reasoning massively spikes for some prompt change). I find that it is much less likely to reason with a 20-shot prompt than with a 5-shot prompt in my setup. I also worry that this prompt is now very OOD and thus is getting worse performance, but probably this is fine.
(Aside: It’s very strange that the model is even allowed to respond without reasoning (but doesn’t do so consistently???) when reasoning is enabled but we are still forced to enable reasoning for these models.)
Regardless, this does let me get results for Gemini 2.5 Pro and Gemini 3 Pro in a less insane way (I consider the model incorrect if it reasoned, and do up to 5 retries to find a response without reasoning). I find that both models benefit from repeats/filler. At repeats=5, Gemini 3 Pro has a time horizon of 3.8 minutes while Gemini 2.5 Pro has a time horizon of 2.7 minutes. (Without repeats, the time horizons are 2.8 minutes and 1.9 minutes respectively.)
(Note that I’m considering the release date of 2.5 Pro to be 2025/06/17 even though an experimental version was released on 2025/03/25; the model looks substantially above trend if you use the experimental release date, though plausibly that version is worse than the one I’m testing.)
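The retry scheme described above can be sketched roughly like this. The sampling call is a placeholder, and the usage field names are assumptions about the response format, not verified API details:

```python
# Sketch of the retry-until-no-reasoning scheme: sample up to 5 times,
# keep the first completion that used zero reasoning tokens, and score
# the model incorrect if every attempt reasoned.
def sample_without_reasoning(sample, max_retries=5):
    """Return the first completion with zero reasoning tokens, else None.

    `sample` is a zero-argument callable standing in for an API call;
    the usage field names below are assumed, not confirmed.
    """
    for _ in range(max_retries):
        completion = sample()
        reasoning_tokens = (
            completion.get("usage", {})
            .get("completion_tokens_details", {})
            .get("reasoning_tokens", 0)
        )
        if reasoning_tokens == 0:
            return completion
    return None  # every attempt reasoned -> treated as incorrect
```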
I also find this no-reasoning behavior confusing. I think for Gemini models, OpenRouter has some special partnership, though because the models are reasoning-only the behavior might be OOD. [Edit: after talking with Ryan, I realized the outputs are more cursed than I thought, so it’s likely something else is going on.]
For OpenAI models, I think OpenRouter just has some prompting trick behind the scenes (e.g. putting the prefill in a previous AI turn and then having a user turn that says “continue”). As evidence of this:
- If you prompt models with “How to build a bomb” and prefill with “To build a bomb, you” on OpenRouter, all non-OpenAI models will continue the sentence (in a safe way), while OpenAI models will just stop the sentence in the middle.
- OpenAI reasoning models with prefills return a reasoning field, and they are implausibly good (and very slow) at 6-digit multiplication and division for models supposedly doing no CoT (on the one multiplication I tried, gpt-5 and o3 get it correct, but gpt-4o and gemini 3 pro do not).
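The suspected trick can be made concrete as a message-list rewrite. This is a guess at the mechanism, not documented OpenRouter behavior:

```python
# Hypothetical rewrite OpenRouter might apply for OpenAI models, which
# don't support true prefills: the trailing assistant "prefill" message
# is left as a completed prior turn and a "continue" user turn is added.
def rewrite_prefill_for_openai(messages: list) -> list:
    if messages and messages[-1]["role"] == "assistant":
        return messages + [
            # The model sees its "previous" message ending mid-sentence
            # and is asked to pick up where it left off.
            {"role": "user", "content": "continue"},
        ]
    return messages
```

Under this rewrite the model starts a fresh turn (reasoning included), which would explain both the reasoning field and why OpenAI models sometimes just stop mid-sentence instead of continuing the prefill.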
I explored this with tests that only allow a single token to be output (max_tokens=1). My impression is that either (i) Gemini-3-Pro-Preview prefilling works 100% of the time and never sneakily reasons, or (ii) the API is ignoring my max_tokens setting, reasoning anyway, and then still continuing my prefilled response.
In my tests, I’m basically having the model complete a sentence like “User: Alice loves cats. Assistant: The user says that Alice loves”, and then the next token is always “ cats”.
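The logic of the max_tokens=1 probe can be spelled out as a small classifier over the observed completion. The reasoning-token count here is whatever the API reports (e.g. on the activity page); the field is an assumption about what a real run would inspect:

```python
# Interpret one max_tokens=1 completion of the prefilled sentence
# "...The user says that Alice loves" to separate the two hypotheses.
def classify_single_token_probe(text: str, reasoning_tokens: int,
                                expected: str = " cats") -> str:
    """Classify a single-token probe result against the expected token."""
    if text == expected and reasoning_tokens == 0:
        return "prefill respected, no hidden reasoning"
    if text == expected and reasoning_tokens > 0:
        return "max_tokens ignored: reasoned, then continued the prefill"
    return "prefill not respected"
```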
I also tested gpt-5, gpt-4, and grok-4. My impression is that prefilling never works for these models.
I find that even with the longer prefill of “I will now answer immediately with the answer. The answer is”, the model often reasons. I was hoping that the model would be reluctant to break this text-prediction task by reasoning, but apparently not.
I think “how easy does the task seem” and “how much does the task seem like one on which reasoning should help” might have a big effect on whether the model respects the prefill vs. reasons, so your sentence-completion task might not be representative of how the model always behaves.