Leon Lang comments on OpenAI Claims IMO Gold Medal

Leon Lang 19 Jul 2025 12:37 UTC
18 points
0
The proofs look very different from how LLMs typically write, and I wonder how that emerged. Much more concise. Most sentences are not fully grammatically complete. A bit like how a human would write if they don’t care about form and only care about content and being logically persuasive.
- Rauno Arike 19 Jul 2025 13:15 UTC
  12 points
  2
  The writing style looks fairly similar to the examples shown in Baker et al. (2025), so it seems plausible that this is a general consequence of doing a lot of RL training, rather than something specific to the methodology used for this model. It’s still concerning, but I’m happy that it doesn’t look noticeably less readable than the examples in the Baker et al. paper.
- MondSemmel 19 Jul 2025 14:31 UTC
  8 points
  0
  Maybe tokens are expensive, so there’s optimization pressure towards conciseness?
  - Seth Herd 19 Jul 2025 22:03 UTC
    2 points
    −4
    Yes, I suspect this is the root of the issue. There are strong economic incentives to optimize for shorter sequences that produce correct answers. It’s great that this hasn’t harmed legibility of the chain of thought yet, but this pressure will likely encourage jargon that could quickly make the CoT unreadable to humans. I see this as one of the main dangers for effectively faithful CoT. And most of the reasonable hopes for aligning LLM-based AGI that I can see route through faithful CoT.
    
    There’s still the possibility that a fresh version of the same model will understand and be happy to correctly interpret the CoT if it’s become a unique language of thought. But that’s a lot shakier than a CoT that can be read by any model.
    - Leon Lang 19 Jul 2025 23:05 UTC
      8 points
      8
      As I understand it, we don’t actually see the chain of thought here but only the final submitted solution. And I don’t think that a pressure to save tokens would apply to that.
      - Daniel Kokotajlo 20 Jul 2025 1:41 UTC
        11 points
        6
        Well, maybe there’s some transfer? Maybe habits picked up from the CoT die hard & haven’t been trained away with RLHF yet?
        - Thane Ruthenis 20 Jul 2025 10:59 UTC
          3 points
          0
          I’d guess it has something to do with whatever they’re using to automatically evaluate the performance in “hard-to-verify domains”. My understanding is that, during training, those entire proofs would have been the final outputs which the reward function (or whatever) would have taken in and mapped to training signals. So their shape is precisely what the training loop optimized – and if so, this shape is downstream of some peculiarities on that end, the training loop preferring/enforcing this output format.
      - ryan_greenblatt 21 Jul 2025 17:39 UTC
        4 points
        0
        If the AI is iterating on solutions, there is actually pressure to reduce the length of draft/candidate solutions. Then, it might be that OpenAI didn’t implement a clean up pass on the final solution (even though there wouldn’t be any real pressure to save tokens in the final clean up).
      - Thomas Dybdahl Ahle 20 Jul 2025 21:37 UTC
        1 point
        0
        I think pressure on the final submitted solution is likely. That will encourage more insightful proofs over long monotonous case studies.
- David Johnston 20 Jul 2025 1:23 UTC
  5 points
  2
  It’s a research prototype, so probably not style tuned
- Tao Lin 20 Jul 2025 1:10 UTC
  5 points
  1
  This is OpenAI’s CoT style. See the original o1 blog post: https://openai.com/index/learning-to-reason-with-llms/
  - Leon Lang 20 Jul 2025 9:21 UTC
    7 points
    1
    Here is a screenshot of a chain of thought from the blog post you link:
    This looks different from the IMO solutions to me and doesn’t have the patterns I mentioned. E.g., the sentences are grammatically complete.
- Mikhail Samin 19 Jul 2025 23:19 UTC
  3 points
  0
  This is false. It is exactly how RLed LLMs write.
  - Leon Lang 20 Jul 2025 9:16 UTC
    4 points
    0
    Fwiw, I’ve recently used o3 a lot for requesting proofs, and it writes very differently.
    Could you give an example of an RLed LLM that writes like these examples?
    Though I agree with Rauno’s comment that it does look like the chain of thought examples from the Baker et al. paper.
    - Mikhail Samin 20 Jul 2025 10:57 UTC
      6 points
      0
      A friend runs a startup where they do a lot of RL on CoT on a narrow class of tasks. She shares those with me, but I can’t share them, sorry. The style is incredibly similar/immediately pattern-matched though.
      I think r1 should also have a similar style in its CoT.
      The proofs o3 outputs are going to be different from proofs it writes as it thinks about them. This is the style RLed models think in.
    - ACCount 20 Jul 2025 10:57 UTC
      3 points
      2
      Did you have access to the full o3 reasoning trace, or just the final output? The two are not the same style at all.
      - Leon Lang 20 Jul 2025 11:05 UTC
        4 points
        3
        Only the output! I thought Mikhail was referring to the output here, as this is what we see for the IMO problems.
        But as I see it now, the consensus seems to be something like “The chain of thought of new models does look like the IMO problem solutions, and if you don’t train the model to produce final answers that look nice to humans, then they will look like the chain of thought. Probably the experimental model’s answers were not yet trained to look nice”.
        Is this your position? I think that’s pretty plausible.
        - ACCount 20 Jul 2025 12:26 UTC
          1 point
          0
          You get the gist. I don’t think I’ve ever seen this specific style, but raw reasoning traces can end up looking even weirder.
- AlphaAndOmega 19 Jul 2025 14:07 UTC
  3 points
  1
  It sounds a lot like what a human trying to tackle a difficult maths problem might mutter as they did so, without those bits trimmed out of the final product for clarity.