As I understand it, we don’t actually see the chain of thought here but only the final submitted solution. And I don’t think that pressure to save tokens would apply to that.
I’d guess it has something to do with whatever they’re using to automatically evaluate performance in “hard-to-verify domains”. My understanding is that, during training, those entire proofs would have been the final outputs that the reward function (or whatever) took as input and mapped to training signals. So their shape is precisely what the training loop optimized – and if so, this shape is downstream of some peculiarity on that end: the training loop preferring or enforcing this output format.
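The mechanism being speculated about can be sketched as a toy reward function. Everything below – the function names, the length penalty, the grading stub – is invented for illustration and is not a claim about OpenAI’s actual setup:

```python
# Toy illustration of the speculation above: if an automated grader
# scores the *entire final proof* and folds in a length penalty, the
# training loop will push the model toward terse output formats.
# All names and constants are hypothetical.

def grade_proof(proof: str) -> float:
    """Stand-in for an automatic verifier in a hard-to-verify domain.
    Here we just pretend every proof is correct."""
    return 1.0

def reward(proof: str, length_penalty: float = 0.001) -> float:
    correctness = grade_proof(proof)
    # Penalizing token count on the whole submitted proof would make
    # the terse style precisely what the loop optimizes for.
    return correctness - length_penalty * len(proof.split())

verbose = "We now proceed to show, in considerable detail, that " * 5
terse = "By induction on n. Base: trivial. Step: apply the lemma."

# The terse proof gets strictly higher reward, so updates that climb
# this reward would favor its shape.
assert reward(terse) > reward(verbose)
```

If the reward really was computed over the submitted proof itself, the output format would be a direct optimization target rather than a leaked habit.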
If the AI is iterating on solutions, there is actually pressure to reduce the length of draft/candidate solutions. It might then be that OpenAI didn’t implement a clean-up pass on the final solution (even though there wouldn’t be any real pressure to save tokens in that final clean-up).
Well, maybe there’s some transfer? Maybe habits picked up from the CoT die hard and haven’t been trained away by RLHF yet?
I think pressure on the final submitted solution is likely. That would encourage more insightful proofs over long, monotonous case analyses.