Pretty much everybody is looking into test-time compute and RLVR right now. How come (seemingly) nobody else has found out about this “new general-purpose method” before OpenAI?
Well, someone has to be the first, and they got to RLVR itself first, last September.
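(For anyone unfamiliar with the acronym: RLVR is reinforcement learning with verifiable rewards, i.e. RL where the reward comes from an automatic checker, such as matching a known math answer or passing unit tests, rather than from a learned preference model. Below is a minimal sketch of that loop; the `policy` object, its methods, and the `Answer:` convention are hypothetical stand-ins, not anyone’s actual pipeline.)

```python
# Minimal RLVR sketch: the reward is computed by a programmatic verifier
# rather than a learned reward model. All names here are illustrative.

def extract_answer(completion: str) -> str:
    # Hypothetical parser: take whatever follows the last "Answer:" marker.
    return completion.rsplit("Answer:", 1)[-1].strip()

def verify(completion: str, reference_answer: str) -> float:
    # Binary, automatically checkable reward: 1.0 if correct, else 0.0.
    return 1.0 if extract_answer(completion) == reference_answer else 0.0

def rlvr_step(policy, problems, num_samples: int = 8) -> None:
    # Sample several completions per problem, score each with the verifier,
    # then feed (prompt, completion, reward) triples to any policy-gradient
    # update rule (PPO, GRPO, ...). `policy` is a hypothetical wrapper.
    batch = []
    for prompt, reference_answer in problems:
        for _ in range(num_samples):
            completion = policy.sample(prompt)
            batch.append((prompt, completion, verify(completion, reference_answer)))
    policy.update(batch)
```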
OpenAI has been shown to not be particularly trustworthy when it comes to test and benchmark results
Silently sponsoring FrontierMath and receiving access to the question sets, and, if I remember correctly, o3 and o3-mini performing worse on a later evaluation done on a newer private question set of some sort. Also, whatever happened with their irreproducible ARC-AGI results: they later explicitly confirmed that the model Arc Prize got access to in December was different from the released versions, with different training and a special compute tier, despite OpenAI employees claiming that the version of o3 used in the evaluations was fully general and not tailored towards specific tasks.
someone has to be the first
Sure, but I’m just quite skeptical that it’s specifically the lab known for endless hype that got there first. Besides, a lot fewer people were looking into RLVR at the time o1-preview was released, so the situations aren’t exactly comparable.
Silently sponsoring FrontierMath and receiving access to the question sets, and, if I remember correctly, o3 and o3-mini performing worse on a later evaluation done on a newer private question set of some sort
IIRC, that worse performance was due to using a worse/less adapted agentic scaffold, rather than OpenAI making the numbers up or engaging in any other egregious tampering. Regarding ARC-AGI, the December-2024 o3 and the public o3 are indeed entirely different models, but I don’t think that implies the December one was tailored for ARC-AGI.
I’m not saying OpenAI isn’t liable to exaggerate or massage their results for hype, but I don’t think they’ve ever outright cheated (as far as we know, at least). Especially given that getting caught cheating would probably spell the AI industry’s death, and their situation doesn’t yet look so desperate that they’d risk that.
So I currently expect that the results here are, ultimately, legitimate. What I’m very skeptical about are the implications of those results that people are touting.
Worse performance was due to using a worse/less adapted agentic scaffold
Are you sure? I’m pretty sure that was cited as *one* of the possible reasons, but not confirmed anywhere. I don’t know whether minor scaffolding differences could have that large an effect on the results (-15%?) in a math benchmark, but if they can, that should have been accounted for in the first place. I don’t think other models were tested with scaffolds specifically engineered to get them a higher score.
December-2024 o3 and the public o3 are indeed entirely different models, but I don’t think that implies the December one was tailored for ARC-AGI.
As per Arc Prize and what they say OpenAI told them, the December version (“o3-preview”, as Arc Prize named it) ran at a compute tier above that of any publicly released model. Not only that, they say the public version of o3 didn’t undergo any RL for ARC-AGI, “not even on the train set”. That seems suspicious to me, because once you train a model on something, you can’t easily untrain it; yet as per OpenAI, the ARC-AGI train set was “just a tiny fraction of the o3 train set” and, once again, the model used for the evaluations was “fully general”. That leaves three possibilities:

1. o3-preview was trained on the ARC-AGI train set close to the end of the training run, OpenAI was easily able to load an earlier checkpoint to undo that, and then chose not to train on that data again for unknown reasons; or
2. the public version of o3 was retrained from scratch/a very early checkpoint and, again, not trained on the ARC-AGI data for unknown reasons; or
3. o3-preview was somehow specifically tailored towards ARC-AGI.

The last option seems the most likely to me, especially considering the custom compute tier used in the December evaluation.
Well, someone has to be the first, and they got to RLVR itself first, last September.
They have? How so?