They have? How so?
Silently sponsoring FrontierMath and receiving access to the question sets, and, if I remember correctly, o3 and o3-mini performing worse on a later evaluation done on a newer private question set of some sort. Also whatever happened with their irreproducible ARC-AGI results and them later explicitly confirming that the model that Arc Prize got access to in December was different from the released versions, with different training and a special compute tier, despite OpenAI employees claiming that the version of o3 used in the evaluations was fully general and not tailored towards specific tasks.
someone has to be the first
Sure, but I’m just quite skeptical that it’s specifically the lab known for endless hype that does it. Besides, a lot fewer people were looking into RLVR at the time o1-preview was released, so the situations aren’t exactly comparable.
Are you sure? I’m pretty sure that was cited as *one* of the possible reasons, but not confirmed anywhere. I don’t know if some minor scaffolding differences could have that much of an effect on the results (-15%?) in a math benchmark, but if they did, that should have been accounted for in the first place. I don’t think other models were tested with scaffolds specifically engineered to get them a higher score.
As per Arc Prize and what they said OpenAI told them, the December version (“o3-preview”, as Arc Prize named it) had a compute tier above that of any publicly released model. Not only that, they say that the public version of o3 didn’t undergo any RL for ARC-AGI, “not even on the train set”. That seems suspicious to me, because once you train a model on something, you can’t easily untrain it; as per OpenAI, the ARC-AGI train set was “just a tiny fraction of the o3 train set” and, once again, the model used for evaluations is “fully general”. This means that either o3-preview was trained on the ARC-AGI train set somewhere close to the end of the training run and OpenAI was easily able to load an earlier checkpoint to undo that, then chose not to train it on that data again for unknown reasons, OR that the public version of o3 was retrained from scratch or from a very early checkpoint and then, again, not trained on the ARC-AGI data for unknown reasons, OR that o3-preview was somehow specifically tailored towards ARC-AGI. The last option seems the most likely to me, especially considering the custom compute tier used in the December evaluation.