Is the gap between open and closed models growing? Evidence from WeirdML
Previous post: Introducing the WeirdML Benchmark
WeirdML is a benchmark that challenges LLMs to solve a set of weird and unusual machine learning tasks, designed to require careful thinking and a real understanding of the data and its properties. We have recently run all the major historical models we could find, going back to GPT-4 in June 2023[1], to study how these difficult coding capabilities have evolved over time.
We also compared the frontier of open-weights models over time with that of closed models. We find that it took more than a year for GPT-4's capability to be matched by an open model, Llama-3-405B. The Claude models Opus-3 (Mar 2024) and Sonnet-3.5 (Jun 2024) made significant advances over GPT-4, but were completely eclipsed by the arrival of o1-preview and the reasoning models (Sept 2024). Four months later, DeepSeek-R1 provided the first really capable open reasoning model. However, it did not actually catch up to the closed reasoning models, at least not on these kinds of tasks. o3 (Apr 2025) advanced the closed frontier further, and an updated DeepSeek-R1 (May 2025) improved but still did not catch up to o1. Since then, several Chinese models (Qwen3, Kimi-K2, and GLM-4.5) have roughly caught up to DeepSeek, but not advanced further. So 11 months on, open models have still not matched the o1 advance, suggesting that the gap between open and closed models is not shrinking over time.
We are extrapolating from very little data here, and any conclusions are tentative. The future R2 from DeepSeek, and how it compares to GPT-5, will probably be the next important data point in this discussion, along with the open questions around the new Meta superintelligence team: its capability as an AI lab and its commitment to open models going forward.
Appendix: Full Model Performance Table
[1] GPT-4 was released before this date, but the first snapshot with a context length sufficient to solve this benchmark is from June 13, 2023, at least among the snapshots still available on the API.
The new gpt-oss-120b may close the gap somewhat.
https://openai.com/index/introducing-gpt-oss/
Actually, it seems to come in around the level of the leading Chinese models, so the gap is not closing much, at least not on these kinds of tasks.
Indeed. It looks like the gpt-oss models were optimized to fit on consumer hardware, which makes it difficult for them to compete with much larger models.