Thanks. One thing that confuses me: if this is true, why do mini reasoning models often seem to outperform their full counterparts at certain tasks?
E.g. Grok 3 beta mini (think) performed roughly the same as or better than Grok 3 beta (think) overall on benchmarks[1]. And I remember a similar thing with OAI's reasoning models.

[1] https://x.ai/blog/grok-3
Full Grok 3 only had a month for post-training, and keeping responses on general topics reasonable is a fiddly semi-manual process. They didn’t necessarily have the R1-Zero idea either, which might make long reasoning easier to scale automatically (as long as you have enough verifiable tasks, which is the thing that plausibly fails to scale very far).
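For concreteness, here's a minimal sketch (my own illustration, not anything from xAI or DeepSeek) of what a "verifiable task" reward looks like, and why R1-Zero-style RL on such tasks is easy to automate: the grader is just an exact check against a known answer, with no human rater in the loop. The \boxed{} answer convention and the function name are assumptions for illustration.

```python
import re

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """Reward 1.0 iff the model's final \\boxed{...} answer matches the known answer.

    This is the entire 'grader' -- no human judgment involved, which is what
    makes RL on long reasoning traces cheap to scale automatically.
    """
    match = re.search(r"\\boxed\{(.+?)\}", model_output)
    answer = match.group(1).strip() if match else ""
    return 1.0 if answer == ground_truth.strip() else 0.0

# An RL loop would sample long reasoning traces and reinforce the ones scoring 1.0.
# The catch, as above: you need a checker like this for every task, and that is the
# part that plausibly doesn't scale far beyond math and code.
```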
Also, running long reasoning traces for a big model is more expensive and takes longer, so the default settings will tend to give smaller reasoning models more tokens to reason with, skewing the comparison.
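To make the skew concrete, a toy calculation with made-up relative prices (the numbers are illustrative assumptions, not actual Grok pricing): at the same per-question spend, the mini model simply gets more tokens to think with.

```python
# Toy numbers only: suppose the big model costs 5x as much per token as the mini.
BIG_COST_PER_TOKEN = 5.0   # hypothetical relative cost
MINI_COST_PER_TOKEN = 1.0  # hypothetical relative cost
EVAL_BUDGET = 10_000.0     # same per-question budget for both models

big_tokens = EVAL_BUDGET / BIG_COST_PER_TOKEN    # 2,000 reasoning tokens
mini_tokens = EVAL_BUDGET / MINI_COST_PER_TOKEN  # 10,000 reasoning tokens
print(f"big: {big_tokens:.0f} tokens, mini: {mini_tokens:.0f} tokens")
```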
One non-technical forecast, related to GPT-4.5's announcement: https://x.com/davey_morse/status/1895563170405646458