Full Grok 3 had only a month for post-training, and keeping responses on general topics reasonable is a fiddly, semi-manual process. They didn’t necessarily have the R1-Zero idea either, which might make long reasoning easier to scale automatically (as long as you have enough verifiable tasks, which is the thing that plausibly fails to scale very far).
Also, running long reasoning traces for a big model is more expensive and takes longer, so the default settings will tend to give smaller reasoning models more tokens to reason with, skewing the comparison.