I now think this strategy looks somewhat less compelling due to a recent trend toward smaller (rather than larger) models, particularly from OpenAI, and increased inference-time compute usage creating more of an incentive for small models.
It seems likely that (e.g.) o1-mini is quite small given that it generates at 220 tokens per second(!), perhaps <50 billion active parameters based on eyeballing the chart from the Epoch article I linked earlier. I’d guess (with very low confidence) around 100 billion total parameters (not just active). Something similar likely holds for o3-mini.
The update isn’t as large as it might seem at first, because users don’t (typically) need to be sent the reasoning tokens (or, more generally, the outputs from inference-time compute usage), which substantially reduces uploads. Indeed, the average compute per output token for o1 and o3 (counting the compute spent on reasoning tokens, but counting only tokens sent to the user as output tokens) is probably actually higher than it was for the original GPT-4 release, despite these models potentially being smaller.
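As a rough illustration of why this can hold, here is a back-of-envelope sketch. All of the parameter counts and token ratios below are purely hypothetical assumptions chosen for illustration (they are not figures from this post or from OpenAI); the point is just that a smaller model doing much more hidden reasoning can spend more compute per user-visible token than a larger model that does none.

```python
# Back-of-envelope sketch; every number here is an illustrative assumption.
# We attribute the compute spent on hidden reasoning tokens to the tokens
# the user actually receives.

def flops_per_user_token(active_params, reasoning_tokens, output_tokens):
    # Standard rough estimate: ~2 * N_active FLOPs per token generated.
    total_flops = 2 * active_params * (reasoning_tokens + output_tokens)
    return total_flops / output_tokens

# Hypothetical: a GPT-4-like model with ~280B active params and no hidden
# reasoning, vs. a smaller reasoning model (~50B active) that generates
# 10x as many reasoning tokens as user-visible output tokens.
gpt4_like = flops_per_user_token(280e9, reasoning_tokens=0, output_tokens=1_000)
o1_like = flops_per_user_token(50e9, reasoning_tokens=10_000, output_tokens=1_000)

print(f"GPT-4-like: {gpt4_like:.2e} FLOPs per user-visible token")
print(f"o1-like:    {o1_like:.2e} FLOPs per user-visible token")
# Under these assumptions the smaller reasoning model spends roughly 2x more
# compute per token the user sees, despite far fewer active parameters.
```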