o4-mini performing best isn’t surprising, it leads or ties o3 and other larger models in most STEM-focused benchmarks. Similarly, o1-mini was better at math benchmarks than o1-preview, and Grok-3 mini generally gets better scores than Grok-3. The general explanation here is that RL training and experiments are cheaper and faster on mini models.
o4-mini performing best isn’t surprising, it leads or ties o3 and other larger models in most STEM-focused benchmarks. Similarly, o1-mini was better at math benchmarks than o1-preview, and Grok-3 mini generally gets better scores than Grok-3. The general explanation here is that RL training and experiments are cheaper and faster on mini models.