Josh You comments on AI Task Length Horizons in Offensive Cybersecurity

Josh You 4 Jul 2025 2:00 UTC
3 points
o4-mini performing best isn’t surprising: it leads or ties o3 and other larger models on most STEM-focused benchmarks. Similarly, o1-mini was better at math benchmarks than o1-preview, and Grok-3 mini generally gets better scores than Grok-3. The general explanation is that RL training and experiments are cheaper and faster on mini models.