The model seems very, very benchmaxxed. Third-party testing on unconventional or private benchmarks ends up placing even the largest gpt-oss below o4-mini and below the largest Qwen releases, and in a few situations it even lands below the newer ~30B Qwens. It isn’t super capable to begin with, and the frankly absurd rate at which this model hallucinates kills what little use it might have with tool use. I think this model poses next to zero risk because it just isn’t very capable.
Seems plausible, I was just looking at benchmarks/evals at the time and it could be that it’s much less capable than it appears.
(Worth noting that I think (many? most? all?) other competitive open source models are also at least somewhat benchmark maxed, so this alters the comparison a bit, but doesn’t alter the absolute level of capability question.)