The model seems very, very benchmaxxed. Third-party testing on unconventional or private benchmarks ends up placing even the largest gpt-oss below o4-mini and below the largest Qwen releases, and in a few situations it even lands below the newer ~30B Qwens. It isn’t super capable to begin with, and the frankly absurd rate at which this model hallucinates kills what little use it might have with tool use. I think this model poses next to zero risk because it just isn’t very capable.
Seems plausible, I was just looking at benchmarks/evals at the time and it could be that it’s much less capable than it appears.
(Worth noting that I think (many? most? all?) other competitive open source models are also at least somewhat benchmark maxed, so this alters the comparison a bit, but doesn’t alter the absolute level of capability question.)