Seems plausible. I was just looking at benchmarks/evals at the time, and it could be that it’s much less capable than it appears.
(Worth noting that I think (many? most? all?) other competitive open source models are also at least somewhat benchmark-maxed, so this alters the comparison a bit, but doesn’t alter the question of the absolute level of capability.)