Consider both the time gap between a model being finished and released, and that benchmarks aren’t really capturing the whole picture at this capability level anymore.
Anthropic already had Mythos internally in February, whereas 2.6 was likely released a couple of weeks at most after it finished. I think this alone puts the true gap at ~6 months, assuming Kimi catches up to Mythos's benchmark level by about the end of the year, which seems plausible with a “Kimi 3”. The best publicly released model matters, but from a “recursive self improvement USA vs China race” perspective the number that really matters is the best internally available model.
It’s also important to note that there is a massive, noticeable gap between Kimi and Opus when you actually use them in an agentic harness. It’s unfortunate that there are no hard metrics for this, so we have to go purely off vibes, but I’m pretty confident that if you ran a double-blind study with Opus and Kimi in the same coding harness, people would strongly prefer Opus, despite the benchmark scores implying that they should be ~equal.