It seems to me that the gap between US and Chinese models is < 2 months (when you don’t count Mythos)
Kimi K2.6 was released in April 2026 while Opus 4.6 was released in February 2026, and according to https://artificialanalysis.ai, Kimi K2.6 scores higher overall (54 > 53). Kimi K2.6 is better on SciCode (54% > 52%) while Opus is better on Terminal-Bench Hard (46% > 44%).
Plus, Kimi is 5x cheaper and has 3x the throughput (though its context window is 4x smaller).
Consider both the time gap between a model being finished and being released, and the fact that benchmarks no longer capture the whole picture at this capability level.
Anthropic already had Mythos internally in February, whereas K2.6 was likely released a couple of weeks at most after it was finished. I think this alone puts the true gap at ~6 months, assuming Kimi catches up to Mythos's benchmark level by about the end of the year, which seems plausible with a “Kimi 3”. The best publicly released model matters, but from a “recursive self improvement USA vs China race” perspective the number that really matters is the best internally available model.
It’s also important to note that there is a massive, noticeable gap when actually using Kimi vs Opus in an agentic harness. It’s unfortunate that there are no hard metrics for this, so we have to go purely off vibes, but I’m pretty confident that if you ran a double-blind study with Opus and Kimi in the same coding harness, people would strongly prefer Opus despite the benchmark scores implying that they should be ~equal.