Open-source Chinese models can match the METR time-horizon performance of Western SOTA models by simply running twice, while still being cheaper.
I did some quick research while waiting for my flight at SFO, using METR’s TH 1.0 dataset (unfortunately, TH 1.1 does not include any Chinese models).
The math behind this is simple: treat the success/failure of a task as an independent event. For a given task, with per-run success rates p for a Chinese LLM and p_sota for a SOTA LLM, the number of tries n the Chinese LLM needs is n = ⌈ln(1 − p_sota) / ln(1 − p)⌉.
Equivalently, after n tries the overall success rate is r = 1 − (1 − p)^n.
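The arithmetic above can be sketched in a few lines (function names are mine, not METR’s):

```python
import math

def tries_needed(p: float, p_sota: float) -> int:
    """Minimum independent tries n so that 1 - (1-p)^n >= p_sota.

    Assumes each run succeeds independently with probability p.
    """
    return math.ceil(math.log(1 - p_sota) / math.log(1 - p))

def best_of_n_rate(p: float, n: int) -> float:
    """Overall success rate after n independent tries: r = 1 - (1-p)^n."""
    return 1 - (1 - p) ** n

# e.g. a model with a 40% per-run success rate needs 2 tries to clear 50%:
# tries_needed(0.4, 0.5) == 2, and best_of_n_rate(0.4, 2) == 0.64
```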
I picked these ‘rival pairs’ of Chinese and then-SOTA US LLMs:
DeepSeek R1-0528 (May 28) vs o3 (Apr 16, −42d), Claude 4 Opus (May 22, −6d)
Kimi K2-thinking (Nov 6) vs GPT-5.1-Codex-Max (Nov 19, +13d), Claude Opus 4.5 (Nov 24, +18d)
Two ‘bridge models’, GPT-5 and Claude 4.1 Opus (both released in August), connect the two groups.
Since METR has already provided time horizons for p_sota = 50% and p_sota = 80%, the p for Chinese models can be derived by fitting the curves.
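Concretely, METR models success probability as a logistic curve in log task duration, so the two published horizons (h50 and h80, with h80 < h50) pin down the whole curve. A sketch under that assumption (the function and variable names are mine):

```python
import math

def success_rate(t: float, h50: float, h80: float) -> float:
    """Implied per-run success rate at task length t (same time unit as
    the horizons), assuming p(t) = sigmoid(b * ln(h50 / t)): a logistic
    curve in log task duration anchored at the 50% and 80% horizons."""
    logit80 = math.log(0.8 / 0.2)        # logit of 0.8, ≈ 1.386
    b = logit80 / math.log(h50 / h80)    # slope fixed by the two anchors
    x = b * math.log(h50 / t)
    return 1 / (1 + math.exp(-x))

# Sanity check: the fitted curve passes through both anchors, e.g. with
# h50 = 60 min and h80 = 30 min:
# success_rate(60, 60, 30) == 0.5 and success_rate(30, 60, 30) == 0.8
```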
For most rival pairs (the exception being Kimi K2-thinking vs. Opus 4.5 at p50), only 2 independent runs are required to match US SOTA model performance.
METR provides token usage for each task. Multiplying the token price by the average token usage for tasks within a certain time horizon gives an average cost estimate.
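That cost comparison is just a product, with n runs of the cheaper model stacked against one run of the SOTA model. A minimal sketch (the prices and token counts below are made-up placeholders, not METR’s numbers):

```python
def cost_to_match(price_per_mtok: float, avg_tokens: float, n: int) -> float:
    """Approximate cost in dollars of n independent runs.

    price_per_mtok: blended price per million tokens;
    avg_tokens: average tokens consumed per run on tasks in the horizon.
    """
    return n * price_per_mtok * avg_tokens / 1_000_000

# Hypothetical comparison: a cheap model run twice vs. a SOTA model run once.
cheap = cost_to_match(price_per_mtok=0.6, avg_tokens=2_000_000, n=2)   # $2.40
sota = cost_to_match(price_per_mtok=15.0, avg_tokens=2_000_000, n=1)   # $30.00
```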
Given that Chinese models are significantly cheaper than OpenAI’s or Anthropic’s, this chart is not surprising:
Please keep in mind that this evaluation is naive, since: a) we only have two Chinese models; b) best-of-n is only valid in easy-to-verify domains, like METR’s software-engineering evaluations; and c) METR benchmark tasks are not production workloads.
I did this research because I saw someone on X claiming (sorry, I can’t find the source right now) that Chinese models are bs since they are less intelligent. Yes, but in easy-to-verify domains, one can simply run a ‘swarm’ of Chinese models in parallel and get things done eventually. Just like how Shein products flood the US market!
I’d find this more persuasive if you actually ran the evals, rather than just fit the curves to your model.
Naively, I’d be surprised if “just run it twice” works for most tasks. Your math assumes structural independence of task completions, which would be surprising if true (and would honestly invalidate METR results much more broadly than you’re implying, since it’d also imply that Western models are much more capable of solving tasks with longer time horizons than is currently implied).
(And separately, I’m pretty sure the difference between best-of-one and best-of-n performance is not nearly as large as you’re implying on other tasks. Though I’m not familiar with the literature on METR tasks specifically, the prior is strongly against.)
Thanks for pointing that out! I re-evaluated the data, and my methodology is flawed.
METR provides task-level success rates. They run a model several times (typically 4–8) on each task. Here is the per-task success-rate graph for tasks with >1h human completion time, for Opus 4.5, GPT-5.1-Codex-Max, and Kimi K2-thinking (made with Sonnet 4.6):
The X-axis is ordered by human completion time (time horizon). Beyond a certain point, Kimi almost always fails, yet SOTA models can still succeed (the red blocks in the heat map). For these tasks, running n times would not help much, since they are likely beyond the model’s capabilities.
However, for tasks that both models can complete (the greenish blocks), I still lean toward assuming structural independence of runs (e.g., one could run a task several times with a clean context window each time). My mental model for these tasks is like the slot-machine meme: best-of-n sampling will improve the overall success rate on these tasks, IMO.
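One way to see why the heat map matters: even with fully independent runs, best-of-n buys much less when some tasks are near-impossible for the model, because retries mostly re-fail the same hard tasks. A toy calculation (the two task populations below are illustrative, not METR data):

```python
def best_of_n(task_rates: list[float], n: int) -> float:
    """Expected benchmark-wide success rate under best-of-n with fully
    independent runs: average over tasks of 1 - (1 - p_i)^n."""
    return sum(1 - (1 - p) ** n for p in task_rates) / len(task_rates)

# Two task populations with the same mean per-run success rate (0.5):
uniform = [0.5] * 100              # every task moderately hard
bimodal = [0.9] * 50 + [0.1] * 50  # easy tasks plus near-impossible ones

# Retrying helps much more when difficulty is homogeneous:
# best_of_n(uniform, 2) == 0.75, but best_of_n(bimodal, 2) == 0.59
```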