Thanks for pointing this out! I re-evaluated the data, and my methodology is flawed.
METR provides task-level success rates. They run a model several times (typically 4-8) on each task. Here is the per-task success-rate graph for tasks with >1h human completion time, using Opus 4.5, codex-5.1-max, and kimi-k2 (made by sonnet 4.6).
The X-axis is ordered by human completion time (time horizon). Beyond a certain point, kimi almost always fails, yet SOTA models could still succeed (the red blocks in the heat map). For these tasks, running n times would not help much, since they are likely beyond the model's capabilities.
However, for tasks that both models can complete (the green-ish blocks), I still lean toward assuming structural independence of runs (e.g. one could run a task several times with a clean context window each time). My mental model for these tasks is like that slot machine meme: best-of-n sampling will improve the overall success rate on these tasks, IMO.
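Under that independence assumption, the expected lift from best-of-n is easy to compute: if a single run succeeds with probability p, at least one of n independent runs succeeds with probability 1 - (1 - p)^n. A quick sketch (the 0.4 per-run rate is just an illustrative number, not taken from the METR data):

```python
def best_of_n(p: float, n: int) -> float:
    """Probability that at least one of n independent runs succeeds,
    assuming each run succeeds independently with probability p."""
    return 1 - (1 - p) ** n

# Illustrative: a task the model solves ~40% of the time per run.
for n in (1, 2, 4, 8):
    print(f"n={n}: {best_of_n(0.4, n):.3f}")
```

The gains flatten quickly as n grows, and of course they vanish entirely for the tasks where p is effectively zero, which is the point above about tasks beyond a model's capabilities.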