I ran the horizon length graph with pass@8 instead, and the increase between GPT-4 and o1 seems slightly smaller (though this could be noise), and Claude 3.7 does worse than o1. This suggests the doubling rate for pass@8 may be slightly slower than for pass@1. However, if the horizon length increase since 2023 were only due to RL, the improvement in pass@8 would be barely half the improvement in pass@1. The observed pass@8 improvement is faster than that, which could be due to some combination of the following:
o1 has a better base model than GPT-4
HCAST is an example of “emergent capabilities on long-horizon tasks”, unlike their non-agentic tasks
RL helps more on HCAST skills than math or non-agentic coding
RL helps more on larger models
Some kind of nonlinearity in the relevant metrics
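For concreteness on what the two metrics measure: pass@1 is the success rate of a single attempt, while pass@8 asks whether at least one of 8 attempts succeeds. A standard way to estimate pass@k from n sampled attempts (of which c succeeded) is the unbiased estimator from the HumanEval paper; I don't know the exact computation used for the graph above, so this is just a sketch of the conventional metric:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k.

    n: total attempts sampled per task
    c: number of those attempts that succeeded
    k: budget of attempts we are scoring (e.g. 1 or 8)

    Returns the probability that at least one of k attempts drawn
    without replacement from the n samples is a success:
        1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:
        # Fewer than k failures exist, so any k-subset contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 attempts, 5 successes.
print(pass_at_k(10, 5, 1))  # pass@1 = 0.5
print(pass_at_k(10, 5, 8))  # pass@8 = 1.0 (can't pick 8 attempts avoiding all 5 successes)
```

Note that pass@8 saturates much earlier than pass@1, which is one reason improvements on the two metrics need not move in lockstep.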
There are various problems with this graph (e.g., to follow the same methodology we should filter for models that are frontier on pass@8, not frontier on pass@1), but this was meant to be a quick and dirty check.