Task duration as a Bradley–Terry score: an alternative to the constant hazard rate model
@Toby_Ord writes about the constant hazard rate model for task duration: a long task can be thought of as a sequence of many short subtasks of fixed difficulty, each of which must be completed to complete the overall task. This explains the approximately sigmoidal relationship between log(task horizon length) and the probability that a given model successfully completes the overall task.
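To write this down minimally (my gloss on Toby’s model): if a task of duration $T$ consists of $n \propto T$ subtasks, each of which succeeds independently with the same fixed probability $p$, then
$$P(\text{success}) = p^{\,n} = e^{-\lambda T}, \qquad \lambda \propto -\ln p,$$
which, plotted against $\log T$, gives a curve that is approximately (though not exactly) sigmoidal.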
I think this is a useful conceptual framing that explains the data reasonably well. But there is at least one alternative that explains the data about as well, which is to think of the task duration as being similar to a Bradley–Terry score, i.e., an exponential of an Elo rating.
The underlying intuition is that, in addition to having a larger number of subtasks, a longer task also has a higher probability of having a particularly hard subtask. We can crudely approximate the difficulty of a long task by the difficulty of its hardest subtask.
Concretely, consider any fixed random number distribution (e.g. uniform over $[0,1]$), representing the difficulty of a subtask. Assign to each task $T$ a positive integer $n_T$, and to each model $M$ a positive integer $n_M$. To decide whether $M$ successfully completes $T$, we draw $n_T$ random numbers from our distribution for the task, and $n_M$ random numbers for the model. We then say that the task is completed if the model’s largest number exceeds the task’s largest number. Thus the probability of completion is
$$\frac{n_M}{n_M+n_T}=\sigma(\log n_M-\log n_T),$$
where $\sigma(x)=\frac{1}{1+e^{-x}}$ is the sigmoid function. This explains the sigmoidal relationship observed in Figure 5 of METR’s paper.
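For concreteness, here is a minimal Monte Carlo sketch of this setup (my own illustration, not code from METR’s analysis), checking that the empirical success rate matches $\sigma(\log n_M - \log n_T)$:

```python
# Toy model: a task draws n_T uniform random numbers, a model draws n_M;
# the model succeeds iff its largest draw beats the task's largest draw.
import math
import random

def success_probability(n_M: int, n_T: int, trials: int = 100_000) -> float:
    wins = 0
    for _ in range(trials):
        model_best = max(random.random() for _ in range(n_M))
        task_best = max(random.random() for _ in range(n_T))
        wins += model_best > task_best
    return wins / trials

def sigmoid(x: float) -> float:
    return 1 / (1 + math.exp(-x))

for n_M, n_T in [(10, 10), (20, 10), (10, 40)]:
    empirical = success_probability(n_M, n_T)
    predicted = sigmoid(math.log(n_M) - math.log(n_T))  # = n_M / (n_M + n_T)
    print(f"n_M={n_M:2d} n_T={n_T:2d}  empirical={empirical:.3f}  predicted={predicted:.3f}")
```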
Toby’s model produces an exponential relationship, which is similar to, but slightly different from, a sigmoid on a log scale. He argues that his relationship is preferred because it has only one free parameter instead of two. However, our model allows us to determine what one of the parameters of the sigmoidal relationship should be, by assuming that $n_T$ is proportional to the task duration. This predicts that the (negated) coefficient of the sigmoidal relationship should be around 1, assuming the natural log is applied to the task duration. At the very least, for a fixed task distribution, the coefficients should be similar for different models.
We can test this prediction by running the code used to produce Figure 5 to get the coefficient and intercept of the logistic regression.[1] Since the code applies a base-2 logarithm to the task duration, we can negate and divide the coefficient by the natural log of 2 to get the appropriately-scaled coefficient for our purposes:
| Agent | Coefficient | Intercept | Coefficient / (-ln 2) |
|---|---|---|---|
| Claude 3 Opus | -0.55 | 1.48 | 0.80 |
| Claude 3.5 Sonnet (New) | -0.52 | 2.55 | 0.76 |
| Claude 3.5 Sonnet (Old) | -0.55 | 2.31 | 0.80 |
| Claude 3.7 Sonnet | -0.70 | 4.13 | 1.01 |
| GPT-2 | -0.49 | -2.29 | 0.71 |
| GPT-4 0125 | -0.64 | 1.55 | 0.92 |
| GPT-4 0314 | -0.56 | 1.36 | 0.81 |
| GPT-4 1106 | -0.54 | 1.68 | 0.78 |
| GPT-4 Turbo | -0.66 | 1.79 | 0.95 |
| GPT-4o | -0.57 | 1.82 | 0.82 |
| davinci-002 (GPT-3) | -0.65 | -1.79 | 0.94 |
| gpt-3.5-turbo-instruct | -0.78 | -0.56 | 1.13 |
| human | -0.39 | 2.55 | 0.56 |
| o1 | -0.51 | 2.70 | 0.74 |
| o1-preview | -0.61 | 2.73 | 0.88 |
Even as the intercept varies considerably, the coefficient (divided by -ln 2) is relatively consistent and generally close to 1. It tends to be a little lower than 1, which is what you would expect if task duration measurements were noisy approximations to the value of $n_T$, since this would flatten the slope of the sigmoid.
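As a sanity check on this prediction, here is a small simulation sketch (my own, with an arbitrary illustrative choice of $n_M=60$ and of the duration range, and assuming scikit-learn is available): generate tasks from the toy model with $n_T$ proportional to task duration, then fit the same kind of logistic regression on $\log_2$(duration). The fitted coefficient should come out near $-\ln(2)\approx-0.69$, i.e. near 1 after dividing by $-\ln 2$:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

n_M = 60                                  # assumed model "ability" (illustrative choice)
log2_T = rng.uniform(1, 9, size=5_000)    # log2(task duration in minutes): ~2 min to ~8.5 h
n_T = 2.0 ** log2_T                       # assume n_T is proportional to task duration
success = rng.random(log2_T.size) < n_M / (n_M + n_T)

fit = LogisticRegression(C=1e6).fit(log2_T.reshape(-1, 1), success)  # effectively unregularized
print("raw coefficient:      ", fit.coef_[0, 0])                # should be near -ln(2) = -0.693
print("coefficient / (-ln 2):", fit.coef_[0, 0] / -np.log(2))   # should be near 1
```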
In reality, neither the constant hazard rate model nor the Bradley–Terry model is perfect. The constant hazard rate model fails to account for the fact that models can recover from small errors, while the Bradley–Terry model fails to account for the fact that models can fail because of subtasks that are easier than the hardest subtask.
The Bradley–Terry model has the advantage that it specifically explains why we might expect the relationship to be sigmoidal rather than approximately sigmoidal, and shows why we may need an extra free parameter to account for noisy measurements of task duration. It is also more analogous to the scaling behavior of reinforcement learning observed previously in other settings, such as in Hex and Dota 2, where TrueSkill/Elo rating scales as a power law in compute. See in particular the toy model described in the Hex paper, which inspired the description I gave here:
The way in which performance scales with compute is that an agent with twice as much compute as its opponent can win roughly 2⁄3 of the time. This behaviour is strikingly similar to that of a toy model where each player chooses as many random numbers as they have compute, and the player with the highest number wins.
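In the notation of this post, that toy model is just the $\frac{n_M}{n_M+n_T}$ rule with a 2:1 ratio of draws (writing $c$ for the opponent’s compute):
$$P(\text{win}) = \frac{2c}{2c+c} = \frac{2}{3} = \sigma(\ln 2).$$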
The ideal model would probably combine both aspects – that longer tasks have both more subtasks and harder subtasks. But this would have the downside of introducing more free parameters, and the data is likely to be too noisy to fit these in the near future. Overall, sticking to fitting two-parameter sigmoids is probably the way to go for now.
[1] After installing the eval-analysis-public repo, I obtained these numbers by running the following command:
mkdir data/wrangled/logistic_fits; python -m src.wrangle.logistic --fig-name headline --runs-file data/external/all_runs.jsonl --output-logistic-fits-file data/wrangled/logistic_fits/headline.csv --release-dates data/external/release_dates.yaml --bootstrap-file data/wrangled/bootstrap/headline.csv
Thanks to Nate Rush for help with this.
Thanks for this Jacob — excellent analysis.
I’m a huge fan of Bradley-Terry models. I’m quite sure they are the natural way of representing noisy contests like chess ability and that Elo is an inferior way. The key thing with Bradley-Terry is that each competitor has a raw ability score (e.g. A and B), and when they have a contest the odds of A beating B are just A:B. I think of it as each player puts a number of tickets of their colour into a hat and then one is drawn at random determining the winner. This is an even simpler interpretation than the one from the Hex paper and makes the 2⁄3 result even more intuitive.
Elo then takes the log base 10, multiplies by 400, and adds 1200 or so to make the numbers usually positive, injecting three (!) arbitrary constants into the mix in order to give an additive scale that matched the pre-Elo chess rating scale. But the natural interpretation of these contests is a multiplicative scale (the ratio of the numbers is the odds ratio of winning), so it should have been left alone. Linear progress in Elo is really exponential progress in the raw quantity.
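In symbols (my restatement of the above, with $A$ and $B$ the raw ability scores; the roughly +1200 offset cancels in the difference):
$$P(A\ \text{beats}\ B) = \frac{A}{A+B}, \qquad \mathrm{Elo}_A - \mathrm{Elo}_B = 400\,\log_{10}\frac{A}{B}.$$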
I like your idea of assuming random difficulties for the different tasks (from some distribution that could be tweaked), as clearly that is part of the real underlying phenomenon. However, it is weird that you compare the highest number the agent draws from the hat to the highest number the task draws. More natural would be to have to take on the subtasks one by one in a gauntlet of challenges of varying difficulty, e.g. so that the probability of success is $\prod_1^n p_i$ instead of my $\prod_1^n p$, where $p_i$ is a random variable drawn from some natural distribution over $[0,1]$ that is modified by the agent’s skill and represents the probability of succeeding at that subtask. There should be limiting cases where all $p_i$ are equal (my case) and where it is driven by the hardest one. But I’m not sure what distribution this creates.
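One way to get a feel for what distribution this creates is to simulate it under an assumed form for the $p_i$; the choice $p_i = u^{1/s}$ with $u$ uniform and $s$ a skill parameter is purely illustrative:

```python
# "Gauntlet" model sketch: the success probability of a task is the product of
# per-subtask probabilities p_i. The form p_i = u**(1/skill), u ~ Uniform(0,1),
# is an assumption made only for illustration; higher skill pushes p_i towards 1.
import numpy as np

rng = np.random.default_rng(0)

def gauntlet_success_rate(n_subtasks: int, skill: float, trials: int = 100_000) -> float:
    u = rng.random((trials, n_subtasks))
    p = u ** (1.0 / skill)
    return float(np.prod(p, axis=1).mean())   # average of prod_i p_i over random gauntlets

for n in (1, 2, 4, 8, 16, 32):
    print(f"n = {n:2d}  mean success rate = {gauntlet_success_rate(n, skill=10.0):.3f}")
```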
That said, I like where you are going with this and how you eliminate one of the parameters.
I definitely see my constant hazard rate model as a first order approximation to what is going on, and not the full story. I’m surprised it works as well as it does because the underlying phenomenon has more structure than this. So I see it just as something of a null hypothesis for other approaches to beat, and do expect it to eventually be beaten.
How does this coefficient relate to the maximal slope (i.e. the slope at the x where the curve crosses 50%)?
The gradient of $\sigma(-\beta x)$ at $x=0$ is $-\frac{1}{4}\beta$, which corresponds to a maximally negative slope of $-\frac{\ln 2}{4}\beta \approx -17\%\times\beta$ per doubling, where $\beta\approx 1$ is the rightmost column in my table.
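Spelling out the steps: the sigmoid satisfies $\sigma'(x)=\sigma(x)(1-\sigma(x))$ and $\sigma(0)=\tfrac12$, so
$$\frac{d}{dx}\,\sigma(-\beta x)\Big|_{x=0} = -\beta\,\sigma(0)\bigl(1-\sigma(0)\bigr) = -\frac{\beta}{4},$$
and one doubling corresponds to $\Delta x = \ln 2$ in natural-log units, giving the $-\frac{\ln 2}{4}\beta$ per-doubling figure.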
Thanks!
In Figure 5 the x-axis is log time horizon and not time horizon—does this fit with your model?
Yes, unless I messed up, METR’s code runs a logistic regression of log2(task duration) against success probability, so my model predicts a raw fitted coefficient (the second column in the table) close to -ln(2) ≈ −0.69.
Oh right, sorry, I missed the derivation that among $n_M+n_T$ samples, the maximum is equally likely to be any of them, and so the probability that the model’s largest number is the largest of them all is
$$\frac{n_M}{n_M+n_T}=\frac{1}{1+n_T/n_M}=\frac{1}{1+\exp(-(\log n_M-\log n_T))}.$$
This model then predicts that models’ “Elo ratings”, i.e. $\log n_M$, would grow linearly over time, which (based on this chart GPT-5 gave me) I think corresponds roughly with the progress in chess from 2007 onwards.
It also makes the quantitative prediction that a doubling in compute (or compute efficiency) leads to a 2⁄3 win probability, or around 120 Elo points. (Credit to the Hex paper for this observation.) Under 18-month doublings (per one version of Moore’s law), this would be around 800 Elo points per decade, which looks like a bit of an overestimate but similar to the fastest observed rate of progress.
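The arithmetic behind those figures, using the Elo mapping above:
$$400\log_{10}2 \approx 120\ \text{Elo per doubling}, \qquad \frac{10\ \text{years}}{1.5\ \text{years per doubling}}\times 120 \approx 800\ \text{Elo per decade}.$$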