Yes, the fact that we have concrete metrics for progress is part of why we are excited about this line of work. We suspect that if we were given algorithms that achieve very low MSE for a given FLOP budget, we’d be able to extract useful insights from them, although it remains to be seen how well this will pan out in practice. We’re planning to launch a contest soon to test out this idea, with LLM usage encouraged. (It will be slightly less prescriptive than the setup you suggested: code just has to take in network weights and produce expectations.)
Jacob_Hilton
Announcing the ARC White-Box Estimation Challenge
Mechanistic estimation for expectations of random products
Indeed, tensor network diagrams show up in our algorithm (see Appendix A of the paper). We’ve also been thinking about mechanistic estimation for tensor network contractions as a problem in their own right, partly because they appear to be needed for harder MLP cases.
Mechanistic estimation for wide random MLPs
I agree that there are qualitative similarities, so perhaps we should be quantitative about it. Assuming for the sake of argument that the DoW were acting in bad faith and planned to use OpenAI’s services to conduct domestic mass surveillance (legally), how likely do you think it is that OpenAI would be able to prevent this? Given the difficulties I mentioned (indistinguishable from innocuous use, problematic only in aggregate, novel setting, classified, ZDR, no meaningful contractual recourse), it would seem like a big stretch to reach ~50% confidence in my opinion, even with considerable effort on OpenAI’s part.
Perhaps you think it’s unlikely that the DoW is acting in bad faith, but if so, it’s good to be clear about whether this is a load-bearing assumption.
FWIW, I think jailbreaking is less of a concern than mass surveillance activity being simply indistinguishable from innocuous use, since without surrounding context it could look like ordinary data analysis. Perhaps it could be detected from large-scale patterns of usage, but this would be quite different from settings like bio/cyber, and it seems rough for OpenAI’s first real-world attempt at this to be in a classified ZDR setting, with no meaningful contractual recourse if detection or targeted blocking turns out to be harder than you predict.
I am sympathetic to the case that it could still be worth taking the contract to support the government’s use of AI (modulo not pushing back more on the SCR designation before doing so), but I don’t agree with the presentation of the technical challenge as familiar territory.
Can you turn this argument into a mechanistic estimate of the model’s accuracy? (You’d need to do things like deduce correlations from the weights, rather than just observe them empirically—but it seems like you’re getting close.)
Good start!
AlgZoo: uninterpreted models with fewer than 1,500 parameters
ARC progress update: Competing with sampling
Nice observation, and I agree with your calculation that linear episode length growth would account for a worse scaling exponent by a factor of 2 (or more generally, episode length growing with exponent k would account for a worse scaling exponent by a factor of k+1).
Note also that this suggests a potential remedy, namely controlling episode length, but there is less incentive to apply this when data is more of a constraint than compute.
Thanks for this insightful analysis!
But it fits with the extreme information inefficiency of RL training, which (compared to next-token-prediction) receives less than a ten-thousandth as much information to learn from per FLOP of training compute.
If I am interpreting this correctly, there is a subtle mathematical error here: if RL requires a constant factor of 10,000 more compute than pretraining, this only shifts the graph of performance against log(compute), it doesn’t change its slope. For RL to have a shallower slope, the information efficiency would have to decrease more quickly over the course of training for RL than for pretraining.
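As a minimal sketch of the shift-versus-slope point (the log-linear curve, the function names and the FLOP values are illustrative assumptions, not fitted to any real data), a constant compute penalty moves the curve sideways on a log axis but leaves the slope per decade of compute unchanged:

```python
import math

def performance(compute, slope=1.0, offset=0.0):
    """Hypothetical scaling curve: performance is linear in log10(compute)."""
    return slope * math.log10(compute) + offset

def rl_performance(compute, penalty=10_000):
    """Same curve, but each FLOP is assumed to be worth 1/penalty as much."""
    return performance(compute / penalty)

# Slope per decade of compute, measured between 1e20 and 1e21 FLOP (arbitrary values).
pretrain_slope = performance(1e21) - performance(1e20)
rl_slope = rl_performance(1e21) - rl_performance(1e20)

# Vertical gap between the two curves at a fixed compute budget.
shift = performance(1e21) - rl_performance(1e21)

print(pretrain_slope, rl_slope, shift)
```

Both slopes come out equal, and the gap is just log10 of the penalty. A shallower slope would need the penalty itself to grow with compute.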
I think there are a few potential reasons why information efficiency might decrease more quickly over the course of training for RL than for pretraining, but it is not so clear-cut:
Increased accuracy: you get fewer bits of information from a more biased coin flip than a fairer one, so information efficiency decreases as you approach 100% accuracy. But it’s not clear whether this applies more to pretraining or to RL. Note also that in both cases the effect can potentially be alleviated by a curriculum.
Longer episodes: assuming RL just has a single binary reward at the end of each episode, information density decreases as episodes get longer. Since harder tasks require longer chains of thought, this one seems to clearly count against RL.
Overfitting: if there is a mismatch between the training distribution used for RL and the distribution used to benchmark the model, one might expect the density of information relevant to the benchmark to decrease as the model overfits to the training distribution. I think this one also counts against RL right now, but can be alleviated by improving data quality and quantity.
In particular, I think the fact that overfitting can be mitigated with better data cuts against your empirical observations. Since, as you correctly note, RL compute started from a very small base, it was initially much cheaper to scale up compute than to scale up data. But as RL compute becomes more expensive, it will become comparatively more cost-effective to scale up data. Once spending on both is being scaled up at a similar rate (as is economically inevitable as long as spending continues to increase), we should expect to see some regression towards the pretraining slope in my opinion.
Overall, I think the effect you spotted is real (due to things like episode length), but ultimately won’t turn out to be as extreme as you estimated here. Quantitatively, I would guess that RL will look more like a power of 1.5-2 worse than pretraining rather than a power of 3 worse, and there could be certain training regimes (e.g. fixed episode length) where they are closer than that.
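The "increased accuracy" and "longer episodes" effects listed above can be made concrete with a rough sketch. It assumes a single binary reward per episode and treats the entropy of that reward as an upper bound on the information received; the function names and example numbers are illustrative only.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a coin that comes up heads with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bits_per_token(success_rate, episode_length):
    """Upper bound on reward information per token, given one binary reward per episode."""
    return binary_entropy(success_rate) / episode_length

# Increased accuracy: a 99%-accurate policy yields far less than one bit per episode.
print(binary_entropy(0.5), binary_entropy(0.99))

# Longer episodes: the same reward is spread over more tokens.
print(bits_per_token(0.5, 100), bits_per_token(0.5, 10_000))
```

Under these assumptions, information per token falls both as the success rate moves away from 50% and as episodes lengthen, which is why a curriculum or fixed episode length could help.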
It also makes the quantitative prediction that a doubling in compute (or compute efficiency) leads to a 2⁄3 win probability, or around 120 Elo points. (Credit to the Hex paper for this observation.) Under 18-month doublings (per one version of Moore’s law), this would be around 800 Elo points per decade, which looks like a bit of an overestimate but similar to the fastest observed rate of progress.
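The arithmetic behind these figures can be checked with a short sketch, assuming the standard logistic Elo formula (a 400-point gap corresponds to 10:1 odds) and that Elo gains add up across successive doublings:

```python
import math

def elo_gap(win_probability):
    """Elo difference implied by a win probability under the standard logistic Elo formula."""
    return 400 * math.log10(win_probability / (1 - win_probability))

elo_per_doubling = elo_gap(2 / 3)      # a 2/3 win probability is 2:1 odds
doublings_per_decade = 120 / 18        # 18-month doublings over 120 months
elo_per_decade = elo_per_doubling * doublings_per_decade

print(round(elo_per_doubling), round(elo_per_decade))
```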
The gradient of $\sigma(-\beta x)$ at $x=0$ is $-\beta/4$, which corresponds to a maximally negative slope of $-\beta\ln(2)/4 \approx -0.17\beta$ per doubling, where $\beta$ is the rightmost column in my table.
Yes, unless I messed up, METR’s code runs a logistic regression of $\log_2$(task duration) against success probability, so my model predicts a raw fitted coefficient (the second column in the table) close to -ln(2) ≈ −0.69.
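As a quick numerical check (a sketch only: `beta`, `horizon` and the function names are my own labels, with `beta` the coefficient against the natural log of task duration), the predicted raw coefficient and the steepest slope per doubling can be computed directly:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def success_probability(duration, horizon, beta=1.0):
    """P(success) = sigmoid(-beta * ln(duration / horizon)); horizon is the 50% point."""
    return sigmoid(-beta * math.log(duration / horizon))

beta = 1.0

# Coefficient a logistic regression against log2(duration) would recover.
raw_coefficient = -beta * math.log(2)

# Central finite difference of P(success) with respect to log2(duration) at the 50% point.
eps = 1e-6
slope_per_doubling = (
    success_probability(2 ** eps, 1.0, beta) - success_probability(2 ** -eps, 1.0, beta)
) / (2 * eps)

print(raw_coefficient, slope_per_doubling)
```

With `beta = 1` the steepest slope is about -0.17, i.e. roughly a 17 percentage point drop in success probability per doubling of task duration near the 50% point.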
Task duration as a Bradley–Terry score: an alternative to the constant hazard rate model
@Toby_Ord writes about the constant hazard rate model for task duration: a long task can be thought of as a sequence of many short subtasks of fixed difficulty, each of which must be completed to complete the overall task. This explains the approximately sigmoidal relationship between log(task horizon length) and the probability that a given model successfully completes the overall task.
I think this is a useful conceptual framing that explains the data reasonably well. But there is at least one alternative that explains the data about as well, which is to think of the task duration as being similar to a Bradley–Terry score, i.e., an exponential of an Elo rating.
The underlying intuition is that, in addition to having a larger number of subtasks, a longer task also has a higher probability of having a particularly hard subtask. We can crudely approximate the difficulty of a long task by the difficulty of its hardest subtask.
Concretely, consider any fixed random number distribution (e.g. uniform over [0,1]), representing the difficulty of a subtask. Assign to each task $T$ a positive integer $n_T$, and to each model $M$ a positive integer $n_M$. To decide whether $M$ successfully completes $T$, we draw $n_T$ random numbers from our distribution for the task, and $n_M$ random numbers for the model. We then say that the task is completed if the model’s largest number exceeds the task’s largest number. Thus the probability of completion is
$$\frac{n_M}{n_M + n_T} = \sigma(\log n_M - \log n_T),$$
where $\sigma$ is the sigmoid function. This explains the sigmoidal relationship observed in Figure 5 of METR’s paper.
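The toy model is easy to simulate. The sketch below is illustrative rather than part of the original analysis: it uses `n_model` and `n_task` for the number of draws made by the model and the task, takes the uniform distribution on [0,1], and compares a Monte Carlo estimate with the closed form, which holds because each of the `n_model + n_task` draws is equally likely to be the largest.

```python
import math
import random

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def completion_probability(n_model, n_task):
    """Closed form: chance that the single largest draw belongs to the model."""
    return n_model / (n_model + n_task)

def simulate(n_model, n_task, trials=20_000, seed=0):
    """Monte Carlo estimate of the completion probability with uniform draws."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        model_best = max(rng.random() for _ in range(n_model))
        task_best = max(rng.random() for _ in range(n_task))
        wins += model_best > task_best
    return wins / trials

# A model with twice the task's number of draws should succeed about 2/3 of the time.
print(simulate(8, 4), completion_probability(8, 4))
print(sigmoid(math.log(8) - math.log(4)))
```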
Toby’s model produces an exponential relationship, which is similar but slightly different to a sigmoid on a log scale. He argues that his relationship is preferred because it has only one free parameter instead of two. However, our model allows us to determine what one of the parameters of the sigmoidal relationship should be, by assuming that the integer $n_T$ assigned to a task is proportional to the task duration. This predicts that the (negated) coefficient of the sigmoidal relationship should be around 1, assuming the natural log is applied to the task duration. At the very least, for a fixed task distribution, the coefficients should be similar for different models.
We can test this prediction by running the code used to produce Figure 5 to get the coefficient and intercept of the logistic regression.[1] Since the code applies a base-2 logarithm to the task duration, we can negate and divide the coefficient by the natural log of 2 to get the appropriately-scaled coefficient for our purposes:
| Agent | Coefficient | Intercept | Coefficient / (-log(2)) |
| --- | --- | --- | --- |
| Claude 3 Opus | -0.55 | 1.48 | 0.80 |
| Claude 3.5 Sonnet (New) | -0.52 | 2.55 | 0.76 |
| Claude 3.5 Sonnet (Old) | -0.55 | 2.31 | 0.80 |
| Claude 3.7 Sonnet | -0.70 | 4.13 | 1.01 |
| GPT-2 | -0.49 | -2.29 | 0.71 |
| GPT-4 0125 | -0.64 | 1.55 | 0.92 |
| GPT-4 0314 | -0.56 | 1.36 | 0.81 |
| GPT-4 1106 | -0.54 | 1.68 | 0.78 |
| GPT-4 Turbo | -0.66 | 1.79 | 0.95 |
| GPT-4o | -0.57 | 1.82 | 0.82 |
| davinci-002 (GPT-3) | -0.65 | -1.79 | 0.94 |
| gpt-3.5-turbo-instruct | -0.78 | -0.56 | 1.13 |
| human | -0.39 | 2.55 | 0.56 |
| o1 | -0.51 | 2.70 | 0.74 |
| o1-preview | -0.61 | 2.73 | 0.88 |

Even as the intercept varies considerably, the coefficient (divided by -log(2)) is relatively consistent and generally close to 1. It tends to be a little lower than 1, which is what you would expect if task duration measurements were noisy approximations to the value of the task’s $n_T$, since this would flatten the slope of the sigmoid.
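The rescaling in the last column can be reproduced from the raw coefficients with a short sketch. The dictionary below copies a few of the rounded values from the table, so the results can differ from the listed ones by about 0.01; those were presumably computed from unrounded fits.

```python
import math

# Raw coefficients against log2(task duration), copied from the table (rounded to 2 d.p.).
raw_coefficients = {
    "Claude 3.7 Sonnet": -0.70,
    "GPT-4o": -0.57,
    "gpt-3.5-turbo-instruct": -0.78,
    "human": -0.39,
}

def rescale(coefficient_log2):
    """Convert a coefficient against log2(duration) into a negated coefficient against ln(duration)."""
    return coefficient_log2 / -math.log(2)

for agent, coefficient in raw_coefficients.items():
    print(f"{agent}: {rescale(coefficient):.2f}")
```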
In reality, neither the constant hazard rate model nor the Bradley–Terry model is perfect. The constant hazard rate model fails to account for the fact that models can recover from small errors, while the Bradley–Terry model fails to account for the fact that models can fail because of subtasks that are easier than the hardest subtask.
The Bradley–Terry model has the advantage that it specifically explains why we might expect the relationship to be sigmoidal rather than approximately sigmoidal, and shows why we may need an extra free parameter to account for noisy measurements of task duration. It is also more analogous to the scaling behavior of reinforcement learning observed previously in other settings, such as in Hex and Dota 2, where TrueSkill/Elo rating scales as a power law in compute. See in particular the toy model described in the Hex paper, which inspired the description I gave here:
The way in which performance scales with compute is that an agent with twice as much compute as its opponent can win roughly 2⁄3 of the time. This behaviour is strikingly similar to that of a toy model where each player chooses as many random numbers as they have compute, and the player with the highest number wins.
The ideal model would probably combine both aspects – that longer tasks have both more subtasks and harder subtasks. But this would have the downside of introducing more free parameters, and the data is likely to be too noisy to fit these in the near future. Overall, sticking to fitting two-parameter sigmoids is probably the way to go for now.
- ^
After installing the eval-analysis-public repo, I obtained these numbers by running the following command:
mkdir data/wrangled/logistic_fits; python -m src.wrangle.logistic --fig-name headline --runs-file data/external/all_runs.jsonl --output-logistic-fits-file data/wrangled/logistic_fits/headline.csv --release-dates data/external/release_dates.yaml --bootstrap-file data/wrangled/bootstrap/headline.csv
Thanks to Nate Rush for help with this.
Agree about recent results not being driven by formalization, but I’d also guess having ground truth (e.g. numeric answers or reference solutions) remains pretty important, which doesn’t scale to the superhuman regime.
Agree that evidence from humans means reaching superhuman capability through purely informal proof is possible in principle. But ML is less robust than humans by default, and AI is already more proficient with formal proof systems than most mathematicians. So informal-to-formal seems like a natural consequence of increased tool use. Not confident in this of course.
I expect easy-to-check software engineering tasks (and tasks that are conceptually similar to easy-to-check tasks) to be pretty close to math, and harder-to-check/fuzzier tasks to lag. Most tasks in the broad economy seem like they fall in the latter category. The economy will likely adapt to make lots of tasks better suited to AI, but that process may be slower than the capability lag anyway. AI R&D might be a different story, but I will leave that to another discussion.
Superhuman math AI will plausibly arrive significantly before broad automation
I think it’s plausible that for several years in the late 2020s/early 2030s, we will have AI that is vastly superhuman at formal domains including math, but still underperforms humans at most white-collar jobs (and so world GDP growth remains below 10%/year, say – still enough room for AI to be extraordinarily productive compared to today).
Of course, if there were to be an intelligence explosion on that timescale, then superhuman math AI would be unsurprising. My main point is that superhuman math AI still seems plausible even disregarding feedback loops from automation of AI R&D. On the flip side, a major catastrophe and/or coordinated slowdown could prevent both superhuman math AI and broad automation. Since both of these possibilities are widely discussed elsewhere, I will disregard both AI R&D feedback loops and catastrophe for the purposes of this forecast. (I think this is a very salient possibility on the relevant timescale, but won’t justify that here.)
My basic reasons for thinking vastly superhuman math AI is a serious possibility in the next 4–8 years (even absent AI R&D feedback loops and/or catastrophe):
Performance in formal domains is verifiable: math problems can be designed to have a unique correct answer, and formal proofs are either valid or invalid. Historically, in domains with cheap, automated supervision signals, only a relatively small amount of research effort has been required to produce superhuman AI (e.g., in board games and video games). There are often other bottlenecks than supervision, most notably exploration and curricula, but these tend to be more surmountable.
Recent historical progress in math has been extraordinarily fast: in the last 4 years, AI has gone from struggling with grade school math to achieving an IMO gold medal, with progress at times exceeding almost all forecasters’ reasonable expectations. Indeed, much of this progress seems to have been driven by the ability to automatically supervise math, with reasoning models being trained using RL on a substantial amount of math data.
Superhuman math AI looks within reach without enormous expense: reaching superhuman ability in a domain requires verifying solutions beyond a human’s ability to produce them, and so a static dataset produced by humans isn’t enough. (In fact, a temporary slowdown in math progress in the near future seems possible because of this, although I wouldn’t bet on it.) But the following two ingredients (plus sufficient scale) seem sufficient for superhuman math AI, and within reach:
Automatic problem generation: the ability to generate a diverse enough set of problems such that both (a) most realistic math of interest to humans is within distribution and (b) problem difficulty is granular enough to provide a good curriculum. Current LLMs with careful prompting/fine-tuning may be enough for this.
Reliable informal-to-formal translation: solution verifiers need to be robust enough to avoid too much reward hacking, which probably requires natural language problems and solutions to be formalized to some degree (a variety of arrangements seem possible here, but it’s hard to see how something purely informal can provide sufficiently scalable supervision, and it’s hard to see how something purely formal can capture mathematicians’ intuitions about what problems are interesting). This is basically a coding problem, and doesn’t seem too far beyond the capabilities of current LLMs. Present-day formalization efforts by humans are challenging, but in large part because of their laboriousness, which AI is excellent at dealing with.
Note I’m not claiming that there will be discontinuous progress once these ingredients “click into place”. Instead, I expect math progress to continue on a fast but relatively continuous trajectory (perhaps with local breakthroughs/temporary slowdowns on the order of a year or two). The above two ingredients don’t seem especially responsible for current math capabilities, but could become increasingly relevant as we move towards and into the superhuman regime.
By contrast, some reasons to be skeptical that AI will be automating more than a few percent of the economy by 2033 (still absent AI R&D feedback loops and/or catastrophe):
Progress in domains in which performance is hard to verify has been slower: by comparison with the dramatic progress in math, the ability of an AI to manage a small business enterprise is relatively unimpressive. In domains with a mixture of formal and informal problem specifications, such as coding, progress has been similarly fast to math, or perhaps a little slower (as measured by horizon length), but my qualitative impression is that this has been driven by progress on easy-to-verify tasks, with some transfer to hard-to-verify tasks. I expect to continue to see domains lag behind based on the extent to which performance is easy to verify.
Possible need for expensive long-horizon data: in domains with fuzzy, informal problem specifications, or requiring expensive or long-horizon feedback from the real world, we will continue to see improvements, since there will be transfer both from pretraining scaling and from more RL on verifiable tasks. But for tasks where this progress is slow despite the task being economically important, it will eventually be worth it to collect expensive long-horizon feedback. However, it might take several years to scale up the necessary infrastructure for this, unlike some clear routes to superhuman math AI, for which all the necessary infrastructure is essentially already in place. This makes a 2–5+ year lag seem quite plausible.
Naive revenue extrapolation: one way to get a handle on the potential timescale until broad automation is to extrapolate AI company revenues, which are on the order of tens of billions of dollars per year today, around 0.01% of world GDP. Even using OpenAI’s own projections (despite their incentives to make overestimates), which forecast that revenue will grow by a factor of 10 over the next 4 years, and extrapolating them an additional 4 years into the future, gives an estimate of around 1% of world GDP by 2033. AI companies won’t capture all the economic value they create, but on the other hand this is a very bullish forecast by ordinary standards.
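The naive revenue extrapolation above amounts to the following arithmetic. This is a sketch under simplifying assumptions: it uses the rough 0.01% starting share and 10x-per-4-years growth quoted in the text, holds world GDP fixed, and the variable names are illustrative.

```python
share_today = 0.0001          # about 0.01% of world GDP
growth_per_four_years = 10    # projected revenue multiple over 4 years
years_ahead = 8               # 4 projected years plus 4 extrapolated years

# Compound the 4-year multiple over the full horizon.
share_future = share_today * growth_per_four_years ** (years_ahead / 4)

# Annual growth rate implied by a 10x increase over 4 years.
implied_annual_growth = growth_per_four_years ** (1 / 4) - 1

print(f"{share_future:.2%} of world GDP, {implied_annual_growth:.0%} growth per year")
```

Allowing for growth in world GDP over the same period would make the resulting share slightly smaller.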
What would a world with vastly superhuman math AI, but relatively little broad automation, look like? Some possibilities:
Radical effect on formal sciences: by “vastly superhuman math AI”, I mean something like: you can give an AI a math problem, and it will respond within e.g. a couple of hours with a formal proof or disproof, as long as a human mathematician could have found an informal version of the proof in say 10 years. (Even though I just argued for the plausibility of this, it seems completely wild to comprehend, spelled out explicitly.) I think this would completely upend the formal sciences (math, theoretical computer science and theoretical physics) to say the least. Progress on open problems would be widespread but highly variable, since their difficulty likely ranges from “just out of reach to current mathematicians” to “impossible”.
Noticeable speed-up of applied sciences: it’s not clear that such a dramatic speed-up in the formal sciences would have that dramatic consequences for the rest of the world, given how abstract much of it is. Cryptography, formal verification and programming languages might be the most consequential areas, followed by areas like experimental physics and computational chemistry. However, in most of the experimental sciences, formal results are not the main bottleneck, so speed-ups would be more dependent on progress on coding, fuzzier tasks, robotics, and so on. Math-heavy theoretical AI alignment research would be significantly sped up, but may still face philosophical hurdles.
Broader economy: it’s worth emphasizing that even if world GDP growth remains below 10%/year, that still leaves plenty of room for AI to feel “crazy”, labor markets to be dramatically affected by ordinary standards, political discussion to be dominated by AI, etc. Note also that this period may be fairly short-lived (e.g., a few years).
Such a scenario is probably poor as an all-things-considered conditional forecast, since I’ve deliberately focused on a very specific technological change, but it hopefully adds some useful color to my prediction.
Finally, some thoughts on whether pursuing superhuman math AI specifically is a beneficial research direction:
Possibility for transfer: there is a significant possibility that math reasoning ability transfers to other capabilities; indeed, we may already be seeing this in today’s reasoning models (though I haven’t looked at ablation results). That being said, moving into the superhuman regime and digging into specialist areas, math ability will increasingly be driven by carefully-tuned specialist intuitions, especially if pursuing something like the informal-to-formal approach laid out above. Moreover, specialized math ability seems to have limited transfer in humans, and transfer in ML is generally considerably worse than in humans. Overall, this doesn’t seem like a dominant consideration.
Research pay-offs: a different kind of “transfer” is that pursuit of superhuman math AI would likely lead to general ML research discoveries, clout, PR etc., making it easier to develop other AI capabilities. I think this is an important consideration, and probably the main reason that AI companies have prioritized math capabilities so far, together with tractability. However, pursuing superhuman math AI doesn’t seem that different from other capabilities research in this regard, so the question of how good it is in this respect is mostly screened off by how good you think it is to work on capabilities in general (which could itself depend on the context/company).
Differential progress: the kinds of scientific progress that superhuman math AI would enable look more defense-oriented than average (e.g., formal verification), and I think the possibility of speeding up theoretical AI alignment research is significant (I work in this area and math AI is already helping).
Replaceability: there are strong incentives for AI companies and for individual researchers to pursue superhuman math AI anyway (e.g., the research pay-offs discussed above), which reduces the size (in either direction) of the marginal impact of an individual choosing to work in the area.
Overall, pursuing superhuman math AI seems mildly preferable to working on other capabilities, but not that dissimilar in its effects. It wouldn’t be my first choice for most people with the relevant skillset, unless they were committed to working on capabilities anyway.
Thanks! Trained MLPs are somewhat trickier to design a contest around, since once you have a well-defined distribution over trained models, people can spend a lot of compute offline empirically fitting predictors for the expected output as a function of the model parameters, and it would be much harder to outperform these with mechanistic approaches. But it’s good to know this may be more motivating, and we may consider it.
The existence of such an algorithm is a special case of a broader conjecture that we have been calling the “matching sampling principle” (specifically the train-and-explain version, as described here). Our evidence for this conjecture is mainly based on examples where it appears to hold, a lack of compelling counterexamples, and more abstract philosophical reasoning. The specific research bet is a judgment call based on the plausibility of the overall approach, tractability, value of information, etc. (Sorry to be vague, it would take a lot of work to give a very detailed answer to your question, and we are hoping to say more about all of this in the not-too-distant future.)