That you think are sufficient to meaningfully change how valid people should consider the overall predicted trend to be?
I’m not Adam, but my response is “No”, based on the description Megan copied in thread and skimming some of the paper. It’s good that the paper includes those experiments, but they don’t really speak to the concerns Adam is discussing. Those concerns, as I see them (I could be misunderstanding):
1. Conceptual coherence: in humans there are different skills, e.g., across different fields, that don’t seem to project easily onto a time-horizon dimension. Or like, our sense of how much intelligence they require or how difficult they are doesn’t correspond all that closely with the time taken to do them.
2. Benchmark bias: the solution criteria are known and the progress criteria are often known; it’s a big jump from that to the real-world scary things we’re worried about.
Do the experiments in Sec 6 deal with this?
a. No SWAA (“Retrodiction from 2023–2025 data”): Does not deal with 2. Mostly does not deal with 1, as both HCAST + RE-Bench and All-3 are mostly software-engineering dominated, with a little bit of other stuff.
b. Messiness factors: Does not speak to 1. This is certainly relevant to 2, but I don’t think it’s conclusive. Quoting some from the paper:
We rated HCAST and RE-Bench tasks on 16 properties that we expected to be 1) representative of how real world tasks might be systematically harder than our tasks and 2) relevant to AI agent performance. Some example factors include whether the task involved a novel situation, was constrained by a finite resource, involved real-time coordination, or was sourced from a real-world context. We labeled RE-bench and HCAST tasks on the presence or absence of these 16 messiness factors, then summed these to obtain a “messiness score” ranging from 0 to 16. Factor definitions can be found in Appendix D.4.
The mean messiness score amongst HCAST and RE-Bench tasks is 3.2/16. None of these tasks have a messiness score above 8/16. For comparison, a task like ‘write a good research paper’ would score between 9/16 and 15/16, depending on the specifics of the task.
On HCAST tasks, AI agents do perform worse on messier tasks than would be predicted from the task’s length alone (b = -0.081, R² = 0.251) …
However, trends in AI agent performance over time are similar for lower and higher messiness subsets of our tasks.
This seems like very weak evidence in favor of the hypothesis that Benchmark Bias is a big deal; the catch is that they just don’t have very messy tasks. (A toy version of the regression they describe is sketched after this list.)
c. SWE-Bench Verified: doesn’t speak to 1 or 2.
d. Internal PR experiments: Maybe speaks a little to 1 and 2, because these are more real-world, closer-to-the-thing-we-care-about tasks, but not much, as they’re still clearly verifiable and still software engineering.
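To make the messiness regression from (b) concrete, here is a minimal toy version of what the quoted passage describes; the task data, the factor draws, and the “excess success” construction are all invented by me, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical task data: 16 binary messiness factors per task, a success rate
# predicted from task length alone, and an observed agent success rate.
n_tasks = 50
factors = rng.integers(0, 2, size=(n_tasks, 16))   # presence/absence of each factor
messiness = factors.sum(axis=1)                     # "messiness score", 0..16
length_predicted = rng.uniform(0.2, 0.9, size=n_tasks)
observed = np.clip(
    length_predicted - 0.05 * messiness / 16 + rng.normal(0, 0.1, n_tasks), 0, 1
)

# Regress the gap between observed and length-predicted success on messiness.
# A negative slope (the paper reports b = -0.081) would mean agents do worse on
# messier tasks than task length alone predicts.
excess = observed - length_predicted
slope, intercept = np.polyfit(messiness, excess, 1)
fitted = intercept + slope * messiness
r2 = 1 - ((excess - fitted) ** 2).sum() / ((excess - excess.mean()) ** 2).sum()
print(f"b = {slope:.3f}, R^2 = {r2:.3f}")
```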
I do think Thomas and Vincent’s follow-up work here on time horizons for other domains is useful evidence pointing a little against the conceptual coherence objection. But only a little.
I guess my understanding is more that the conceptual coherence objection isn’t an objection to the predictive accuracy of the trend, which is why I had brought up the scaling law / pretraining loss / downstream task analogy.
As far as I understand, the messiness result bears on the Benchmark Bias objection for predicting performance at any given point in time, but not on the actual trend, given that the trend was similar for lower- and higher-messiness tasks.
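To illustrate the distinction I mean (a toy construction of mine, not anything from the paper): a bias that shrinks every measured horizon by a constant factor moves the prediction at any given date, but leaves the fitted doubling time unchanged, since in log space it only shifts the intercept.

```python
import numpy as np

# Hypothetical 50%-success time horizons (minutes) for models released over time.
t = np.array([0.0, 0.5, 1.0, 1.5, 2.0])              # years since some start date
benchmark_horizon = np.array([4.0, 8.0, 16.0, 32.0, 64.0])

def fit_trend(t, horizon):
    """Fit log2(horizon) against t; return doubling time and a prediction at t = 3."""
    slope, intercept = np.polyfit(t, np.log2(horizon), 1)
    doubling_months = 12.0 / slope
    predicted_at_3 = 2 ** (intercept + slope * 3.0)
    return doubling_months, predicted_at_3

# Suppose "messy" real-world horizons are uniformly 10x shorter than benchmark ones:
# the point prediction drops 10x, but the doubling time is identical.
for label, h in [("benchmark", benchmark_horizon),
                 ("10x messiness penalty", benchmark_horizon / 10)]:
    dm, pred = fit_trend(t, h)
    print(f"{label}: doubling time ~{dm:.1f} months, horizon at t=3 ~{pred:.1f} min")
```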
Is your intuition that the trend itself is significantly wrong as well (like, by more than their CI)? Or that it’s just the performance prediction at a given point in time? Or is the question ill-formed / undefined?
We care about the performance prediction at a given point in time for skills like “take over the world”, “invent new science”, and “do RSI” (recursive self-improvement), plus “automate AI R&D”, which I think the benchmark does speak to. We would like to know when those skills will be developed.
In the frame of this benchmark, and Thomas and Vincent’s follow up work, it seems like we’re facing down at least three problems:
1. The original time horizons tasks are clearly out of the distribution we care about. Solution: create a new task suite we think is the right distribution.
2. We don’t know how well time horizons will do at predicting future capabilities, even in this domain. Solution: keep collecting new data as it comes out in order to test predictions on whatever distributions we have, examine things like the conceptual coherence objection, and try to make progress (a sketch of this kind of check follows this list).
3. We don’t know how well the general “time horizons” approach works across domains. We have some data on this in the follow-up work; maybe it’s a 2:1 update from a 1:1 prior?
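For point 2, the kind of check I have in mind looks roughly like this (my framing and made-up numbers, not an existing pipeline): fit the log-linear trend on models released before some cutoff, then see how far later models land from the extrapolation.

```python
import numpy as np

# Hypothetical measured time horizons (minutes) by model release date.
release_year = np.array([2023.0, 2023.5, 2024.0, 2024.5, 2025.0, 2025.5])
horizon_min = np.array([5.0, 9.0, 20.0, 35.0, 80.0, 150.0])

cutoff = 2025.0
train = release_year < cutoff

# Fit log2(horizon) against years since 2023 on the pre-cutoff models only.
slope, intercept = np.polyfit(release_year[train] - 2023.0,
                              np.log2(horizon_min[train]), 1)

# Compare post-cutoff models to the extrapolated trend.
for year, measured in zip(release_year[~train], horizon_min[~train]):
    predicted = 2 ** (intercept + slope * (year - 2023.0))
    print(f"{year}: predicted ~{predicted:.0f} min, measured {measured:.0f} min "
          f"({measured / predicted:.2f}x the extrapolation)")
```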
So my overall take is that the current work I’m aware of tells us:
Small positive update on time horizons being predictive at all.
A small positive update on the specific Software Engineering trends being predictive within distribution.
A small positive update on “time horizons” being common across different reasonable, easy-to-define distributions.
And on “doubling time in the single digit months” being the rate of time horizon increase across many domains.
A small negative update on the specific time horizon length from one task distribution generalizing to other task distributions (maybe barely an update, tbh, since the prior is already much lower than 50/50). So it tells us approximately nothing about the performance prediction at a given point in time for the capabilities I care about; a back-of-the-envelope sketch of why that matters is below.
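As a rough illustration of that last point (my numbers, purely hypothetical): even if “doubling time in the single digit months” holds across domains, the forecast for a capability we care about is dominated by where that capability’s effective current horizon sits, which is exactly what one task distribution doesn’t tell us about another.

```python
import math

def months_to_reach(current_hours, target_hours, doubling_months):
    """Months until the horizon reaches the target, if the exponential trend holds."""
    return math.log2(target_hours / current_hours) * doubling_months

doubling_months = 6.0    # illustrative "single digit months" doubling time
target_hours = 160.0     # hypothetical target: roughly a month of full-time work

# A 10x disagreement about the relevant distribution's current effective horizon
# shifts the forecast by ~20 months even with the trend held fixed.
for current in (1.0, 0.1):
    months = months_to_reach(current, target_hours, doubling_months)
    print(f"current horizon {current} h -> ~{months:.0f} months to reach {target_hours:.0f} h")
```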