A version of the argument I’ve heard:
AI can do longer and longer coding tasks. That makes it easier for AI builders to run different experiments that might let them build AGI. So either both (a) the long-horizon coding AI won't help with experiment selection at all and (b) the experiments will saturate the available compute resources before they're helpful, or else long-horizon coding AI will make strong AI come quickly.
I think it's not too hard to believe (a) & (b), fwiw. Experiments run at random might not lead to anyone figuring out the idea they need to build strong AI.
But "long coding tasks" is not a good category; it contains both [the type of long coding task that involves having to creatively figure out several things] and other long coding tasks. So the category does not support the inference. What long-horizon coding AI makes easier is for AI builders to run… some funny subset of "long coding tasks".
Yup. The missing assumption is that setting up and running experiments falls inside that funny subset, perhaps because it's fairly routine.
I agree it seems plausible that AI could accelerate progress by freeing up researcher time, but I think the case for horizon length predicting AI timelines is even weaker in such worlds. Overall I expect the benchmark would still have mostly the same problems (e.g., that the difficulty of tasks, even simple ones, is poorly described as a function of time cost; that benchmarkable proxies differ critically from their non-benchmarkable targets; that labs probably often use these benchmarks as explicit training targets; and so on), plus the additional (imo major) source of uncertainty about how much freeing up researcher time would actually accelerate progress.