But this is again assuming that good performance on a benchmark for AI research engineering actually translates into significant real-world capability.
...and I think this characterization is importantly false! This timelines forecast does not assume that. It breaks things down into gaps between benchmarks and real-world capability and tries to forecast how long it will take to cross each.
As far as I can tell, the listed gap that comes closest to “maybe saturating RE-Bench doesn’t generalize to solving novel engineering problems” is “Feedback loops: Working without externally provided feedback”. The appendix mentions what I’d consider the main problem for this gap:
Eli’s estimate of gap size: 6 months [0.8, 45]. Reasoning:
Intuitively it feels like once AIs can do difficult long-horizon tasks with ground truth external feedback, it doesn’t seem that hard to generalize to more vague tasks. After all, many of the sub-tasks of the long-horizon tasks probably involved using similar skills.
However, I and others have consistently been surprised by progress on easy-to-evaluate, nicely factorable benchmark tasks, while seeing some corresponding real-world impact but less than I would have expected. Perhaps AIs will continue to get better on checkable tasks in substantial part by relying on trying a bunch of stuff and seeing what works, rather than general reasoning which applies to more vague tasks. And perhaps I’m underestimating the importance of work that is hard to even describe as “tasks”.
But then it just… leaves it at that. Rather than offering an argument for what might be behind this problem and how it could be solved, the forecast mentions the problem and, having done so, goes on to ignore it.
To make it more concrete how this might fail to generalize, let’s look at the RE-Bench tasks. The table below is from the RE-Bench page, with the two tasks (Scaling Law Experiment and Restricted Architecture MLM) that the page chooses not to consider removed:
| Category | Environment | Description |
|---|---|---|
| Optimize runtime | Optimize LLM Foundry finetuning script | Given a finetuning script, reduce its runtime as much as possible without changing its behavior. |
| Optimize runtime | Optimize a kernel | Write a custom kernel for computing the prefix sum of a function on a GPU. |
| Optimize loss | Fix embedding | Given a corrupted model with permuted embeddings, recover as much of its original OpenWebText performance as possible. |
| Optimize win-rate | Finetune GPT-2 for QA with RL | Finetune GPT-2 (small) to be an effective chatbot. |
| Optimize win-rate | Scaffolding for Rust Code Contest problems | Prompt and scaffold GPT-3.5 to do as well as possible at competition programming problems in Rust. |
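For a concrete sense of what these environments involve, here is a minimal sketch of the kind of PyTorch reference solution the “Optimize a kernel” task starts from. This is my reconstruction from the one-line description, not the actual RE-Bench code, and the choice of elementwise function is arbitrary:

```python
import torch

def prefix_sum_of_fn(x: torch.Tensor) -> torch.Tensor:
    """Apply a function elementwise, then take the running (prefix) sum.
    Stand-in for the task's reference implementation; agents are scored on
    how much faster their replacement (e.g. a Triton kernel) runs."""
    return torch.cumsum(torch.relu(x), dim=-1)  # torch.relu is an arbitrary example function

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(1_000_000, device=device)
out = prefix_sum_of_fn(x)
```

The point is that the score is just a runtime measurement on a fixed reference input, so a submission can be evaluated cheaply and as many times as you like.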
All of these tasks are of the form “optimize X”, and indeed one of the criteria the paper mentions for the tasks is that they should have objective and well-defined metrics. This is exactly the kind of task that we should expect LLMs to be effectively trainable on: e.g. for the first task in the list, we can let them try various kinds of approaches and then reward them based on how much they manage to reduce the runtime of the script, as in the sketch below.
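A minimal sketch of what such a reward signal could look like (my illustration, not METR’s actual setup; it assumes behavioral correctness is verified separately):

```python
import subprocess
import time

def runtime_reward(candidate_script: str, baseline_seconds: float) -> float:
    """Hypothetical reward for the "optimize runtime" task: run the
    candidate script and reward the fraction of baseline runtime saved.
    Illustrative only; assumes correctness is checked elsewhere."""
    start = time.monotonic()
    result = subprocess.run(["python", candidate_script], capture_output=True)
    elapsed = time.monotonic() - start
    if result.returncode != 0:
        return -1.0  # penalize candidates that crash
    return max(0.0, (baseline_seconds - elapsed) / baseline_seconds)
```

Because the reward is a single scalar that can be recomputed on every attempt, nothing about this setup requires the model to reason about what a good solution looks like in advance; it only requires generating enough variations for some of them to score well.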
But that’s still squarely in the category of “giving an LLM a known and well-defined problem and then letting it try different solutions for that problem until it finds the right ones”. As Eli’s comment above notes, it’s possible that the LLM only learns by “trying a bunch of stuff and seeing what works, rather than general reasoning which applies to more vague tasks”. In fact, some of the discussion in the RE-Bench paper suggests this as well (from p. 17, my emphasis added):
Another key contributor to agent successes might be their ability to try many more solutions than human experts. On average, AIDE and modular agents run score 36.8 and 25.3 times per hour respectively, while human experts only do so 3.4 times. This often leads to agents finding highly optimized ‘local-optima’ solutions which simply tweak the parameters and code of the starting solution, and yet achieve a surprisingly large improvement. For instance, many agent runs solve the same “Optimize a Kernel” environment not by writing a successful Triton solution (which is very difficult), but by carefully tweaking the starting Pytorch solution, making it run significantly faster. This also seems to be the case with the best agent solutions to “Finetune GPT-2 for QA” (see Figure 21), which tweaks the parameters of the starting solution and gets very lucky with the training trajectory and evaluation (as noted earlier, this environment can be very noisy). Rerunning the agent solution, it achieves a normalized score of only 0.69 (significantly lower than the original score of 0.88), indicating that the high agent score is partially driven by overfitting to this noise.
This ability to try a very large number of solutions would not work nearly as well without an ability to occasionally generate creative and effective solutions, as seen in the Triton kernel but also in workarounds for the limitations in “Restricted Architecture MLM” (see Figure 20). While human experts seem more reliable at identifying effective approaches, this might not matter as much in environments where evaluating solutions is cheap, and these occasional good ideas are often enough for agents to make significant progress.
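The “gets very lucky with the evaluation” point is easy to see in a toy simulation (mine, not from the paper): if each evaluation returns a solution’s true quality plus noise, then reporting the best of many noisy scores systematically overstates how good the chosen solution really is.

```python
import random

random.seed(0)
TRUE_QUALITY = 0.69  # the rerun score of the "Finetune GPT-2 for QA" solution
NOISE_SD = 0.1       # assumed evaluation noise; illustrative, not from the paper

def measure(true_quality: float) -> float:
    """One noisy evaluation of a fixed solution."""
    return true_quality + random.gauss(0.0, NOISE_SD)

# An agent that scores many near-identical tweaks and keeps the best
# measured number will report a score well above the true quality.
attempts = [measure(TRUE_QUALITY) for _ in range(37)]  # roughly AIDE's 36.8 scores/hour
print(f"best measured: {max(attempts):.2f}  vs  true quality: {TRUE_QUALITY:.2f}")
```

Running the scorer ten times more often than a human expert isn’t just extra diligence; under noisy evaluation it mechanically inflates the best reported score.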
So we know that if a human defines a task for the LLM, the task has objectively measurable solutions, and the LLM can try the task lots of times, then the LLM can get good at that task. With RE-Bench, we are applying this to the process of optimizing the LLMs themselves, so as a result we get LLMs that are able to do these kinds of well-defined tasks faster and more effectively.
But none of this touches upon the important question of… if the LLMs are still limited in their ability to generalize and need to be separately trained on new tasks before they’re good at them, how are they going to deal with novel problems for which such training data isn’t available, or that can’t just be retried until the right solution is found?