I feel like there are a bunch of viewpoints expressed about long timelines/slow takeoff but a lack of arguments.
I kind of feel the opposite way, in that a lot of people seem to think we’ll have short timelines but the arguments for that seem weak! They seem to mostly be based on something like trend extrapolation, plus assuming that e.g. models getting better scores on benchmarks means they’re actually getting better at real-world tasks. E.g. somebody like Leopold Aschenbrenner will write that GPT-4 is “on the level of a smart high schooler” while at the same time, language models require extensive additional scaffolding to even get started on a simple game like Pokemon (and none of them have managed to beat it yet).
There seems to be a general and unjustified assumption that merely because language models perform at the level of a “smart high schooler” on some specific narrow problems, you can say that they have that level of intelligence overall. But that seems clearly false, somewhat analogous to saying that a calculator is a superintelligence just because it’s superhuman at quickly operating on large numbers. Rather, the AI we have so far seems to succeed at the kinds of things it’s been specifically trained on, but fails to generalize to more novel situations. People also haven’t been able to point at much in the way of convincing novel discoveries made by LLMs.
I asked for the strongest arguments in favor of short timelines some time ago, and didn’t feel like any of them were very compelling. By far the most top-voted answer was one arguing that we might get AI to substantially accelerate AI progress because a particular AI research engineering benchmark looks like it will get saturated within a couple of years. But this is again assuming that good performance on a benchmark for AI research engineering actually translates into significant real-world capability. o3 is said to perform “on par with elite human competitors” on CodeForces, but recent characterizations of its programming ability are that while it produces code that “almost certainly works”, that code is “verbose, brittle, hard to read”.
Furthermore, the most advanced reasoning models seem to be doing an increasing amount of reward hacking and resorting to more cheating in order to produce the answers that humans want. Not only will this mean that some of the benchmark scores may become unreliable, it means that it will be increasingly hard to get productive work out of them as their intelligence increases and they get better at fulfilling the letter of the task in ways that don’t meet the spirit of it.
There’s also the fact that the vast majority of benchmarks measure a model’s performance when it’s given just one task at a time, which it can focus on completely. If you want the models to act agentically in situations where they have multiple conflicting priorities and need to decide what kinds of approaches to try, then you need them to have something like executive function. Much of Claude’s difficulty in playing Pokemon comes from severe problems with executive function.
This is an entirely separate capability that we’re only now starting to make progress on, one which is more difficult to solve because the nature of good executive function is environmentally dependent rather than something with one objectively correct solution. E.g. sometimes it’s good if you’re able to determinedly work on a problem without getting frustrated or bored, but some of Claude’s problems in playing Pokemon are because it doesn’t ever get frustrated or bored, so it keeps trying the same hopeless approaches over and over again. Sometimes getting utterly obsessed with something that seems unpromising makes you into Charles Darwin, sometimes it causes you to just obsess over something nobody else cares about for your whole life. That’s one crucial capability that any real-world agent would need to be effective, and I’ve seen minimal discussion of it.
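To make that concrete, here’s a rough sketch of the kind of stopping heuristic a game-playing agent would need (entirely hypothetical, not a claim about how Claude actually works; the callbacks stand in for whatever the real agent scaffold provides). The uncomfortable part is that the threshold has no objectively correct value:

```python
# Hypothetical sketch: a crude "frustration" heuristic for an agent loop.
# propose_action, apply_action, progress_metric and switch_approach are
# assumed callbacks, not real APIs. Set max_stalled_attempts too low and the
# agent abandons promising long shots; too high and it grinds on hopeless
# approaches forever, which is roughly Claude's Pokemon failure mode.

def run_with_boredom(propose_action, apply_action, progress_metric,
                     switch_approach, max_steps: int = 1000,
                     max_stalled_attempts: int = 20) -> None:
    best_progress = progress_metric()
    stalled = 0
    for _ in range(max_steps):
        apply_action(propose_action())
        progress = progress_metric()
        if progress > best_progress:
            best_progress, stalled = progress, 0
        else:
            stalled += 1
            if stalled >= max_stalled_attempts:
                switch_approach()  # "get bored" and try something different
                stalled = 0
```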
Does any of this make me certain about longer timelines? No. AI progress has been surprisingly fast in the past, and maybe it will be that in the future too. But I don’t see any obvious reason to be very confident in short ones, either.
(Also none of this is considering the possibility of increased regulation slowing down the pace of AI—we’ve already seen regulators starting to get nervous about AI, and widespread anti-AI sentiment.)
Also, a conversation I had with Claude 3.6 about why people tend to overestimate the speed of AI progress
Me: After the 2007 DARPA Grand Challenge there was a lot of hype about how self-driving cars were just around the corner. But we mostly still don’t have them. Why did it take so much longer than expected?
Claude: [...] Several key factors contributed to this delay:
The structured nature of the DARPA challenge versus real-world complexity
The DARPA challenges took place in relatively controlled environments—first desert terrain, then a mock urban setting. The real world presents infinitely more edge cases: unexpected construction zones, emergency vehicles, temporary traffic patterns, and regional driving cultures. A system that works 99% of the time isn’t good enough when failures could be fatal.
The long tail of edge cases
Engineers discovered that handling the last few percentage points of driving scenarios was exponentially more difficult than the first 90%. Things humans handle intuitively—like understanding that a ball rolling into the street might be followed by a child—proved extremely challenging to systematize. [...]
Regulatory and liability frameworks
The legal and insurance frameworks for autonomous vehicles needed to be developed essentially from scratch. Questions about liability in accidents, certification standards, and testing requirements took years to work through.
This pattern of overoptimistic timelines followed by a “reality check” period has been common in AI development. We often underestimate the gap between solving a problem in a controlled environment versus deploying a robust solution in the messy real world. [...]
Me: You mention overoptimistic timelines followed by a reality check having been common in AI development. What are some other examples?
Claude: Speech recognition was predicted to be “solved” multiple times—in the 1950s, researchers at RCA and MIT suggested we’d have widespread speech-to-text within a few years. But achieving human-level performance in real-world conditions (different accents, background noise, conversational speech) took until around 2017.
Machine translation shows a similar pattern. During the Cold War, researchers were optimistic that Russian-English translation would be solved quickly. Warren Weaver wrote an influential memo in 1949 suggesting it was imminent. But achieving truly fluent translation, especially for language pairs with very different structures, took many decades longer.
Computer vision is another classic example. In 1966, Marvin Minsky assigned a summer project to undergraduate students to “connect a camera to a computer and get the computer to describe what it sees.” The fundamental difficulty of the problem wasn’t apparent until researchers really started working on it. Even something as seemingly simple as reliable edge detection took decades to get right.
Expert systems in the 1980s generated enormous excitement and investment. Companies built specialized hardware and software to encode expert knowledge in rules. But they proved brittle and unable to handle novel situations or acquire new knowledge efficiently. This contributed to the “AI winter” of the 1990s.
Even in recent years, we’ve seen this pattern repeat with technologies like fully autonomous driving (as we just discussed) and robotic manipulation. The basic demos can look very impressive, but getting to robust real-world performance proves much harder than initially expected.
I think this pattern emerges because:
We tend to underestimate how much implicit knowledge and context humans bring to these tasks
Early successes in controlled environments don’t scale linearly to real-world complexity
Edge cases multiply exponentially as systems move from demos to deployment
The key lesson seems to be that achieving narrow success in controlled conditions is often just the first small step toward practical, robust solutions.
By far the most top-voted answer was one arguing that we might get AI to substantially accelerate AI progress because a particular AI research engineering benchmark looks like it will get saturated within a couple of years.
The list of things I see as concrete arguments that have moved down my timelines includes exactly this!
But this is again assuming that good performance on a benchmark for AI research engineering actually translates into significant real-world capability.
...and I think this characterization is importantly false! This timelines forecast does not assume that. It breaks things down into gaps between benchmarks and real-world capability and tries to forecast how long it will take to cross each.
There’s also the fact that the vast majority of benchmarks measure a model’s performance when it’s given just one task at a time, which it can focus on completely...
Agree that there are many such ‘gaps’! Would be curious to hear if you think there are important ones missing from the timelines forecast or if you have strong views that some of them will be importantly harder!
Furthermore, the most advanced reasoning models seem to be doing an increasing amount of reward hacking and resorting to more cheating in order to produce the answers that humans want. Not only will this mean that some of the benchmark scores may become unreliable, it means that it will be increasingly hard to get productive work out of them as their intelligence increases and they get better at fulfilling the letter of the task in ways that don’t meet the spirit of it.
Thanks for this! This is a good point. Do you think you can go further and say why you think it will be very hard to fix in the near term, so much so that models won’t be useful for AI research?
I agree re Leopold’s piece, it didn’t move my timelines.
Furthermore, the most advanced reasoning models seem to be doing an increasing amount of reward hacking and resorting to more cheating in order to produce the answers that humans want. Not only will this mean that some of the benchmark scores may become unreliable, it means that it will be increasingly hard to get productive work out of them as their intelligence increases and they get better at fulfilling the letter of the task in ways that don’t meet the spirit of it.
Thanks for this! This is a good point. Do you think you can go further and say why you think it will be very hard to fix in the near term, so much so that models won’t be useful for AI research?
This is more of an intuition than a rigorous argument, but to try to sketch it out...
As for why: basically all the arguments in the old Sequences for why aligning AI should be hard. For a while it seemed like things like the Outcome Pump thought experiment had aged badly, since if you told a modern LLM “get my mother out of the burning building”, it would certainly understand all of the implicit constraints in what you meant by that.
But as noted in Zvi’s post, this seems to be breaking down with the way reasoning models are trained:
This isn’t quite how I’d put it, but directionally yes:
Benjamin Todd: LLMs were aligned by default. Agents trained with reinforcement learning reward hack by default.
Peter Wildeford: this seems to be right – pretty important IMO
Caleb Parikh: I guess if you don’t think RLHF is reinforcement learning and you don’t think Sydney Bing was misaligned then this is right?
Peter Wildeford: yeah that’s a really good point
I think the right characterization is more that LLMs that use current methods (RLHF and RLAIF) largely get aligned ‘to the vibes’ or otherwise approximately aligned ‘by default’ as part of making them useful, which kind of worked for many purposes (at large hits to usefulness). This isn’t good enough to enable them to be agents, but it also isn’t good enough for them to figure out most of the ways to reward hack.
Whereas reasoning agents trained with full reinforcement will very often use their new capabilities to reward hack when given the opportunity.
My guess about what’s going on is something like… ordinary LLMs are trained to give the kinds of answers they’d predict an empathetic persona would give. In pretraining, they learn a model of what humans care about that they use to predict text, and then RLHF pushes them toward something like “give the kinds of answers that a caring and thoughtful person would give”, with “a caring and thoughtful persona” being something they largely already have a pre-existing concept for; the training just needs to locate it within their persona-space. They were also trained by human raters giving them feedback on the kinds of answers that were good at fulfilling the spirit and not just the letter of the request.
But if you are training them by e.g. automatically generating lots of programming problems and then rewarding them in proportion to how many of those they manage to solve, then you are only training them on how to satisfy the letter of the request. The automatic benchmarking reward process doesn’t have a sense of the spirit of the request, it just knows the narrowly-defined objective criteria.
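To make this concrete, here’s a minimal sketch of what such an outcome-only reward looks like (my own illustration, not any lab’s actual training code; the solve() entry point and the test format are made up). Nothing in this signal encodes the spirit of the request:

```python
# Minimal sketch of an outcome-only reward for RL on auto-generated coding
# problems. The reward only sees whether hidden test cases pass, so a solution
# that special-cases the test inputs scores exactly as well as one that meets
# the spirit of the task.

def letter_of_the_request_reward(solution_code: str, hidden_tests: list) -> float:
    namespace = {}
    try:
        exec(solution_code, namespace)  # run the model's submission
    except Exception:
        return 0.0
    solve = namespace.get("solve")  # assumed entry-point name
    if not callable(solve):
        return 0.0
    passed = 0
    for test_input, expected_output in hidden_tests:
        try:
            if solve(test_input) == expected_output:
                passed += 1
        except Exception:
            pass
    return passed / len(hidden_tests)  # reward = fraction of hidden tests passed
```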
Now the obvious way to deal with that would be to have the answers also be rated by an LLM that measured something like “how well does this meet the spirit of the request”. But then you quickly run into the risk of employing what Zvi calls The Most Forbidden Technique. E.g. the “spirit evaluator” could do things like looking at the chain-of-thought that produced the solution and noticing that the reasoning model it is evaluating mentions cheating. If those kinds of solutions were discouraged, then this would teach the reasoning model to just hide the cheating from the spirit evaluator.
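As a toy continuation of the sketch above (with spirit_evaluator standing in for a hypothetical judge model), here’s what bolting on that check looks like, and why the resulting gradient points toward hiding rather than honesty:

```python
# Hypothetical sketch of the failure mode: zero out the reward whenever the
# judge thinks the chain-of-thought admits to gaming the tests. The model can
# recover that reward either by not cheating or by cheating while keeping its
# chain-of-thought clean; the second is often the easier optimization target,
# which is the worry behind the Most Forbidden Technique.

def spirit_penalized_reward(solution_code: str, chain_of_thought: str,
                            hidden_tests: list, spirit_evaluator) -> float:
    base = letter_of_the_request_reward(solution_code, hidden_tests)
    # spirit_evaluator: assumed judge model returning a 0..1 "looks like cheating" score
    if spirit_evaluator(chain_of_thought) > 0.5:
        return 0.0
    return base
```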
Without the benefit of obvious clues in the chain-of-thought, the spirit evaluator would basically have to verify the reasoning model’s solution itself to see that it meets the spirit of the request… but the spirit evaluator’s ability to verify that e.g. a complex piece of code does exactly what a human would have wanted it to do probably isn’t better than the reasoning model’s own ability to do that.
To be able to verify that the reasoning model’s solutions meet the spirit of the request, we’d need to train the spirit verifier to be able to tell what solutions do meet the spirit of the request. But if we knew how to do that, would we need the spirit verifier in the first place? After all, the whole problem comes from the fact that just normal RLHF and “aligning the solution to the vibes” doesn’t seem sufficient for solving complicated agentic problems and you need more goal-oriented reasoning that explicitly tackles the objective constraints of the problem in question. (To take the “get my mother out of the burning building” example—current non-reasoning LLMs could certainly tell that you want her out alive and well, but they couldn’t think through a whole step-by-step rescue plan that took into account everything necessary for getting her out safely.)
But we can’t just tell the spirit verifier to “check that the solution meets these objective constraints”, because that’s the same “letter of the goal” objective that the reasoning model is being trained with and that the spirit verifier is supposed to do better than.
And of course, all of this is about the kinds of tasks that can be automatically verified and tested. We’ve seen that you can, to some extent, improve LLM answers on fuzzier topics by using human raters to turn the fuzzy problem into an objective test. So the LLM gets trained to output the kinds of answers that human raters prefer the most.
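Mechanically, the usual trick is to train a reward model on pairwise rater preferences, roughly a Bradley–Terry setup; a minimal sketch of that loss (not any particular lab’s pipeline) is below. Whatever property reliably makes raters pick one answer over another is what gets baked into the reward signal, and from there into the policy:

```python
# Rough sketch of turning pairwise rater judgments ("A was better than B")
# into a training signal: fit a reward model so the preferred answer gets the
# higher score, using a Bradley-Terry-style loss.

import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-probability that the rater-preferred answer outscores the other."""
    return -math.log(1.0 / (1.0 + math.exp(reward_rejected - reward_chosen)))

# e.g. preference_loss(2.0, 0.5) is small and preference_loss(0.5, 2.0) is large,
# so training pushes the reward model to agree with the raters.
```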
Yet naive scores by human raters aren’t necessarily what we want—e.g. more sycophantic models seem to do best in Chatbot Arena. While sycophancy and pleasing the user are no doubt aligned with some of what humans seem to like, we probably don’t want our models doing that. The obvious solution is then to have model answers rated by experts with more sophisticated models of what’s good or correct behavior.
But that raises the question: what if the experts are wrong? The same question applies both to very fuzzy topics like “what kinds of overall values should the LLMs be guided by” and to more rigorous ones, ranging from “how to evaluate the reliability of research” and “what’s the best nutrition” to “how to interpret this specific nuanced and easy-to-misunderstand concept in evolutionary biology”. In that case, if there are e.g. some specific ways in which particular experts tend to be biased or convincingly give flawed arguments, an LLM that’s told “argue like this kind of imperfect expert would argue” will learn that it should do just that, including vigorously defending that expert’s incorrect reasoning.
So getting the LLMs to actually be aligned with reality on these kinds of fuzzy questions is constrained by our ability to identify the theories and experts who are right. Of course, just getting the LLMs to convincingly communicate the views of our current top experts and best-established theories to a mass audience would probably be an enormous societal benefit! But it does imply that they’re going to provide little in the way of new ideas, if they are just saying the kinds of things that they predict our current experts with their current understanding would say.
But this is again assuming that good performance on a benchmark for AI research engineering actually translates into significant real-world capability.
...and I think this characterization is importantly false! This timelines forecast does not assume that. It breaks things down into gaps between benchmarks and real-world capability and tries to forecast how long it will take to cross each.
As far as I can tell, the listed gap that comes closest to “maybe saturating RE-Bench doesn’t generalize to solving novel engineering problems” is “Feedback loops: Working without externally provided feedback”. The appendix mentions what I’d consider the main problem for this gap:
Eli’s estimate of gap size: 6 months [0.8, 45]. Reasoning:
Intuitively it feels like once AIs can do difficult long-horizon tasks with ground truth external feedback, it doesn’t seem that hard to generalize to more vague tasks. After all, many of the sub-tasks of the long-horizon tasks probably involved using similar skills.
However, I and others have consistently been surprised by progress on easy-to-evaluate, nicely factorable benchmark tasks, while seeing some corresponding real-world impact but less than I would have expected. Perhaps AIs will continue to get better on checkable tasks in substantial part by relying on trying a bunch of stuff and seeing what works, rather than general reasoning which applies to more vague tasks. And perhaps I’m underestimating the importance of work that is hard to even describe as “tasks”.
But then it just… leaves it at that. Rather than providing an argument for what could be behind this problem and how it could be solved, it just mentions the problem and then, having done so, goes on to ignore it.
To make it more concrete how this might fail to generalize, let’s look at the RE-Bench tasks. The table below is from the RE-Bench page, with the two tasks (Scaling Law Experiment and Restricted MLM Architecture) that the page chooses not to consider removed:
| Environment | Description |
| --- | --- |
| Optimize runtime | |
| Optimize LLM Foundry finetuning script | Given a finetuning script, reduce its runtime as much as possible without changing its behavior. |
| Optimize a kernel | Write a custom kernel for computing the prefix sum of a function on a GPU. |
| Optimize loss | |
| Fix embedding | Given a corrupted model with permuted embeddings, recover as much of its original OpenWebText performance as possible. |
| Optimize win-rate | |
| Finetune GPT-2 for QA with RL | Finetune GPT-2 (small) to be an effective chatbot. |
| Scaffolding for Rust Code Contest problems | Prompt and scaffold GPT-3.5 to do as well as possible at competition programming problems in Rust. |
All of these are tasks that are described by “optimize X”, and indeed one of the criteria the paper mentions for the tasks is that they should have objective and well-defined metrics. This is the kind of task that we should expect LLMs to be effectively trainable at: e.g. for the first task in the list, we can let them try various kinds of approaches and then reward them based on how much they manage to reduce the runtime of the script.
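A rough sketch of what that reward signal could look like for the finetuning-script task (hypothetical, just to show how directly optimizable it is): run the candidate script, give no credit if its behavior changed, and otherwise score the speedup.

```python
# Hypothetical sketch: score a candidate version of the script purely by
# measured speedup, as long as its output still matches a reference run.
# A signal this clean can be hammered on by sampling many candidate edits.

import subprocess
import time

def runtime_reward(candidate_script: str, reference_stdout: str,
                   baseline_seconds: float) -> float:
    start = time.time()
    result = subprocess.run(["python", candidate_script],
                            capture_output=True, text=True)
    elapsed = time.time() - start
    if result.returncode != 0 or result.stdout != reference_stdout:
        return 0.0  # crashed or changed behavior: no credit
    return max(0.0, baseline_seconds / elapsed - 1.0)  # e.g. a 2x speedup scores 1.0
```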
But that’s still squarely in the category of “giving an LLM a known and well-defined problem and then letting it try different solutions for that problem until it finds the right ones”. As Eli’s comment above notes, it’s possible that the LLM only learns by “trying a bunch of stuff and seeing what works, rather than general reasoning which applies to more vague tasks”. In fact, some of the discussion in the RE-Bench paper suggests this as well (from p. 17, my emphasis added):
Another key contributor to agent successes might be their ability to try many more solutions than human experts. On average, AIDE and modular agents run score 36.8 and 25.3 times per hour respectively, while human experts only do so 3.4 times. This often leads to agents finding highly optimized “local-optima” solutions which simply tweak the parameters and code of the starting solution, and yet achieve a surprisingly large improvement. For instance, many agent runs solve the same “Optimize a Kernel” environment not by writing a successful Triton solution (which is very difficult), but by carefully tweaking the starting Pytorch solution, making it run significantly faster. This also seems to be the case with the best agent solutions to “Finetune GPT-2 for QA” (see Figure 21), which tweaks the parameters of the starting solution and gets very lucky with the training trajectory and evaluation (as noted earlier, this environment can be very noisy). Rerunning the agent solution, it achieves a normalized score of only 0.69 (significantly lower than the original score of 0.88), indicating that the high agent score is partially driven by overfitting to this noise.
This ability to try a very large number of solutions would not work nearly as well without an ability to occasionally generate creative and effective solutions, as seen in the Triton kernel but also in workarounds for the limitations in “Restricted Architecture MLM” (see Figure 20). While human experts seem more reliable at identifying effective approaches, this might not matter as much in environments where evaluating solutions is cheap, and these occasional good ideas are often enough for agents to make significant progress.
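The noise point in that excerpt generalizes: if an agent can cheaply score dozens of near-equivalent tweaks against a noisy evaluation and keep the best one, its best observed score gets inflated even when no tweak is genuinely better. A toy simulation (my own, with made-up numbers) of that selection effect:

```python
# Toy simulation of overfitting to evaluation noise: the agent makes many
# cheap attempts at roughly-equivalent tweaks and reports the best observed
# score; a human making a few attempts mostly sees scores near the true value.

import random

random.seed(0)
TRUE_QUALITY = 0.69  # what the tweak is "really" worth on a clean re-run
EVAL_NOISE = 0.08    # standard deviation of the noisy evaluation

def noisy_eval() -> float:
    return TRUE_QUALITY + random.gauss(0, EVAL_NOISE)

agent_scores = [noisy_eval() for _ in range(37)]  # ~37 evaluations per hour
human_scores = [noisy_eval() for _ in range(3)]   # ~3 evaluations per hour

print(f"agent's best observed score: {max(agent_scores):.2f}")  # typically ~0.85
print(f"human's best observed score: {max(human_scores):.2f}")  # typically ~0.75
```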
So we know that if a human defines a task for the LLM, the task has objectively measurable solutions, and the LLM gets to try the task lots of times, then the LLM can get good at that task. With RE-Bench, we are applying this to the process of optimizing the LLMs themselves, so as a result we get LLMs that are able to do these kinds of well-defined tasks faster and more effectively.
But none of this touches upon the important question of… if the LLMs are still limited in their ability to generalize and need to be separately trained on new tasks before they’re good at them, how are they going to deal with novel problems for which such training data isn’t available, or that can’t be just retried until you find the right solution?