This is a response to Dwarkesh’s post “Why I have slightly longer timelines than some of my guests”, which argues that continual learning is a bottleneck. I originally posted this response on twitter here.
I agree with much of this post. I also have a median of roughly 2032 for things going crazy, I agree learning on the job is very useful, and I’m also skeptical we’d see massive white collar automation without further AI progress.
However, I think Dwarkesh is wrong to suggest that RL fine-tuning can’t be qualitatively similar to how humans learn.
In the post, he discusses AIs constructing verifiable RL environments for themselves based on human feedback and then argues this wouldn’t be flexible and powerful enough to work, but RL could be used more similarly to how humans learn.
My best guess is that the way humans learn on the job is mostly by noticing when something went well (or poorly) and then sample efficiently updating (with their brain doing something analogous to an RL update). In some cases, this is based on external feedback (e.g. from a coworker) and in some cases it’s based on self-verification: the person just looking at the outcome of their actions and then determining if it went well or poorly.
So, you could imagine RL’ing an AI based on both external feedback and self-verification like this. And, this would be a “deliberate, adaptive process” like human learning. Why would this currently work worse than human learning?
Current AIs are worse than humans at two things, and this makes RL (quantitatively) much worse for them:
Robust self-verification: the ability to correctly determine when you’ve done something well/poorly in a way which is robust to you optimizing against it.
Sample efficiency: how much you learn from each update (potentially leveraging stuff like determining what caused things to go well/poorly which humans certainly take advantage of). This is especially important if you have sparse external feedback.
But, these are more like quantitative than qualitative issues IMO. AIs (and RL methods) are improving at both of these.
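To make the shape of this concrete, here is a minimal sketch of what RL’ing an agent on its own deployment episodes might look like, with reward coming from external feedback when it’s available and from self-verification otherwise. This is my own illustration, not a description of any lab’s pipeline; run_episode, self_verify, and policy_update are hypothetical stubs standing in for the real components.

```python
import random

def run_episode(policy, task):
    """Let the agent attempt the task and return the resulting trajectory (stub)."""
    return {"task": task, "actions": policy(task)}

def self_verify(policy, trajectory):
    """The model grades its own outcome, 0.0 (failed) to 1.0 (went well).
    This is the step that has to stay robust to the policy optimizing against it."""
    return random.random()  # stand-in for a model-graded score

def policy_update(policy, trajectory, reward):
    """Stand-in for a (hopefully sample-efficient) RL update, e.g. a policy-gradient step."""
    return policy  # a real implementation would adjust the weights here

def continual_rl_loop(policy, task_stream, external_feedback=None):
    """Online loop: after each on-the-job episode, update on whatever signal exists."""
    for task in task_stream:
        trajectory = run_episode(policy, task)
        feedback = external_feedback(task, trajectory) if external_feedback else None
        # Sparse external feedback (e.g. a coworker's rating) takes priority;
        # otherwise fall back to the model's own verification of the outcome.
        reward = feedback if feedback is not None else self_verify(policy, trajectory)
        policy = policy_update(policy, trajectory, reward)
    return policy
```

The two weaknesses listed above map directly onto this sketch: self_verify is where robustness to the policy gaming the verifier matters, and policy_update is where sample efficiency matters.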
All that said, I think it’s very plausible that the route to better continual learning runs more through building on in-context learning (perhaps through something like neuralese, though this would greatly increase misalignment risks...).
Some more quibbles:
For the exact podcasting tasks Dwarkesh mentions, it really seems like simple fine-tuning mixed with a bit of RL would solve his problem. So, an automated training loop run by the AI could probably work here. This just isn’t deployed as an easy-to-use feature.
For many (IMO most) useful tasks, AIs are limited by something other than “learning on the job”. In autonomous software engineering, they fail to match what humans can do with 3 hours of time, and they are typically limited by being bad agents or by being generally dumb/confused. To be clear, it seems totally plausible that for the podcasting tasks Dwarkesh mentions, learning is the limiting factor.
Correspondingly, I’d guess the reason we don’t see people trying more complex RL-based continual learning in normal deployments is that there is lower-hanging fruit elsewhere and typically something else is the main blocker. I agree that if you had human-level sample efficiency in learning, this would immediately yield strong results (e.g., you’d presumably have very superhuman AIs with 10^26 FLOP); I’m just making a claim about more incremental progress.
I think Dwarkesh uses the term “intelligence” somewhat atypically when he says “The reason humans are so useful is not mainly their raw intelligence. It’s their ability to build up context, interrogate their own failures, and pick up small improvements and efficiencies as they practice a task.” I think people often consider how fast someone learns on the job as one aspect of intelligence. I agree there is a difference between short feedback loop intelligence (e.g. IQ tests) and long feedback loop intelligence and they are quite correlated in humans (while AIs tend to be relatively worse at long feedback loop intelligence).
Dwarkesh notes “An AI that is capable of online learning might functionally become a superintelligence quite rapidly, even if there’s no algorithmic progress after that point.” This seems reasonable, but it’s worth noting that if sample efficient learning is very compute expensive, then this might not happen so rapidly.
I think AIs will likely overcome poor sample efficiency to achieve a very high level of performance using a bunch of tricks (e.g. constructing a bunch of RL environments, using a ton of compute to learn when feedback is scarce, learning from much more data than humans due to “learn once deploy many” style strategies). I think we’ll probably see fully automated AI R&D prior to matching top human sample efficiency at learning on the job. Notably, if you do match top human sample efficiency at learning (while still using a similar amount of compute to the human brain), then we already have enough compute for this to basically immediately result in vastly superhuman AIs (human lifetime compute is maybe 3e23 FLOP and we’ll soon be doing 1e27 FLOP training runs). So, either sample efficiency must be worse or at least it must not be possible to match human sample efficiency without spending more compute per data-point/trajectory/episode.
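Spelling out the arithmetic behind that parenthetical, using the rough numbers from the paragraph above:

```python
# Rough numbers from the paragraph above.
human_lifetime_flop = 3e23   # approximate compute of a human lifetime of learning
training_run_flop = 1e27     # scale of near-term frontier training runs

# If matching top human sample efficiency cost roughly human-lifetime compute per
# "worker trained", a single run would buy thousands of lifetime-equivalents.
lifetime_equivalents = training_run_flop / human_lifetime_flop
print(f"{lifetime_equivalents:,.0f} human-lifetime equivalents")  # ~3,333
```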
I agree that robust self-verification and sample efficiency are the main things AIs are worse at than humans, and that this is basically just a quantitative difference. But what’s the best evidence that RL methods are getting more sample efficient (separate from AIs getting better at recognizing their own mistakes)? That’s not obvious to me but I’m not really read up on the literature. Is there a benchmark suite you think best illustrates that?
RL sample efficiency can be improved by both:
Better RL algorithms (including things that also improve pretraining sample efficiency like better architectures and optimizers).
Smarter models (in particular, smarter base models, though I’d also expect that RL in one domain makes RL in some other domain more sample efficient, at least after sufficient scale).
There isn’t great evidence that we’ve been seeing substantial improvements in RL algorithms recently, but AI companies are strongly incentivized to improve RL sample efficiency (as RL scaling appears to be yielding large returns), and there are a bunch of ML papers which claim to find substantial RL improvements (though it’s hard to be confident in these results for various reasons). So, we should probably infer that AI companies have made substantial gains in RL algorithms, but we don’t have public numbers. Claude 3.7 Sonnet was much better than Claude 3.5 Sonnet, but it’s hard to know how much of the gain was from RL algorithms vs. from other areas.
Minimally, there is a long-running trend of improvement in pretraining algorithmic efficiency, and many of these improvements should also transfer somewhat to RL sample efficiency.
As far as evidence that smarter models learn more sample efficiently, I think the DeepSeek-R1 paper has some results on this. It’s probably also possible to find various pieces of support for this in the literature, but I’m more familiar with various anecdotes.
I agree with most of Dwarkesh’s post, with essentially the same exceptions you’ve listed.
I wrote about this recently in “LLM AGI will have memory, and memory changes alignment”, drawing essentially the conclusions you’ve given above. Continuous learning is a critical strength of humans, and job substitution will be limited (but real) until LLM agents can do effective self-directed learning. It’s quite hard to say how fast that will happen, for the reasons you’ve given. Fine-tuning is a good deal like human habit/skill learning; the question is how well agents can select what to learn.
One nontrivial disagreement is on the barrier to long time-horizon task performance. Humans don’t learn long time-horizon task performance primarily from RL. We learn in several ways at different scales, including learning new strategies which can be captured in language. All of those types of learning do rely on self-assessment and decisions about what’s worth learning, and those will be challenging to get out of LLM agents—although I don’t think there’s any fundamental barrier to squeezing workable judgments out of them, just some schlep in scaffolding and training to do it better.
Based on this logic, my median timelines are getting longer (although rapid progress is still quite possible and we are far from prepared).
But I’m getting somewhat less pessimistic about the prospect of having incompetent autonomous agents with self-directed learning. These would probably both take over some jobs and display egregious misalignment. I think they’ll be a visceral wakeup call that has decent odds of getting society properly freaked out about human-plus AI with a little time left to prepare.
On this claim: “However, I think Dwarkesh is wrong to suggest that RL fine-tuning can’t be qualitatively similar to how humans learn. In the post, he discusses AIs constructing verifiable RL environments for themselves based on human feedback and then argues this wouldn’t be flexible and powerful enough to work, but RL could be used more similarly to how humans learn.”
I think the key difference is that, as of right now, RL fine-tuning doesn’t change the AI’s weights after training, while continual learning is meant to focus on the AI’s weights changing after the notional training period ends.
Are you claiming that RL fine-tuning doesn’t change weights? This is wrong.
Maybe instead you’re saying “no one ongoingly does RL fine-tuning where they constantly are updating the weights throughout deployment (aka online training)”. My response is: sure, but they could do this, they just don’t because it’s logistically/practically pretty annoying and the performance improvement wouldn’t be that high, at least without some more focused R&D on making this work better.
Yes, you got me right. My claim here isn’t that this work would be super-high-return right now (sample efficiency and not being robust to optimization against a verifier are the big issues at the moment), but I am claiming that, in practice, online training/constantly updating the weights throughout deployment will be necessary for AI to automate important jobs like AI research, because most tasks require history/continuously updating on successes and failures rather than 1-shotting the problem.
If you are correct that there is a known solution and it merely requires annoying logistical/practical work, then I’d accept short timelines as the default (modulo long-term memory issues in AI).
To expand on this, I also expect by default that something like a long-term memory/state will be necessary, because not having a memory means the AI has to relearn basic skills dozens of times, and this drastically lengthens the time to complete a task, to the point that it’s not viable to use an AI instead of a human.
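As a concrete illustration of the kind of long-term memory/state being described, here is a minimal sketch of an external store that an agent writes lessons into after each task and queries before the next one, so basic skills don’t have to be relearned each time. This is illustrative only (a real system would presumably retrieve with embeddings rather than word overlap, and might write to weights instead of text).

```python
from dataclasses import dataclass, field

@dataclass
class LessonMemory:
    lessons: list[str] = field(default_factory=list)

    def record(self, lesson: str) -> None:
        """Store a short natural-language takeaway after finishing a task."""
        self.lessons.append(lesson)

    def recall(self, task_description: str, k: int = 3) -> list[str]:
        """Return the k stored lessons sharing the most words with the new task."""
        task_words = set(task_description.lower().split())
        ranked = sorted(
            self.lessons,
            key=lambda lesson: len(task_words & set(lesson.lower().split())),
            reverse=True,
        )
        return ranked[:k]

# Usage: write after each episode, read at the start of the next one.
memory = LessonMemory()
memory.record("when the build fails, check dependency versions before touching code")
print(memory.recall("debug a failing build"))
```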
I think some long tasks are like a long list of steps that only require the output of the most recent step, so they don’t really need long context. AI improves at those just by becoming more reliable and making fewer catastrophic mistakes. On the other hand, some tasks need the AI to remember and learn from everything it’s done so far, and that’s where it struggles: see how Claude Plays Pokémon gets stuck in loops and has to relearn things dozens of times.
I haven’t read the METR paper in full, but from the examples given I’m worried the tests might be biased in favor of an agent with no capacity for long-term memory, or at least might not hit the thresholds where context limitations become a problem:
For instance, task #3 here is at the limit of current AI capabilities (takes an hour). But it’s also something that could plausibly be done with very little context; if the AI just puts all of the example files in its context window it might be able to write the rest of the decoder from scratch. It might not even need to have the example files in memory while it’s debugging its project against the test cases.
Whereas a task to fix a bug in a large software project, while it might take an engineer associated with that project “an hour” to finish, requires stretching the limits of the amount of information an AI can fit inside a context window, or recall beyond what models seem to be capable of today.
One possibility I’ve wondered about is whether AI can automate this learning work: start from a transcript of someone trying to do things with AI, with mistakes and subsequent feedback, and then curate data from that transcript which works well for RL fine-tuning. Or even distill it into examples for in-context learning (which probably works somewhat well, sometimes, today).
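A minimal sketch of what that curation step could look like, assuming the transcript is a list of role-tagged turns (the format and field names here are made up for illustration): pull out the (feedback, corrected attempt) pairs for fine-tuning, or concatenate a few of them into an in-context prefix.

```python
def extract_training_pairs(transcript):
    """transcript: list of {"role": ..., "content": ...} turns.
    Naively pairs each user turn with the assistant turn that follows it; a real
    curation step would filter for turns that actually contain corrective feedback
    and keep only the pairs where the follow-up attempt succeeded."""
    pairs = []
    for turn, nxt in zip(transcript, transcript[1:]):
        if turn["role"] == "user" and nxt["role"] == "assistant":
            pairs.append({"prompt": turn["content"], "completion": nxt["content"]})
    return pairs

def as_in_context_examples(pairs, k=3):
    """Alternative to fine-tuning: distill the same pairs into a few-shot prefix."""
    return "\n\n".join(f"{p['prompt']}\n{p['completion']}" for p in pairs[:k])
```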