@ryan_greenblatt argues that continual learning/online training can already be done, but that right now the returns aren't very high, it requires annoying logistical/practical work, and the binding constraints on AI are currently elsewhere, like sample efficiency and robust self-verification.
If so, that would explain why the likelihood of getting AGI by the 2030s is pretty high:
Are you claiming that RL fine-tuning doesn’t change weights? This is wrong.
Maybe instead you’re saying “no one does ongoing RL fine-tuning where the weights are constantly updated throughout deployment (aka online training)”. My response is: sure, but they could do this; they just don’t, because it’s logistically/practically pretty annoying and the performance improvement wouldn’t be that high, at least without some more focused R&D on making this work better.
My best guess is that the way humans learn on the job is mostly by noticing when something went well (or poorly) and then sample-efficiently updating (with their brain doing something analogous to an RL update). In some cases, this is based on external feedback (e.g. from a coworker), and in some cases it’s based on self-verification: the person just looks at the outcome of their actions and determines whether it went well or poorly.
So, you could imagine RL’ing an AI based on both external feedback and self-verification like this. And, this would be a “deliberate, adaptive process” like human learning. Why would this currently work worse than human learning?
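As a sketch of what this could look like (everything here is a hypothetical toy, not any lab’s actual pipeline: the two-action task, the noisy `self_verify` grader, and the single-parameter policy are all stand-ins), an online RL loop that mixes sparse external feedback with self-verification might be:

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def external_feedback(action):
    # Hypothetical stand-in for a coworker/user rating: action 1 is the good one.
    return 1.0 if action == 1 else 0.0

def self_verify(action):
    # Hypothetical stand-in for the model grading its own outcome.
    # It's a noisy verifier (10% mislabel rate); in practice it would also
    # need to be robust to the policy optimizing against it.
    correct = 1.0 if action == 1 else 0.0
    return correct if random.random() < 0.9 else 1.0 - correct

w = 0.0   # single policy parameter for a toy two-action task
lr = 0.5
for step in range(500):
    p = sigmoid(w)                        # probability of taking action 1
    action = 1 if random.random() < p else 0
    # External feedback is sparse (every 10th episode here);
    # fall back to self-verification the rest of the time.
    reward = external_feedback(action) if step % 10 == 0 else self_verify(action)
    # REINFORCE update: d/dw log pi(action) = action - p for a Bernoulli policy.
    w += lr * reward * (action - p)

# After training, sigmoid(w) should be close to 1: the policy prefers action 1,
# having learned online from a mix of external and self-generated reward.
```

The point of the toy is just that the same update rule runs continuously through “deployment,” with self-verification filling in between the occasional external signals; whether this works well for a real model is exactly the quantitative question discussed below.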
Current AIs are worse than humans at two things, which make RL (quantitatively) much worse for them:
Robust self-verification: the ability to correctly determine when you’ve done something well/poorly in a way which is robust to you optimizing against it.
Sample efficiency: how much you learn from each update (potentially leveraging stuff like determining what caused things to go well/poorly which humans certainly take advantage of). This is especially important if you have sparse external feedback.
But, these are more like quantitative than qualitative issues IMO. AIs (and RL methods) are improving at both of these.
All that said, I think it’s very plausible that the route to better continual learning routes more through building on in-context learning (perhaps through something like neuralese, though this would greatly increase misalignment risks...).
For many (IMO most) useful tasks, AIs are limited by something other than “learning on the job”. At autonomous software engineering, they fail to match what humans can do with 3 hours of time, and they are typically limited by being bad agents or by being generally dumb/confused. To be clear, it seems totally plausible that for the podcasting tasks Dwarkesh mentions, learning is the limiting factor.
Correspondingly, I’d guess the reason we don’t see people trying more complex RL-based continual learning in normal deployments is that there is lower-hanging fruit elsewhere and typically something else is the main blocker. I agree that if you had human-level sample efficiency in learning, this would immediately yield strong results (e.g., you’d presumably have very superhuman AIs with 10^26 FLOP); I’m just making a claim about more incremental progress.
I think AIs will likely overcome poor sample efficiency to achieve a very high level of performance using a bunch of tricks (e.g. constructing a bunch of RL environments, using a ton of compute to learn when feedback is scarce, learning from much more data than humans due to “learn once, deploy many”-style strategies). I think we’ll probably see fully automated AI R&D prior to matching top human sample efficiency at learning on the job. Notably, if you do match top human sample efficiency at learning (while still using a similar amount of compute to the human brain), then we already have enough compute for this to basically immediately result in vastly superhuman AIs (human lifetime compute is maybe 3e23 FLOP and we’ll soon be doing 1e27 FLOP training runs). So, either sample efficiency must be worse, or at least it must not be possible to match human sample efficiency without spending more compute per data-point/trajectory/episode.
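The back-of-the-envelope arithmetic behind that last point (using the post’s own 3e23 and 1e27 figures) is straightforward:

```python
# Figures from the text: rough human-lifetime brain compute vs. a near-future
# frontier training run.
human_lifetime_flop = 3e23
training_run_flop = 1e27

# How many human-lifetimes' worth of compute fit in one training run.
ratio = training_run_flop / human_lifetime_flop
print(f"{ratio:.0f}")  # prints 3333
```

So a 1e27 FLOP run is on the order of thousands of human lifetimes of compute, which is why matching human sample efficiency at that compute scale would imply vastly superhuman systems rather than merely human-level ones.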
https://www.lesswrong.com/posts/FG54euEAesRkSZuJN/#pEBbFmMm9bvmgotyZ
Ryan Greenblatt’s original comment:
https://www.lesswrong.com/posts/FG54euEAesRkSZuJN/#xMSjPgiFEk8sKFTWt