ryan_greenblatt comments on ryan_greenblatt’s Shortform

ryan_greenblatt 4 Jun 2025 15:23 UTC
12 points
0
Are you claiming that RL fine-tuning doesn’t change weights? This is wrong.

Maybe instead you’re saying “no one ongoingly does RL fine-tuning where they constantly are updating the weights throughout deployment (aka online training)”. My response is: sure, but they could do this, they just don’t because it’s logistically/practically pretty annoying and the performance improvement wouldn’t be that high, at least without some more focused R&D on making this work better.
What links here?
- Noosphere89's comment on Cole Wyeth’s Shortform by Cole Wyeth (5 Jun 2025 16:13 UTC; 2 points)
- Noosphere89 4 Jun 2025 15:43 UTC
  2 points
  0
  Parent
  Maybe instead you’re saying “no one ongoingly does RL fine-tuning where they constantly are updating the weights throughout deployment (aka online training)”. My response is: sure, but they could do this, they just don’t because it’s logistically/practically pretty annoying and the performance improvement wouldn’t be that high, at least without some more focused R&D on making this work better.
  Yes, you got me right, and my claim here isn’t that the work is super-high returns right now, and sample efficiency/not being robust to optimization against a verifier are currently the big issues at the moment, but I am claiming that in practice, online training/constantly updating the weights throughout deployment will be necessary for AI to automate important jobs like AI researchers, because most tasks require history/continuously updating on successes and failures rather than 1-shotting the problem.
  If you are correct in that there is a known solution, and it merely requires annoying logistical/practical work, then I’d accept short timelines as the default (modulo long-term memory issues in AI).
  To expand on this, I also expect by default that something like a long-term memory/state will be necessary, due to the issue that not having a memory means you have to relearn basic skills dozens of times, and this drastically lengthens the time to complete a task to the extent that it’s not viable to use an AI instead of a human.
  Comments below:
  https://www.lesswrong.com/posts/hhbibJGt2aQqKJLb7/shortform-1#Snvr22zNTXmHcAhPA
  I think some long tasks are like a long list of steps that only require the output of the most recent step, and so they don’t really need long context. AI improves at those just by becoming more reliable and making fewer catastrophic mistakes. On the other hand, some tasks need the AI to remember and learn from everything it’s done so far, and that’s where it struggles- see how Claude Plays Pokémon gets stuck in loops and has to relearn things dozens of times.
  https://www.lesswrong.com/posts/hhbibJGt2aQqKJLb7/shortform-1#vFq87Ge27gashgwy9
  I haven’t read the METR paper in full, but from the examples given I’m worried the tests might be biased in favor of an agent with no capacity for long term memory, or at least not hitting the thresholds where context limitations become a problem:
  For instance, task #3 here is at the limit of current AI capabilities (takes an hour). But it’s also something that could plausibly be done with very little context; if the AI just puts all of the example files in its context window it might be able to write the rest of the decoder from scratch. It might not even need to have the example files in memory while it’s debugging its project against the test cases.
  Whereas a task to fix a bug in a large software project, while it might take an engineer associated with that project “an hour” to finish, requires stretching the limits of the amount of information it can fit inside a context window, or recall beyond what we seem to be capable of doing today.