Does anyone have a sense of whether, qualitatively, RL stability has been solved for any practical domains?
This question is at least partly asking for qualitative speculation about how post-training RL works at the big labs, but I’m interested in any partial answer people can come up with.
My impression of RL is that there are a lot of tricks to “improve stability”, but performance remains path-dependent in pretty much any realistic/practical setting (where the state space is huge and the action space may be huge or continuous). Even on larger toy problems my sense is that various RL algorithms succeed maybe 70% of the time, and in the other 30% of runs the reward randomly collapses partway through training.
One obvious way of getting around this is to just resample: rerun training under different seeds and keep the best run (a sketch of what I mean is below). If there are no more principled or reliable methods, this would be the default way of getting a good result from RL, and it would follow that this is just what the big labs do. Of course, they may have quietly solved some of this internally, but it’s hard to imagine what form that would take.
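For concreteness, here is a minimal sketch of the “just resample” strategy I have in mind: train the same setup under several seeds and keep whichever run ends with the highest evaluation reward. Everything here is illustrative, not a claim about any lab’s practice; `train_rl_run` is a stub standing in for a full training loop, and the 30% collapse rate is just the rough figure from my impression above.

```python
import numpy as np


def train_rl_run(seed: int) -> float:
    """Stand-in for a single RL training run; returns the final eval reward.

    In practice this would be a full PPO/SAC/etc. training loop. Here it is a
    stub so that the resampling logic itself is runnable. Roughly 30% of runs
    "collapse" to a low reward, mirroring the failure rate described above.
    """
    rng = np.random.default_rng(seed)
    if rng.random() < 0.3:
        return 0.2  # collapsed run
    return float(rng.normal(loc=0.9, scale=0.05))  # healthy run


def best_of_n(num_seeds: int = 5) -> tuple[int, float]:
    """Naive 'resample until it works': train under several seeds, keep the best."""
    results = {seed: train_rl_run(seed) for seed in range(num_seeds)}
    best_seed = max(results, key=results.get)
    return best_seed, results[best_seed]


if __name__ == "__main__":
    seed, reward = best_of_n(num_seeds=5)
    print(f"best seed: {seed}, final reward: {reward:.3f}")
```

With an independent 30% failure rate per run, five seeds already push the chance that every run collapses below 1%, which is why brute-force resampling feels like the obvious default if nothing more principled exists.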