My name is Charles Renshaw-Whitman. I am a physicist (‘symbol gremlin’) by training, currently a MATS scholar studying the connection between the structure of natural data and the structure of learned computations.
Currently training away an aversion to sharing my writing/thoughts publicly—please modulate tone of comments accordingly :)
I’ll take a look at the ProRL paper later today, thanks for the second.
I agree that RL inefficiency is one problem, but I think it can be reasonably factored out in experiments, if not in production. The “RL’s Razor” paper runs an experiment where they do SFT on a KL budget and show they get the same ‘reduced forgetting’ effect. In light of the ‘off the principals’ paper, I think of this reduced forgetting as something like inertia in learning new representations, or an inability to pass through regions of high curvature. That is, there are definitely still qualitative differences at the per-batch level; perhaps this is just an efficiency thing, but it’s plausible to me that it might be more like GD vs SGD in finding different types of solutions with different generalization properties because of the curvature bias. On priors this would surprise me for RL, but I guess I like these papers because they updated me away from that a bit.
As for the milestone self-play results, you’re right that they’ve no place in this story—my semi-cope pro-tem guess is that LLMs operate in a ‘different regime’. Two intuitions for this:
For the board games especially, there is no ‘curriculum of representations’: perhaps learning strategic action is easier for RL than learning to have good ontic chunkings of the world. E.g. for Go, there is no need to learn complex hierarchical representations as table stakes. The harder Atari games (e.g. Montezuma’s Revenge) are counter-evidence to this, except perhaps insofar as their hardness was due to needing more complex representations? This is an even poorer explanation for something like AlphaStar. But nonetheless, we ended up switching to the pre-training paradigm instead of riding A3C to ASI.
The relatively poor performance of things like process supervision for LLM training is still surprising to me and I can’t account for it. If value-learning methods ‘don’t work’ in the LLM regime, is this because there is a structural difference in the data? On priors this just has to be a skill issue, but presumably, had someone really solved this and gotten 3-5 OOMs of RLVR efficiency, wouldn’t we be done already? And presumably enough money and effort has been expended that, were dramatic success achievable, it’d’ve been done? (To be clear, by ‘work’ I mean “work so well that per-token gradients aren’t more than one or two OOMs worse than for pre-training”.)