Oh, this is a fascinating perspective.
So most uses of RL already just use a small bit of RL.
So if the goal was “only use a little bit of RL”, that’s already happening.
Hmm… I still wonder if using even less RL would be safer still.
I do think that if you found a zero-RL path to the same (or better) endpoint, it would often imply that you’ve grasped something about the problem more deeply, and that would often imply greater safety.
Some applications of RL are also just worse than equivalent options. As a trivial example, using reward sampling to construct a gradient that matches a supervised loss gradient just adds a bunch of clearly pointless intermediate steps.
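A minimal sketch of what that looks like, using a one-step softmax policy as a toy case (my construction, not anything from the original discussion; the reward shaping r(a) = 1[a = y*] / π(y*) is just one choice that makes the REINFORCE estimator match the cross-entropy gradient in expectation):

```python
# Toy demonstration: a REINFORCE-style gradient estimate that, in expectation,
# equals the supervised cross-entropy gradient -- the sampling is pure overhead.
import torch

torch.manual_seed(0)

n_actions = 5
logits = torch.randn(n_actions, requires_grad=True)  # one-step softmax "policy"
target = 2                                           # the supervised label y*

# Supervised gradient: d/dlogits of -log pi(y*), one exact backward pass.
supervised_loss = -torch.log_softmax(logits, dim=0)[target]
supervised_grad, = torch.autograd.grad(supervised_loss, logits)

# REINFORCE estimate of the same thing.
# With reward r(a) = 1[a == y*] / pi(y*), E[r(a) * grad log pi(a)] = grad log pi(y*),
# so the negated Monte Carlo average converges to the supervised gradient above.
probs = torch.softmax(logits, dim=0).detach()
n_samples = 200_000
actions = torch.multinomial(probs, n_samples, replacement=True)
rewards = (actions == target).float() / probs[target]

# For a softmax over logits, grad log pi(a) = one_hot(a) - probs.
grad_log_pi = torch.nn.functional.one_hot(actions, n_actions).float() - probs
reinforce_grad = -(rewards.unsqueeze(1) * grad_log_pi).mean(dim=0)

print(supervised_grad)  # exact
print(reinforce_grad)   # noisy estimate of the same vector
```

With enough samples the two printed vectors agree; the RL machinery only buys you variance on top of a gradient you could have computed directly.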
I suspect there are less trivial cases, like how a decision transformer isn’t just learning an optimal policy for its dataset but rather a supertask: what different levels of performance look like on that task. By subsuming an RL-ish task in prediction, the predictor can/must develop a broader understanding of the task, and that understanding can interact with other parts of the greater model. While I can’t currently point to strong empirical evidence here, my intuition would be that certain kinds of behavioral collapse would be avoided by the RL-via-predictor because the distribution is far more explicitly maintained during training.[1][2]
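A rough sketch of the return-conditioned-prediction idea, deliberately stripped down to a single step with no sequence model (all names and shapes here are invented for illustration; a real decision transformer conditions on whole trajectories):

```python
# Toy stand-in for decision-transformer-style training: the "RL task" becomes
# plain conditional prediction over an offline dataset.
import torch
import torch.nn as nn

N_STATES, N_ACTIONS, N_RETURN_BINS, D = 16, 4, 8, 32

class ReturnConditionedPolicy(nn.Module):
    """Predicts an action from (state, target return) -- one step, no transformer."""
    def __init__(self):
        super().__init__()
        self.state_emb = nn.Embedding(N_STATES, D)
        self.return_emb = nn.Embedding(N_RETURN_BINS, D)  # discretized return-to-go
        self.head = nn.Linear(D, N_ACTIONS)

    def forward(self, state, return_bin):
        return self.head(self.state_emb(state) + self.return_emb(return_bin))

model = ReturnConditionedPolicy()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Fake offline dataset of (state, achieved return bin, action taken).
# It deliberately contains good *and* bad behavior; nothing is filtered out,
# so the model has to learn what every performance level looks like.
states = torch.randint(0, N_STATES, (256,))
returns = torch.randint(0, N_RETURN_BINS, (256,))
actions = torch.randint(0, N_ACTIONS, (256,))

# Training is pure prediction: cross-entropy against the data distribution.
# No reward sampling, no policy gradient, no value baseline.
loss = nn.functional.cross_entropy(model(states, returns), actions)
loss.backward()
opt.step()

# At inference time you ask for high-return behavior by conditioning on it.
with torch.no_grad():
    high_return_logits = model(torch.tensor([3]), torch.tensor([N_RETURN_BINS - 1]))
```

The point is that the loss is ordinary cross-entropy against the offline data distribution, mediocre trajectories included, which is the sense in which the distribution is explicitly maintained rather than collapsed onto a single "best" behavior.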
But there are often reasons why the more-RL-shaped thing is currently being used. It’s not always trivial to swap over to something with some potential theoretical benefits when training at scale. So long as the RL-ish stuff fits within some reasonable bounds, I’m pretty okay with it, and I’d treat it as a low enough probability threat that you’d want to be very careful about how you replaced it, because the alternative might be sneakily worse.[3]
[1] KL divergence penalties are one thing, but it’s hard to do better than the loss directly forcing adherence to the distribution (spelled out a bit more formally below).
[2] You can also make a far more direct argument about model-level goal agnosticism in the context of prediction.
[3] I don’t think this is likely, to be clear. They’re just both pretty low-probability concerns (provided the optimization space is well-constrained).
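Spelling out footnote [1] a bit more formally (my gloss of the contrast, in standard notation rather than anything from the original text): a KL penalty only trades adherence to a reference policy off against reward, while the prediction loss is, up to a constant, exactly the KL divergence from the data distribution, so matching that distribution is the optimum itself rather than a regularizer.

```latex
% RLHF-style objective: adherence to the reference policy \pi_{ref} is only a soft
% penalty, traded off against the reward r via the coefficient \beta.
\[
  \max_{\theta}\;\;
    \mathbb{E}_{x}\,\mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[\, r(x, y) \,\big]
    \;-\; \beta\,\mathbb{E}_{x}\Big[\, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \Big]
\]

% Prediction (cross-entropy) loss: up to the data-entropy term (constant in \theta),
% it *is* the KL divergence from the data distribution, so it is minimized exactly
% when \pi_\theta matches p_{data} wherever the data has support.
\[
  \mathcal{L}(\theta)
    = \mathbb{E}_{x}\Big[ H\big( p_{\mathrm{data}}(\cdot \mid x),\, \pi_\theta(\cdot \mid x) \big) \Big]
    = \mathbb{E}_{x}\Big[ H\big( p_{\mathrm{data}}(\cdot \mid x) \big) \Big]
      + \mathbb{E}_{x}\Big[ \mathrm{KL}\big( p_{\mathrm{data}}(\cdot \mid x) \,\big\|\, \pi_\theta(\cdot \mid x) \big) \Big]
\]
```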