Is this identical to training the next-to-last layer to predict the rewards directly, and then just transforming those predictions to get a sample?
In the tabular case, that's equivalent given a uniform π0. It might also hold in the function-approximator policy-gradient regime, but that's only a maybe; it depends on inductive biases. And often we want a pretrained π0 (as when doing RLHF on LLMs), which isn't uniform.
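To make the tabular claim concrete, here is a minimal numerical sketch. It assumes the "transform" in question is a softmax over predicted rewards, and that the relevant optimum has the standard KL-regularized form π*(a) ∝ π0(a)·exp(r(a)/β); the temperature β, the toy rewards, and the stand-in pretrained prior are all placeholders, not anything from the original discussion.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, beta = 5, 1.0
r = rng.normal(size=n_actions)  # toy per-action reward predictions

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def tilt(pi0, r, beta):
    """Reward-tilted prior: pi0(a) * exp(r(a)/beta), renormalized."""
    w = pi0 * np.exp(r / beta)
    return w / w.sum()

uniform_pi0 = np.full(n_actions, 1.0 / n_actions)
pretrained_pi0 = softmax(rng.normal(size=n_actions))  # stand-in for a non-uniform prior

# With a uniform prior, tilting collapses to a softmax of the rewards...
print(np.allclose(tilt(uniform_pi0, r, beta), softmax(r / beta)))      # True
# ...but with a pretrained (non-uniform) prior it generally does not.
print(np.allclose(tilt(pretrained_pi0, r, beta), softmax(r / beta)))   # False
```

Under these assumptions, "predict rewards, then transform" and "tilt the prior by the rewards" only coincide when π0 carries no information of its own, which is exactly the uniform case the answer points to.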