Is this identical to training the next-to-last layer to predict the rewards directly, and then just transforming those predictions to get a sample? Have you considered going whole hog on model-based RL here?
I’d be interested in avoiding mode collapse in cases where that’s not practical, like diffusion models. Actually, could you choose a reward that makes diffusion models equivalent to MCMC? Probably no good safety reason to do such a thing, though.
Is this identical to training the next-to-last layer to predict the rewards directly, and then just transforming those predictions to get a sample?
In the tabular case, that’s equivalent given a uniform π0. It may also hold in the function-approximator policy-gradient regime, but that depends on inductive biases. Often, though, we want a pretrained π0 (as when doing RLHF on LLMs), which isn’t uniform.
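To spell the tabular claim out (a quick sketch, assuming the objective in question is reward maximization with a KL penalty toward π0 at temperature β, my notation): the optimum of

$$\max_\pi \; \mathbb{E}_{x\sim\pi}[r(x)] \;-\; \beta\,\mathrm{KL}(\pi \,\|\, \pi_0)$$

has the closed form

$$\pi^*(x) \;=\; \frac{\pi_0(x)\,\exp\!\big(r(x)/\beta\big)}{\sum_{x'} \pi_0(x')\,\exp\!\big(r(x')/\beta\big)}.$$

With uniform π0 this reduces to softmax(r/β), i.e., exactly “predict the rewards, then transform those predictions to get a sample.” With a non-uniform pretrained π0, the exp(r/β) factor gets tilted by π0, so the sampling distribution is no longer determined by the predicted rewards alone.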