Is this identical to training the next-to-last layer to predict the rewards directly, and then just transforming those predictions to get a sample? Have you considered going whole hog on model-based RL here?
I’d be interested in avoiding mode collapse in cases where that’s not practical, like diffusion models. Actually, could you choose a reward that makes diffusion models equivalent to MCMC? Probably no good safety reason to do such a thing, though.
Is this identical to training the next-to-last layer to predict the rewards directly, and then just transforming those predictions to get a sample?
In the tabular case, that’s equivalent given a uniform π0. It may also hold in the function-approximator policy-gradient regime, but that depends on inductive biases. Often, though, we want a pretrained π0 (as when doing RLHF on LLMs), which isn’t uniform.
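To spell the tabular claim out (a quick sketch, assuming the objective in question is reward maximization with a KL penalty toward π0 at temperature β, my notation): the optimum of

$$\max_\pi \; \mathbb{E}_{x\sim\pi}[r(x)] \;-\; \beta\,\mathrm{KL}(\pi \,\|\, \pi_0)$$

has the closed form

$$\pi^*(x) \;=\; \frac{\pi_0(x)\,\exp\!\big(r(x)/\beta\big)}{\sum_{x'} \pi_0(x')\,\exp\!\big(r(x')/\beta\big)}.$$

With uniform π0 this reduces to softmax(r/β), i.e., exactly “predict the rewards, then transform those predictions to get a sample.” With a non-uniform pretrained π0, the exp(r/β) factor gets tilted by π0, so the sampling distribution is no longer determined by the predicted rewards alone.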