Is this identical to training the next-to-last layer to predict the rewards directly, and then just transforming those predictions to get a sample?
In the tabular case, that's equivalent given a uniform π0. It might also hold in the function-approximator policy-gradient regime, but that's only a maybe; it depends on inductive biases. And often we want a pretrained π0 (as when doing RLHF on LLMs), which isn't uniform.
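To make the tabular claim concrete, here is a minimal numerical sketch. It assumes the "transform" in question is a softmax over predicted rewards, and that the relevant optimum has the standard KL-regularized form π*(a) ∝ π0(a)·exp(r(a)/β); the temperature β, the toy rewards, and the stand-in pretrained prior are all placeholders, not anything from the original discussion.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, beta = 5, 1.0
r = rng.normal(size=n_actions)  # toy per-action reward predictions

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def tilt(pi0, r, beta):
    """Reward-tilted prior: pi0(a) * exp(r(a)/beta), renormalized."""
    w = pi0 * np.exp(r / beta)
    return w / w.sum()

uniform_pi0 = np.full(n_actions, 1.0 / n_actions)
pretrained_pi0 = softmax(rng.normal(size=n_actions))  # stand-in for a non-uniform prior

# With a uniform prior, tilting collapses to a softmax of the rewards...
print(np.allclose(tilt(uniform_pi0, r, beta), softmax(r / beta)))      # True
# ...but with a pretrained (non-uniform) prior it generally does not.
print(np.allclose(tilt(pretrained_pi0, r, beta), softmax(r / beta)))   # False
```

Under these assumptions, "predict rewards, then transform" and "tilt the prior by the rewards" only coincide when π0 carries no information of its own, which is exactly the uniform case the answer points to.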