Is it OK to compute the advantage function as a difference of value functions?
To my understanding, the advantage in PPO is not simply the difference between value functions: it is computed with Generalized Advantage Estimation (GAE), which depends on the rewards and value estimates at later timesteps of the sampled trajectory.
Don't we necessarily have to use that estimator during PPO training?
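For concreteness, here is the GAE definition as I understand it from Schulman et al. (2016), built from the one-step TD error:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad \hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \,\delta_{t+l}.$$

And a minimal sketch of the usual backward-pass computation (the function name and default hyperparameters are my own, just for illustration):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """GAE over one sampled trajectory.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_T), i.e. one extra bootstrap value at the end.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    # Sweep backwards: each A_t accumulates discounted TD errors from
    # *later* timesteps, so it is not just a difference of two value
    # evaluations at time t.
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

Note that with $\lambda = 0$ this collapses to the one-step estimate $r_t + \gamma V(s_{t+1}) - V(s_t)$, which is the closest thing to "a difference of value functions" that I can see.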