Is it OK to compute the advantage function as a difference of value functions?
To my understanding, the advantage in PPO is not simply the difference between value functions: it is computed with Generalized Advantage Estimation (GAE), which depends on the rewards and value estimates at later timesteps of the sampled trajectory.
Don't we necessarily have to use that estimator during PPO training?
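For concreteness, here is the GAE definition as I understand it from Schulman et al. (2016), built from the one-step TD error:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad \hat{A}_t^{\mathrm{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l \,\delta_{t+l}.$$

And a minimal sketch of the usual backward-pass computation (the function name and default hyperparameters are my own, just for illustration):

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """GAE over one sampled trajectory.

    rewards: r_0 .. r_{T-1}
    values:  V(s_0) .. V(s_T), i.e. one extra bootstrap value at the end.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    # Sweep backwards: each A_t accumulates discounted TD errors from
    # *later* timesteps, so it is not just a difference of two value
    # evaluations at time t.
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

Note that with $\lambda = 0$ this collapses to the one-step estimate $r_t + \gamma V(s_{t+1}) - V(s_t)$, which is the closest thing to "a difference of value functions" that I can see.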