I agree with the sentiment of this. When writing the post, I was aware that it would not apply to all cases of RL.
However, I think I disagree with it w.r.t. non-vanilla policy gradient methods (TRPO, PPO, etc.). Using advantage functions and baselines doesn’t change how things look from the perspective of the policy network. It is still only observing the environment and taking appropriate actions, and it never “sees” the reward. Any advantage functions are only used in step 2 of my example, not step 1. (I’m sure there are schemes where this is not the case, but I think I’m correct for PPO.)
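To make the two-step split concrete, here’s a rough sketch (not any particular library’s implementation; `env`, `policy`, and the function names are hypothetical, assuming a classic Gym-style API and a policy module that returns an action distribution):

```python
import torch

def collect_rollout(env, policy, steps):
    """Step 1: the policy network only maps observations to actions.
    Rewards are logged by the outer training loop; the network never sees them."""
    obs_buf, act_buf, rew_buf = [], [], []
    obs = env.reset()
    for _ in range(steps):
        obs_t = torch.as_tensor(obs, dtype=torch.float32)
        with torch.no_grad():
            action = policy(obs_t).sample()
        obs, reward, done, _ = env.step(action.numpy())
        obs_buf.append(obs_t); act_buf.append(action); rew_buf.append(reward)
        if done:
            obs = env.reset()
    return obs_buf, act_buf, rew_buf

def ppo_update(policy, optimizer, obs, actions, advantages, old_log_probs, clip=0.2):
    """Step 2: the only place reward information (via advantages) appears."""
    dist = policy(torch.stack(obs))
    ratio = torch.exp(dist.log_prob(torch.stack(actions)) - old_log_probs)
    # Clipped surrogate objective: advantages scale the gradient, nothing more.
    loss = -torch.min(ratio * advantages,
                      torch.clamp(ratio, 1 - clip, 1 + clip) * advantages).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```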
I’m less sure of this, but even in model-based systems, Q-learning, etc., the planning/iteration happens with respect to the outputs of a value network, which is trained to be correlated with the reward but isn’t the reward itself. For example, I would say that the MCTS procedure of MuZero does “want” something, but that thing is not plans that get high reward; it’s plans that score highly according to the system’s value function. (I’m happy to be deferential on this, though.)
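Something like this toy sketch is what I have in mind (`dynamics_model` and `value_net` are hypothetical stand-ins for the learned components, and this follows my framing above rather than MuZero’s exact search):

```python
def score_plan(state, plan, dynamics_model, value_net):
    """Roll a candidate plan forward in the learned model, then score the
    resulting latent state with the value network. The environment's actual
    reward never appears here."""
    s = state
    for action in plan:
        s = dynamics_model(s, action)  # learned transition, no env access
    return value_net(s)                # the only notion of "good" in the loop

def pick_plan(state, candidate_plans, dynamics_model, value_net):
    # The planner "wants" whatever the value net scores highly.
    return max(candidate_plans,
               key=lambda p: score_plan(state, p, dynamics_model, value_net))
```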
The other interesting case is Decision Transformers. DTs absolutely “get” reward. It is explicitly an input to the model! But I mentally bucket them as generative models, as opposed to systems that “want” reward.
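A sketch of the interface I’m picturing (`dt_model` is a hypothetical autoregressive transformer over (return-to-go, state, action) tokens, in the style of the DT paper):

```python
def dt_act(dt_model, states, actions, returns_to_go, timesteps):
    """The desired return is literally part of the input sequence: the model
    "gets" reward, but only as conditioning, the way a prompt conditions an LM."""
    action_preds = dt_model(
        states=states,                # (1, T, state_dim)
        actions=actions,              # (1, T, act_dim)
        returns_to_go=returns_to_go,  # (1, T, 1)  <-- reward enters here
        timesteps=timesteps,          # (1, T)
    )
    return action_preds[:, -1]        # take the prediction for the latest step
```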
Considering that weight sharing between actor and critic networks is common practice, and that the critic passes gradients (learned from the reward) to the actor through those shared layers, for most practical purposes the actor gets all of the information it needs about the reward.
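A minimal sketch of the setup I mean (PyTorch-style, hypothetical shapes):

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """Common shared-torso actor-critic: the critic's loss is a direct function
    of observed rewards, and its gradients flow through layers the actor uses."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())  # shared
        self.actor_head = nn.Linear(hidden, act_dim)   # policy logits
        self.critic_head = nn.Linear(hidden, 1)        # value estimate

    def forward(self, obs):
        h = self.torso(obs)
        return self.actor_head(h), self.critic_head(h)

# In the update, reward information reaches the shared torso via the value loss:
#   logits, values = net(obs)
#   value_loss = (returns - values.squeeze(-1)).pow(2).mean()
#   (policy_loss + value_coef * value_loss).backward()  # gradients hit `torso`
```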
This is the case for many common architectures.
Yeah. For non-vanilla PG methods, I didn’t mean to imply that the policy “sees” the rewards in step 1. I meant that a part of the agent (its value function) “sees” the rewards, in the sense that they are the direct supervision signals used to train it in step 2, where we determine the direction and strength of the policy update.
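Concretely, the “sees” I mean is plain supervised regression onto reward-derived targets, something like:

```python
import torch

def rewards_to_go(rewards, gamma=0.99):
    """Discounted rewards-to-go: the value net's direct supervision signal."""
    targets, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        targets.append(g)
    return torch.tensor(list(reversed(targets)))

# Step 2 (sketch): the value function is trained on rewards directly, and the
# resulting advantages set the direction and strength of the policy update.
#   targets = rewards_to_go(rewards)
#   value_loss = (value_net(obs) - targets).pow(2).mean()
#   advantages = targets - value_net(obs).detach()
```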
And yeah, the model-based case is weirder. I can’t recall whether predicted rewards (from the dynamics model, not from the value function) are part of the upper confidence bound score in MuZero. If they are, I’d think it’s fair to say that the overall policy (i.e., not just the policy network, but the policy w/ MCTS) “wants” reward.
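For reference, my best recollection of the paper (treat this as an assumption; I haven’t re-checked): the values backed up through the tree are n-step returns that mix the dynamics model’s predicted rewards with the leaf’s value estimate, so predicted rewards would enter the pUCT score indirectly through Q. Roughly:

```python
def backup_value(predicted_rewards, leaf_value, gamma=0.997):
    """G = r_1 + gamma * r_2 + ... + gamma^(n-1) * r_n + gamma^n * v(leaf)."""
    g = leaf_value
    for r in reversed(predicted_rewards):
        g = r + gamma * g
    return g

# pUCT then selects with Q(s, a) (the mean of these backed-up values) plus an
# exploration bonus, so on this reading the search optimizes predicted reward
# plus predicted value, not predicted value alone.
```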