Another problem is that this is too expressive. For any policy π, you can encode that policy into this sort of gradient: if π takes action a in state s, say that your gradient points toward a (or toward the state s′ that results from taking action a), and away from every other action / state.
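Here is a minimal sketch of that construction in Python, assuming a deterministic policy over a finite action set (all names and types below are hypothetical stand-ins, not from the original). The "gradient" assigns +1 to the action the policy takes in a state and −1 to every other action, so greedily following it reproduces the policy exactly:

```python
from typing import Callable, Dict, Hashable, List

State = Hashable
Action = Hashable


def encode_policy_as_gradient(
    policy: Callable[[State], Action],
    actions: List[Action],
) -> Callable[[State], Dict[Action, float]]:
    """Turn an arbitrary deterministic policy into a per-state 'gradient'
    that points toward the policy's chosen action and away from the rest."""
    def gradient(state: State) -> Dict[Action, float]:
        chosen = policy(state)
        # +1 toward the action π takes in this state, -1 away from all others.
        return {a: (1.0 if a == chosen else -1.0) for a in actions}
    return gradient


def follow_gradient(
    gradient: Callable[[State], Dict[Action, float]],
    state: State,
) -> Action:
    """Act greedily with respect to the encoded gradient."""
    g = gradient(state)
    return max(g, key=g.get)


# Any policy at all can be recovered this way, which is the sense in
# which the representation is "too expressive": it imposes no structure.
actions = ["left", "right"]
policy = lambda s: "left" if s % 2 == 0 else "right"
grad = encode_policy_as_gradient(policy, actions)
assert all(follow_gradient(grad, s) == policy(s) for s in range(10))
```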