Actually, that wasn’t what I was trying to say. But, now that I think about it, I think you’re right.
I was thinking of the discounting variant of REINFORCE as having a fixed, but rather bad, model associating rewards with actions: rewards are tied more with actions nearby. So I was thinking of it as still two-level, just worse than actor-critic.
But, although the credit assignment will make mistakes (a predictable punishment which the agent can do nothing to avoid will nonetheless make any actions leading up to the punishment less likely in the future), they should average out in the long run (those ‘wrongfully punished’ actions should also be ‘wrongfully rewarded’). So it isn’t really right to think it strongly depends on the assumption.
Instead, it’s better to think of it as a true discounting function. I.e., it’s not an assumption about the structure of consequences; it’s an expression of how much the system cares about distant rewards when taking an action. Under this interpretation, REINFORCE indeed “closes the gradient gap”—solves the credit assignment problem w/o restrictive modeling assumptions.
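To make the “rewards are tied more with actions nearby” picture concrete, here is a minimal sketch (my own illustration, not anyone’s canonical implementation) of the per-timestep credit that the discounting variant of REINFORCE assigns: the log-probability of the action at time t is reinforced in proportion to the discounted return G_t, so an action gets exponentially less credit or blame for temporally distant rewards.

```python
import numpy as np

def discounted_credit(rewards, gamma=0.9):
    """Per-timestep return G_t = sum_{k>=t} gamma^(k-t) * r_k.

    In the discounting variant of REINFORCE, the gradient of the
    log-probability of the action taken at time t is weighted by G_t,
    so rewards count more toward the actions that preceded them closely.
    """
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

# A single unavoidable punishment at the end: the actions closer to it
# are blamed more, regardless of whether they caused it. Over many
# episodes this 'wrongful blame' should average out, since the same
# actions also collect 'wrongful credit'.
credit = discounted_credit([0.0, 0.0, -1.0], gamma=0.5)
```

Here `credit` comes out as `[-0.25, -0.5, -1.0]`: each step back from the punishment halves the blame, which is the fixed, structure-free credit model the comment above is describing.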
Maybe. It might also be argued that REINFORCE depends on some properties of the environment, such as ergodicity. I’m not that familiar with the details.
But anyway, it now seems like a plausible counterexample.