Unfortunately, we can’t just copy this trick. Artificial evolution requires that we decide how to kill off / reproduce things, in the same way that animal breeding requires breeders to decide what they’re optimizing for. This puts us back at square one; IE, needing to get our gradient from somewhere else.
Suppose we have a good reward function (as is typically assumed in deep RL). We can just copy the trick in that setting, right? But the rest of the post makes it sound like you still think there’s a problem, in that even with that reward, you don’t know how to assign credit to each individual action. This is a problem that evolution also has; evolution seemed to manage it just fine.
(Similarly, even if you think actor-critic methods don’t count, surely REINFORCE is one-level learning? It works okay; added bells and whistles like critics are improvements to its sample efficiency.)
Yeah, I pretty strongly think there’s a problem—not necessarily an insoluble problem, but one which has not been convincingly solved by any algorithm I’ve seen. I think presentations of ML often obscure the problem (because it’s not that big a deal in practice—you can often define good enough episode boundaries or whatnot).
Suppose we have a good reward function (as is typically assumed in deep RL). We can just copy the trick in that setting, right? But the rest of the post makes it sound like you still think there’s a problem, in that even with that reward, you don’t know how to assign credit to each individual action. This is a problem that evolution also has; evolution seemed to manage it just fine.
Yeah, I feel like “matching rewards to actions is hard” is a pretty clear articulation of the problem.
I agree that it should be surprising, in some sense, that getting rewards isn’t enough. That’s why I wrote a post on it! But why do you think it should be enough? How do we “just copy the trick”??
I don’t agree that this is analogous to the problem evolution has. If evolution just “received” the overall population each generation, and had to figure out which genomes were good/bad based on that, it would be a more analogous situation. However, that’s not at all the case. Evolution “receives” a fairly rich vector of which genomes were better/worse, each generation. The analogous case for RL would be if you could output several actions each step, rather than just one, and receive feedback about each. But this is basically “access to counterfactuals”; to get this, you need a model.
(Similarly, even if you think actor-critic methods don’t count, surely REINFORCE is one-level learning? It works okay; added bells and whistles like critics are improvements to its sample efficiency.)
No, definitely not, unless I’m missing something big.
From page 329 of this draft of Sutton & Barto:

Note that REINFORCE uses the complete return from time t, which includes all future rewards up until the end of the episode. In this sense REINFORCE is a Monte Carlo algorithm and is well defined only for the episodic case with all updates made in retrospect after the episode is completed (like the Monte Carlo algorithms in Chapter 5). This is shown explicitly in the boxed pseudocode on the next page.
So, REINFORCE “solves” the assignment of rewards to actions via the blunt device of an episodic assumption; all rewards in an episode are grouped with all actions during that episode. If you expand the episode to infinity (so as to make no assumption about episode boundaries), then you just aren’t learning. This means it’s not applicable to the case of an intelligence wandering around and interacting dynamically with a world, where there’s no particular bound on how the past may relate to present reward.
The “model” is thus extremely simple and hardwired, which makes it seem one-level. But you can’t get away with this if you want to interact and learn on-line with a really complex environment.
Also, since the episodic assumption is a form of myopia, REINFORCE is compatible with the conjecture that any gradients we can actually construct are going to incentivize some form of myopia.
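(For concreteness, here is a minimal sketch of episodic REINFORCE in the spirit of the Sutton & Barto presentation quoted above. The tabular softmax policy and the environment interface are made-up placeholders, not any particular library. The point is where the credit assignment happens: the action at time t is credited with every reward from t to the end of the episode, and nothing outside the episode ever touches it.)

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def reinforce_episode(env, theta, alpha=0.01, gamma=0.99):
    """One episode of REINFORCE with a tabular softmax policy (illustrative only).

    theta: (n_states, n_actions) array of policy logits.
    env:   assumed to expose reset() -> state and step(action) -> (state, reward, done);
           this interface is a placeholder, not a real library API.
    """
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:                        # roll out one complete episode (the episodic assumption)
        p = softmax(theta[s])
        a = np.random.choice(len(p), p=p)
        states.append(s)
        actions.append(a)
        s, r, done = env.step(a)
        rewards.append(r)

    # Credit assignment by episode boundary: the action at time t gets the discounted
    # sum of all rewards from t to the end of this episode, and no reward from outside
    # the episode is ever associated with it.
    G = 0.0
    for t in reversed(range(len(rewards))):
        G = rewards[t] + gamma * G
        p = softmax(theta[states[t]])
        grad_log_pi = -p
        grad_log_pi[actions[t]] += 1.0     # gradient of log softmax w.r.t. the logits
        theta[states[t]] += alpha * (gamma ** t) * G * grad_log_pi
    return theta
```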
Oh, I see. You could also have a version of REINFORCE that doesn’t make the episodic assumption, where every time you get a reward, you take a policy gradient step for each of the actions taken so far, with a weight that decays as actions go further back in time. You can’t prove anything interesting about this, but you also can’t prove anything interesting about actor-critic methods that don’t have episode boundaries, I think. Nonetheless, I’d expect it would somewhat work, in the same way that an actor-critic method would somewhat work. (I’m not sure which I expect to work better; partly it depends on the environment and the details of how you implement the actor-critic method.)
(All of this said with very weak confidence; I don’t know much RL theory)
You could also have a version of REINFORCE that doesn’t make the episodic assumption, where every time you get a reward, you take a policy gradient step for each of the actions taken so far, with a weight that decays as actions go further back in time. You can’t prove anything interesting about this, but you also can’t prove anything interesting about actor-critic methods that don’t have episode boundaries, I think.
Yeah, you can do this. I expect actor-critic to work better, because your suggestion is essentially a fixed model which says that actions are more relevant to temporally closer rewards (and that this is the only factor to consider).
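(A sketch of the non-episodic variant under discussion, using the same illustrative tabular setup as the episodic sketch above. Every time a reward arrives, each past action gets a policy gradient step, weighted by an exponential decay in how long ago it was taken; that decay schedule is exactly the hand-coded "fixed model" tying rewards to temporally nearby actions, where an actor-critic would instead consult a learned value estimate.)

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def decayed_reward_step(theta, history, reward, alpha=0.01, decay=0.99):
    """On receiving `reward`, nudge every past (state, action) pair, with a weight
    that shrinks the further back the action lies (no episode boundary assumed).

    history: list of (state, action) pairs taken so far, oldest first.
    The exponential decay is the fixed assumption that nearby actions matter most.
    """
    weight = 1.0
    for s, a in reversed(history):         # most recent action gets the largest weight
        p = softmax(theta[s])
        grad_log_pi = -p
        grad_log_pi[a] += 1.0
        theta[s] += alpha * weight * reward * grad_log_pi
        weight *= decay
    return theta
```

(In practice one would fold this into a running eligibility trace over the log-policy gradients rather than re-walking the whole history on every reward; that is the same computation done incrementally.)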
I’m not sure how to further convey my sense that this is all very interesting. My model is that you’re like “ok sure” but don’t really see why I’m going on about this.
I’m not sure how to further convey my sense that this is all very interesting. My model is that you’re like “ok sure” but don’t really see why I’m going on about this.
Yeah, I think this is basically right. For the most part though, I’m trying to talk about things where I disagree with some (perceived) empirical claim, as opposed to the overall “but why even think about these things”—I am not surprised when it is hard to convey why things are interesting in an explicit way before the research is done.
Here, I was commenting on the perceived claim of “you need to have two-level algorithms in order to learn at all; a one-level algorithm is qualitatively different and can never succeed”, where my response is “but no, REINFORCE would do okay, though it might be more sample-inefficient”. But it seems like you aren’t claiming that, just claiming that two-level algorithms do quantitatively but not qualitatively better.
Actually, that wasn’t what I was trying to say. But, now that I think about it, I think you’re right.
I was thinking of the discounting variant of REINFORCE as having a fixed, but rather bad, model associating rewards with actions: rewards are tied more with actions nearby. So I was thinking of it as still two-level, just worse than actor-critic.
But, although the credit assignment will make mistakes (a predictable punishment which the agent can do nothing to avoid will nonetheless make any actions leading up to the punishment less likely in the future), they should average out in the long run (those ‘wrongfully punished’ actions should also be ‘wrongfully rewarded’). So it isn’t really right to think it strongly depends on the assumption.
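(To spell out the "average out" step: this is just the usual score-function identity, nothing specific to this variant. Any reward component c that does not depend on the current action contributes zero to the expected update, since

\[
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[ c \, \nabla_\theta \log \pi_\theta(a \mid s) \right]
= c \sum_a \pi_\theta(a \mid s) \, \frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)}
= c \, \nabla_\theta \sum_a \pi_\theta(a \mid s)
= c \, \nabla_\theta 1 = 0,
\]

so an unavoidable, predictable punishment moves the policy only through sampling noise, not in expectation.)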
Instead, it’s better to think of it as a true discounting function. IE, it’s not an assumption about the structure of consequences; it’s an expression of how much the system cares about distant rewards when taking an action. Under this interpretation, REINFORCE indeed “closes the gradient gap”—solves the credit assignment problem w/o restrictive modeling assumptions.
Maybe. It might also be argued that REINFORCE depends on some properties of the environment such as ergodicity. I’m not that familiar with the details.
But anyway, it now seems like a plausible counterexample.