Your understanding is good. What you refer to as “self-rewarding” is called “reward shaping” in reinforcement learning. DQN doesn’t use it, so I didn’t cover it in this post. DQN also doesn’t do well with games where you can’t stumble upon the first reward through random exploration to get learning started. So, for your Point 3: DQN doesn’t address this, but later RL methods do.
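For anyone curious, here’s a minimal sketch of what reward shaping can look like in its potential-based form. The goal position, the potential function, and all the names below are made up for illustration; DQN itself does none of this:

```python
import numpy as np

# Hypothetical goal position, purely for illustration.
GOAL = np.array([10.0, 5.0])

def potential(state):
    # A hand-designed guess at how promising a state is;
    # here, negative distance to a known goal location.
    return -np.linalg.norm(state - GOAL)

def shaped_reward(env_reward, state, next_state, gamma=0.99):
    # Potential-based shaping (Ng et al., 1999): adding
    # gamma * phi(s') - phi(s) leaves the optimal policy unchanged
    # while giving the agent a dense learning signal.
    return env_reward + gamma * potential(next_state) - potential(state)
```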
Montezuma’s Revenge is one such example: the first reward requires traversing several obstacles correctly and is vanishingly unlikely to happen by random chance. If you look up the history of Montezuma’s Revenge in reinforcement learning, you’ll find some interesting approaches to this problem, such as biasing the network towards states it hasn’t seen before (i.e., building “curiosity” into it). The basic DQN described in this post cannot solve Montezuma’s Revenge.
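The “curiosity” idea can be made concrete with a toy count-based bonus. This only sketches the general flavour; the actual Montezuma’s Revenge work uses far more sophisticated novelty estimates, and all the names here are mine:

```python
import math
from collections import defaultdict

visit_counts = defaultdict(int)

def curiosity_bonus(state_key, scale=0.1):
    # Pay out a bonus that shrinks each time a state is revisited,
    # so rarely-seen states look intrinsically rewarding.
    visit_counts[state_key] += 1
    return scale / math.sqrt(visit_counts[state_key])

# Training would then use env_reward + curiosity_bonus(obs.tobytes())
# (some hashable key for the observation) instead of env_reward alone.
```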
With your “5 keys” example, you’re almost correct. The agent needs to actually perform the entire combo at least once before it can learn to predict it: the target network looks at t+1, but that t+1 comes from the replay buffer, so the transition has to have been experienced at least once. Sooner or later you’ll sample the state after the fifth press and be able to predict that reward in the future, and the value then cascades backwards over time exactly as you described.
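To make the mechanics concrete, the target computation looks roughly like this (a sketch with hypothetical names, assuming a standard PyTorch-style DQN with a replay buffer and target network):

```python
import torch

def dqn_targets(batch, q_target_net, gamma=0.99):
    # batch is sampled from the replay buffer, so every (s, a, r, s')
    # tuple is a transition the agent actually performed at some point.
    states, actions, rewards, next_states, dones = batch

    with torch.no_grad():
        # Bootstrap from the target network's estimate of s_{t+1}.
        next_q = q_target_net(next_states).max(dim=1).values

    # Once the transition into the rewarding state (the fifth key press)
    # is in the buffer, its reward propagates one bootstrap step backwards
    # per update, cascading toward earlier states as training continues.
    return rewards + gamma * (1.0 - dones) * next_q
```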
Regarding Point 1, I’ve gone through and adjusted the terminology: “future reward” now refers to the sum of rewards from step t onwards, and I’ve defined that early in Section 2 to make it clear that the future reward also includes the reward from timestep t itself.
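For reference, with a discount factor γ (assuming the usual discounted form; γ = 1 gives the plain sum), that definition is:

```latex
R_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k}
    = r_t + \gamma r_{t+1} + \gamma^{2} r_{t+2} + \cdots
```

Either way, the sum starts at r_t itself, which was the point of the clarification.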
Regarding Point 2, you’re correct, and I’ve made those changes.
Thanks for the feedback, and I’m glad you got some useful info out of the article!