After thinking about this more, I have a hypothesis for how it works:
It records the sequence of states resulting from play, to be used for learning once the game is done. It has some Q value estimator which, given a state and an action, returns the expected utility of taking that action in that state (of course, this expected utility depends on the policies both players are using). Using the recorded sequence of states and actions (including exploration actions), it makes an update to this Q value estimator.
But this Q value estimator is not directly usable for determining a policy, since it needs to know the full state, which the policy does not have access to during play. Instead, the Q value estimator is used to compute a gradient update for the policy. After the game, the Q value estimator can tell you the value of taking each possible action at each actual time step in the game (since all game states are known at the end of the game); this can then be used to update the policy so that it is more likely to take the actions that are now estimated to be better.
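To make the hypothesis concrete, here is a minimal sketch of what such an end-of-game update could look like, written as a generic actor-critic in PyTorch. The network sizes, the Monte Carlo target for the Q estimator, and all names are my own illustrative assumptions, not taken from the actual algorithm; the point is just that the critic conditions on the full state while the policy conditions only on the observation.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

STATE_DIM, OBS_DIM, N_ACTIONS = 16, 8, 4  # illustrative sizes

# Critic: sees the *full* state (available once the game is over), one Q value per action.
q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
# Policy: sees only the player's observation, so it is usable during play.
policy = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))

q_opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
pi_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def update_after_game(states, observations, actions, returns):
    """states: list of full-state tensors (known at the end of the game);
    observations: list of what the player actually saw at each step;
    actions: list of action indices taken (including exploration actions);
    returns: observed return from each step onward."""
    states = torch.stack(states)
    observations = torch.stack(observations)
    actions = torch.tensor(actions)
    returns = torch.tensor(returns, dtype=torch.float32)

    # 1. Update the Q value estimator toward the observed returns
    #    (a simple Monte Carlo target, assumed here for illustration).
    q_taken = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    q_loss = ((q_taken - returns) ** 2).mean()
    q_opt.zero_grad(); q_loss.backward(); q_opt.step()

    # 2. Use the Q estimates at each actual time step to push the policy
    #    toward the actions that now look better (a policy-gradient step).
    with torch.no_grad():
        q_all = q_net(states)
        baseline = (torch.softmax(policy(observations), dim=-1) * q_all).sum(dim=-1)
        advantage = q_all.gather(1, actions.unsqueeze(1)).squeeze(1) - baseline
    log_probs = Categorical(logits=policy(observations)).log_prob(actions)
    pi_loss = -(log_probs * advantage).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
```

In this sketch the critic is only ever evaluated after the game, so it is free to condition on information the policy never gets to see, which is what makes the scheme compatible with partial observability.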
I’m not sure if this hypothesis is correct, but I don’t currently see anything wrong with this algorithm. Thank you for causing me to think of this.
(BTW, there is a well-defined difference between full and partial observability; see MDP vs. POMDP. Convergence theorems for vanilla Q-learning require the game to be an MDP rather than a POMDP.)