beren comments on On value in humans, other animals, and AI

beren 9 Feb 2023 20:38 UTC
LW: 1 AF: 1
My understanding is that, after a lot of simplifications, a policy gradient method just takes a noisy gradient step in the direction of minimising Bellman error, and so in the limit of infinite data and computation, with every state in the world visited, it is ‘guaranteed’ to converge to an optimal policy for the MDP. Q-learning and other model-free algorithms have similar guarantees. In practice, with function approximation and PPO’s regularisation terms, these guarantees no longer hold, but the fundamental RL algorithms they are built on do have them. The places to go deeper into this are Sutton and Barto’s textbook and Bertsekas’ dynamic programming textbook.
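As a rough illustration of the kind of guarantee meant here, the sketch below runs tabular Q-learning (not policy gradients, and not PPO) on a made-up four-state chain MDP and compares the result with the optimal Q-values from value iteration. The environment, the function names (`step`, `value_iteration`, `q_learning`), and the hyperparameters are all assumptions chosen for illustration; the point is only that, with a lookup table and every state-action pair visited many times, the learned values approach the optimal ones.

```python
import random

N_STATES = 4          # hypothetical chain: states 0..3, state 3 is terminal
ACTIONS = (0, 1)      # 0 = move left, 1 = move right
GAMMA = 0.9           # discount factor (assumed)


def step(state, action):
    """Deterministic chain; reward 1.0 only on entering the terminal state."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def value_iteration(sweeps=200):
    """Compute optimal Q-values by repeated Bellman optimality backups."""
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(sweeps):
        for s in range(N_STATES - 1):
            for a in ACTIONS:
                nxt, r, done = step(s, a)
                q[s][a] = r + (0.0 if done else GAMMA * max(q[nxt]))
    return q


def q_learning(episodes=5000, alpha=0.5, seed=0):
    """Tabular Q-learning with a uniformly random behaviour policy."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s = rng.randrange(N_STATES - 1)   # random non-terminal start state
        done = False
        while not done:
            a = rng.choice(ACTIONS)       # explore every action in every state
            nxt, r, done = step(s, a)
            target = r + (0.0 if done else GAMMA * max(q[nxt]))
            q[s][a] += alpha * (target - q[s][a])   # sampled Bellman backup
            s = nxt
    return q


if __name__ == "__main__":
    q_star = value_iteration()
    q_hat = q_learning()
    gap = max(abs(q_star[s][a] - q_hat[s][a])
              for s in range(N_STATES) for a in ACTIONS)
    print("max |Q_learned - Q_optimal| =", gap)
```

The constant step size is only safe here because transitions and rewards are deterministic; with stochastic dynamics the standard convergence results need a decaying step size, and none of this carries over automatically once the table is replaced by a function approximator.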
- TurnTrout 13 Feb 2023 23:41 UTC
  LW: 2 AF: 2
  Yeah, I’ve read those books, although I admit to heavily skimming Bertsekas.