This kind of failure happens by default in policy gradient methods.
It looks like you’re kind of agreeing here that value estimate oscillation isn’t the culprit? Again, I think this is pretty standard, though the finger is usually pointed not at any particular value estimator but at the greediness of updating only on so-far-observed data, i.e. the exploration problem. The GLIE conditions[1] (Greedy in the Limit with Infinite Exploration) are a classic result. Hence the plethora of exploration techniques which are researched and employed in RL.
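For concreteness, the two GLIE conditions are usually stated something like this (my phrasing of the standard formulation, not a quote from any one source):

```latex
% GLIE (Greedy in the Limit with Infinite Exploration):
% 1. Every state-action pair is visited infinitely often:
\lim_{k \to \infty} N_k(s, a) = \infty \quad \forall s, a
% 2. The policy converges in the limit to the greedy policy:
\lim_{k \to \infty} \pi_k(a \mid s) = \mathbb{1}\!\left[ a = \arg\max_{a'} Q_k(s, a') \right]
```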
Techniques like confidence bounding[2] based on Hoeffding’s inequality, and Thompson sampling based on Bayesian uncertainty, require more than a simple mean estimate (which is all that a value or advantage is): typically at least one spread/uncertainty estimate as well[3]. Entropy regularisation, epsilon ‘exploration’, intrinsic ‘curiosity’ rewards, value-of-information estimation and so on are all heuristics for engaging with exploration.
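To make that concrete, here’s a minimal UCB1-style bandit sketch, where the Hoeffding-based bonus term is exactly the extra spread estimate a bare mean lacks (names and structure are mine, not any particular library’s API):

```python
import math

class UCB1:
    """Minimal UCB1 bandit: pick the arm with the highest mean-plus-bonus."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms   # N(a): pulls per arm
        self.means = [0.0] * n_arms  # empirical mean reward per arm
        self.t = 0                   # total pulls so far

    def select(self):
        self.t += 1
        # Pull each arm once before trusting the bonus term.
        for a, n in enumerate(self.counts):
            if n == 0:
                return a
        # Hoeffding-based confidence bonus: sqrt(2 ln t / N(a)).
        # Rarely-pulled arms get a big bonus, so greediness alone can't starve them.
        return max(
            range(len(self.counts)),
            key=lambda a: self.means[a]
            + math.sqrt(2 * math.log(self.t) / self.counts[a]),
        )

    def update(self, a, reward):
        self.counts[a] += 1
        self.means[a] += (reward - self.means[a]) / self.counts[a]  # incremental mean
```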
Epsilon exploration can get away without a spread estimate, but its GLIE guarantees hold only if there’s a (decaying) epsilon per state, which secretly smuggles in an uncertainty estimate (because you’re tracking the progress bar on each state somehow, which means you’re tracking how often you’ve seen it).
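A sketch of what that smuggled-in estimate looks like, assuming the classic GLIE-satisfying schedule ε(s) = 1/N(s) (epsilon goes to zero per state, but the expected exploration count Σ 1/k still diverges; the class and names here are illustrative):

```python
import random
from collections import defaultdict

class GLIEEpsilonGreedy:
    """Epsilon-greedy where epsilon decays per state as 1/N(s).

    Tracking N(s) is the 'progress bar': a crude uncertainty estimate
    that lets epsilon shrink only where we've actually looked.
    """

    def __init__(self, actions):
        self.actions = actions
        self.visits = defaultdict(int)                    # N(s): visit count per state
        self.q = defaultdict(lambda: defaultdict(float))  # Q(s, a) estimates

    def act(self, state):
        self.visits[state] += 1
        eps = 1.0 / self.visits[state]  # per-state epsilon -> 0 as N(s) grows
        if random.random() < eps:
            return random.choice(self.actions)            # explore
        return max(self.actions, key=lambda a: self.q[state][a])  # exploit
```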
[1] I don’t know what’s a good resource on GLIE, but you can just look up Greedy in the Limit with Infinite Exploration.
[2] Amazingly, there’s no Wikipedia entry on UCB??