The problem is that this advantage can oscillate forever.
This is a pretty standard point in RL textbooks. But the culprit is the learning rate (which you set to 1 in the example — and you can construct a nonconverging case for any constant α)! The advantage definition itself is correct and non-oscillating; it is the estimation of the expectation by a moving average that is (sometimes) at fault.
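To make the moving-average point concrete, here is a minimal two-armed-bandit sketch (my own toy setup, not necessarily the post's exact example): with a constant α = 1, the value baseline V simply tracks the last sampled reward, so the advantage estimate keeps flipping sign for as long as the policy mixes the two arms, while the higher-reward arm's logit only ever grows.

```python
import numpy as np

rng = np.random.default_rng(0)
rewards = np.array([1.0, 0.5])  # hypothetical two-armed bandit
alpha = 1.0                     # constant learning rate: V just tracks the last reward
lr = 0.1                        # policy step size
V = 0.0
logits = np.zeros(2)
advantages = []

for _ in range(1000):
    probs = np.exp(logits) / np.exp(logits).sum()
    a = rng.choice(2, p=probs)
    A = rewards[a] - V          # advantage estimate against the moving-average baseline
    advantages.append(A)
    logits[a] += lr * A         # policy-gradient-style logit update
    V += alpha * (rewards[a] - V)

# The advantage estimate oscillates in sign while the policy still mixes,
# and the higher-reward arm's logit is nondecreasing: the policy drifts
# toward a point mass on arm 0.
print(min(advantages), max(advantages), logits)
```

Note the asymmetry: updates to the high-reward arm's logit are always ≥ 0 and updates to the low-reward arm's logit are ≤ 0 (after the first step), which is the mechanism pushing toward collapse even though the advantage itself keeps oscillating.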
Oscillating or nonconvergent value estimation is not the cause of policy mode collapse.
The advantage definition itself is correct and non-oscillating… Oscillating or nonconvergent value estimation is not the cause of policy mode collapse.
The advantage is (IIUC) defined with respect to a given policy, and so as the policy changes, the advantage can oscillate and then cause mode collapse. I agree that a constant learning rate schedule is problematic, but note that ACTDE converges even with a constant learning rate schedule. So, I would indeed say that oscillating value estimation caused mode collapse in the toy example I gave?
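For reference, here is a sketch of the ACTDE-style fix as I understand it, in the same hypothetical two-armed bandit (my assumptions about the setup, not the post's exact code): baseline each action against its own running value estimate q(a) instead of a shared V. Even with constant α = 1, each arm's advantage is nonzero only the first time that arm is sampled, so the logits settle and the final policy is a softmax over rewards rather than a collapsed point mass.

```python
import numpy as np

rng = np.random.default_rng(0)
rewards = np.array([1.0, 0.5])  # same hypothetical two-armed bandit
alpha = 1.0                     # constant learning rate, as in the oscillating case
q = np.zeros(2)                 # per-action value estimates (the ACTDE-style baseline)
logits = np.zeros(2)

for _ in range(1000):
    probs = np.exp(logits) / np.exp(logits).sum()
    a = rng.choice(2, p=probs)
    A = rewards[a] - q[a]       # advantage against the per-action baseline
    logits[a] += A
    q[a] += alpha * (rewards[a] - q[a])  # with alpha = 1, q[a] = rewards[a] after one sample

# Each arm's advantage is nonzero only on its first sample, so the logits
# converge to the rewards themselves and the policy stays mixed.
print(logits)
```

Under these assumptions the logits end at (1.0, 0.5) once both arms have been sampled, i.e. a softmax-over-reward policy, which is the convergence-with-constant-learning-rate behavior claimed above.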