Rohin Shah comments on Exploring safe exploration

Rohin Shah 17 Jan 2020 15:10 UTC
A particular prediction I have now, though it is weakly held, is that episode boundaries are weak and permeable, and will probably be obsolete at some point. There are a bunch of reasons I think this, but maybe the easiest to explain is that humans learn and are generally intelligent, and we don’t have episode boundaries.
Given this, I think the “within-episode exploration” and “across-episode exploration” relax into each other, and (as the distinction of episode boundaries fades) turn into the same thing, which I think is fine to call “safe exploration”.
My main reason for making the separation is that in every deep RL algorithm I know of there is exploration-that-is-incentivized-by-gradient-descent and exploration-that-is-not-incentivized-by-gradient-descent, and it seems like these should be distinguished. Currently, due to episode boundaries, these cleanly correspond to within-episode and across-episode exploration respectively, but even if episode boundaries become obsolete I expect the question of “is this exploration incentivized by the (outer) optimizer” to remain relevant. (Perhaps we could call this outer and inner exploration, where outer exploration is the exploration that is not incentivized by the outer optimizer.)
I don’t have a strong opinion on whether “safe exploration” should refer to just outer exploration or both outer and inner exploration, since both options seem compatible with the existing ML definition.
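As a rough illustration of the distinction between exploration that is and is not incentivized by the outer optimizer, here is a toy sketch. It is not taken from any particular deep RL algorithm, and every name in it (`run_episode`, `probe_then_exploit`, `epsilon_greedy`, `mean_return`) is hypothetical. It assumes a two-armed bandit whose paying arm is resampled each episode. Probing within an episode raises the episode return, so an optimizer that maximizes return would favor it (inner exploration). The epsilon-greedy noise, by contrast, is injected by the training procedure and is not something the objective rewards (outer exploration).

```python
import random


def run_episode(policy, rng, steps=10):
    """Play one episode; the paying arm is hidden and resampled per episode."""
    good_arm = rng.randrange(2)
    history, total = [], 0
    for _ in range(steps):
        arm = policy(history)
        reward = 1 if arm == good_arm else 0
        history.append((arm, reward))
        total += reward
    return total


def always_arm_zero(history):
    """A policy that never explores within the episode."""
    return 0


def probe_then_exploit(history):
    """Inner exploration (assumed example): probe arm 0 once, then commit.

    This behavior earns a higher episode return, so an outer optimizer
    that maximizes return would incentivize it.
    """
    if not history:
        return 0
    first_arm, first_reward = history[0]
    return first_arm if first_reward == 1 else 1 - first_arm


def epsilon_greedy(q_values, epsilon, rng):
    """Outer exploration (assumed example): noise added by the training loop.

    The random branch is imposed from outside; the objective does not
    reward the agent for taking it.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def mean_return(policy, episodes=2000, seed=0):
    """Average episode return of a policy over many sampled episodes."""
    rng = random.Random(seed)
    return sum(run_episode(policy, rng) for _ in range(episodes)) / episodes


if __name__ == "__main__":
    print("no within-episode exploration:", mean_return(always_arm_zero))
    print("probe then exploit:", mean_return(probe_then_exploit))
```

In this toy setting the probing policy gets the higher average return, which is the sense in which its exploration is incentivized. Nothing in the return calculation rewards the random branch of `epsilon_greedy`.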