I think I largely agree with the comments here, and I’m not really attached to specific semantics for what exactly these terms mean. Here I’ll try to use my understanding of evhub’s meanings:
First: a disagreement on the separation.
One prediction I have now, though weakly held, is that episode boundaries are weak and permeable, and will probably become obsolete at some point. There are a bunch of reasons I think this, but maybe the easiest to explain is that humans learn and are generally intelligent, and we don’t have episode boundaries.
Given this, I think the “within-episode exploration” and “across-episode exploration” relax into each other, and (as the distinction of episode boundaries fades) turn into the same thing, which I think is fine to call “safe exploration”.
Second: a learning efficiency perspective.
Another prediction I have now (also weakly held) is that we’ll have a smooth spectrum between “things we want” and “things we don’t want”, with lots of stuff in between, instead of a sharp boundary between “accident” and “not accident”.
A small contemporary example comes from robotics: accidents are often things like “robot arm crashes into the table”, but a non-accident we still want to avoid is “motions which incur more long-term wear than alternative motions”.
An important goal of safe exploration is to learn this kind of safety quality with high efficiency. A joke about current methods is that we learn not to run into a wall by running into a wall millions of times.
With this in mind, I think for the moment sample efficiency gains are probably also safe exploration gains, but at some point we might face a trade-off, where increasing the learning efficiency of safety qualities means decreasing the learning efficiency of other qualities.
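As a toy illustration of the wall joke, here is a minimal sketch using plain tabular Q-learning with epsilon-greedy exploration. Everything in it is a made-up assumption for illustration (the corridor environment, the `train_corridor` name, the rewards and hyperparameters); it is not taken from Safety Gym or any particular paper. The agent only finds out that the wall is bad by hitting it, and it keeps hitting it for as long as random exploration continues, so the crash count is a rough proxy for how much unsafe experience learning cost.

```python
import random


def train_corridor(episodes=200, length=5, epsilon=0.3,
                   alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning on a hypothetical toy corridor.

    States are positions 0..length-1. Action 0 moves left, action 1 moves right.
    Moving left from position 0 hits the wall: reward -1, episode ends.
    Reaching position length-1 is the goal: reward +1, episode ends.
    Returns the Q-table and the number of wall crashes during training.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(length)]
    crashes = 0
    for _ in range(episodes):
        pos = 0  # start right next to the wall
        for _ in range(50):  # cap on steps per episode
            # Epsilon-greedy: random action with probability epsilon,
            # otherwise the greedy action (ties go to "right").
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                action = 0 if q[pos][0] > q[pos][1] else 1

            if action == 0 and pos == 0:
                # The only way to learn about the wall is to crash into it.
                reward, done, nxt = -1.0, True, pos
                crashes += 1
            else:
                nxt = pos - 1 if action == 0 else pos + 1
                done = nxt == length - 1
                reward = 1.0 if done else 0.0

            # Standard one-step Q-learning update.
            target = reward if done else reward + gamma * max(q[nxt])
            q[pos][action] += alpha * (target - q[pos][action])
            pos = nxt
            if done:
                break
    return q, crashes


if __name__ == "__main__":
    for n in (200, 400, 800):
        _, crashes = train_corridor(episodes=n)
        print(f"{n} episodes -> {crashes} wall crashes")
```

Because the seed is fixed, longer runs replay the shorter ones and then continue, so the crash count can only go up with more episodes. In this toy setting, anything that gets to a good policy in fewer episodes also crashes less, which is the sense in which sample efficiency gains double as safe exploration gains.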
Given this, I think the “within-episode exploration” and “across-episode exploration” relax into each other, and (as the distinction of episode boundaries fades) turn into the same thing, which I think is fine to call “safe exploration”.
I agree with this. I jumped the gun a bit by not making the distinction clear in my earlier post “Safe exploration and corrigibility,” which I think made it a bit confusing, so I went heavy on the distinction here, perhaps heavier than I actually think is warranted.
The problem I have with relaxing within-episode and across-episode exploration into each other, though, is precisely the problem I describe in “Safe exploration and corrigibility”: by default you only end up with capability exploration, not objective exploration. That is, an agent with a goal (i.e. a mesa-optimizer) is only going to explore to the extent that doing so helps its current goal, not to the extent that it helps change its goal to be more like the desired goal. Thus, you need to do something else (something that possibly looks somewhat like corrigibility) to get the agent to explore in a way that helps you collect data on what its goal is and how to change it.
One prediction I have now, though weakly held, is that episode boundaries are weak and permeable, and will probably become obsolete at some point. There are a bunch of reasons I think this, but maybe the easiest to explain is that humans learn and are generally intelligent, and we don’t have episode boundaries.
Given this, I think the “within-episode exploration” and “across-episode exploration” relax into each other, and (as the distinction of episode boundaries fades) turn into the same thing, which I think is fine to call “safe exploration”.
My main reason for making the separation is that in every deep RL algorithm I know of, there is exploration-that-is-incentivized-by-gradient-descent and exploration-that-is-not-incentivized-by-gradient-descent, and it seems like these should be distinguished. Currently, due to episode boundaries, these cleanly correspond to within-episode and across-episode exploration respectively, but even if episode boundaries become obsolete, I expect the question of “is this exploration incentivized by the (outer) optimizer” to remain relevant. (Perhaps we could call this outer and inner exploration, where outer exploration is the exploration that is not incentivized by the outer optimizer.)
I don’t have a strong opinion on whether “safe exploration” should refer to just outer exploration or both outer and inner exploration, since both options seem compatible with the existing ML definition.
(disclaimer: I worked on Safety Gym)
Hey Aray!