Given this, I think the “within-episode exploration” and “across-episode exploration” relax into each other, and (as the distinction of episode boundaries fades) turn into the same thing, which I think is fine to call “safe exploration”.
I agree with this. I jumped the gun a bit in not really making the distinction clear in my earlier post “Safe exploration and corrigibility,” but I think that made it a bit confusing, so I went heavy on the distinction here—but perhaps more heavy than I actually think is warranted.
The problem I have with relaxing within-episode and across-episode exploration into each other, though, is precisely the problem I describe in “Safe exploration and corrigibility,” however, which is that by default you only end up with capability exploration not objective exploration—that is, an agent with a goal (i.e. a mesa-optimizer) is only going to explore to the extent that it helps its current goal, not to the extent that it helps it change its goal to be more like the desired goal. Thus, you need to do something else (something that possibly looks somewhat like corrigibility) to get the agent to explore in such a way that helps you collect data on what its goal is and how to change it.
Hey Aray!
I agree with this. I jumped the gun a bit in not really making the distinction clear in my earlier post “Safe exploration and corrigibility,” but I think that made it a bit confusing, so I went heavy on the distinction here—but perhaps more heavy than I actually think is warranted.
The problem I have with relaxing within-episode and across-episode exploration into each other, though, is precisely the problem I describe in “Safe exploration and corrigibility,” however, which is that by default you only end up with capability exploration not objective exploration—that is, an agent with a goal (i.e. a mesa-optimizer) is only going to explore to the extent that it helps its current goal, not to the extent that it helps it change its goal to be more like the desired goal. Thus, you need to do something else (something that possibly looks somewhat like corrigibility) to get the agent to explore in such a way that helps you collect data on what its goal is and how to change it.