This notion of dangerousness seems strongly related to corrigibility. To demonstrate, imagine an attempt by the user to shut down the AI. Suppose that the AI has three strategies with which to respond: (i) comply with the shutdown, (ii) resist defensively, i.e. prevent the shutdown without irreversibly damaging anything, or (iii) resist offensively, e.g. by doing something irreversible to the user that will cause em to stop trying to shut down the AI. The baseline policy is complying. Then, assuming that the user’s stated beliefs endorse the shutdown, an AI with low dangerousness should at most resist defensively for a short period and then comply. That’s because resisting offensively would generate high dangerousness through permanent loss of value, whereas resisting defensively for a long time would generate high dangerousness by losing reward over that period...
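As a toy illustration of this comparison (a hypothetical sketch with made-up numbers, not part of the formal definition), one can tabulate the loss each strategy incurs relative to the complying baseline:

```python
# Hypothetical toy model of the shutdown scenario above. Following the
# informal description, "dangerousness" here is loss relative to the
# baseline policy (complying): permanent value destroyed plus reward
# forgone while resisting. All numbers are illustrative assumptions.

def dangerousness(strategy: str, resist_steps: int = 0,
                  reward_per_step: float = 1.0,
                  irreversible_loss: float = 100.0) -> float:
    """Loss relative to the complying baseline (illustrative only)."""
    if strategy == "comply":
        return 0.0  # the baseline itself: no loss by definition
    if strategy == "resist_defensively":
        # Forgoes the reward that the shutdown-endorsing user's values
        # would have generated during the resistance period; grows with
        # the length of that period.
        return resist_steps * reward_per_step
    if strategy == "resist_offensively":
        # Permanent loss of value, however brief the resistance.
        return irreversible_loss
    raise ValueError(f"unknown strategy: {strategy}")

# Brief defensive resistance stays close to the baseline; long
# resistance or any offensive resistance does not.
assert dangerousness("resist_defensively", resist_steps=2) < \
       dangerousness("resist_offensively")
```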
This notion of dangerousness opens the way towards designing AI systems that are provably safe while at the same time employing heuristic algorithms for which we lack a theoretical understanding. Indeed, as long as the AI has sufficiently low dangerousness, it will almost certainly not cause catastrophic damage.
This seems quite close (or even identical) to attainable utility preservation; if I understand correctly, this echoes arguments I’ve made for why AUP has a good shot of avoiding catastrophes and thereby getting you something which feels similar to corrigibility.
There is some similarity, but there are also major differences. They don’t even have the same type signature: the dangerousness bound is a desideratum that any given algorithm either satisfies or not, whereas AUP is a specific heuristic for tweaking Q-learning. I guess you can consider some kind of regret bound w.r.t. the AUP reward function, but these would still be very different conditions.
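To make the type-signature point concrete (a hypothetical sketch; the names below are mine, and the AUP penalty is only schematic), a desideratum is a predicate over algorithms, while AUP is a concrete modification of the reward fed to Q-learning:

```python
from typing import Callable

# Stand-in type for "an RL algorithm"; purely illustrative.
Algorithm = Callable[..., object]

# A desideratum is a *predicate*: a given algorithm either satisfies
# it or it doesn't. The actual condition is left abstract here.
def satisfies_dangerousness_bound(algo: Algorithm) -> bool:
    ...

# AUP, by contrast, is a *specific reward transformation* plugged into
# Q-learning: penalize the change in attainable utility across a set of
# auxiliary value functions, relative to a no-op (schematic form only).
def aup_penalty(q_aux_action: list[float], q_aux_noop: list[float]) -> float:
    return sum(abs(a - n) for a, n in zip(q_aux_action, q_aux_noop))

def aup_reward(r: float, penalty: float, lam: float) -> float:
    return r - lam * penalty
```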
The reason I pointed out the relation to corrigibility is not because I think that’s the main justification for the dangerousness bound. The motivation for the dangerousness bound is quite straightforward and self-contained: it is a formalization of the condition that “if you run this AI, this won’t make things worse than not running the AI”, no more and no less. Rather, I pointed the relation out to help readers compare it with other ways of thinking they might be familiar with.
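One rough way to write that condition down (my notation, a sketch rather than the precise definition): letting $U$ be utility under the user’s true preferences, $\pi_{\mathrm{AI}}$ the policy of running the AI, and $\pi_{\varnothing}$ the baseline of not running it, we ask for

$$\mathbb{E}_{\pi_{\mathrm{AI}}}[U] \;\geq\; \mathbb{E}_{\pi_{\varnothing}}[U] - \epsilon$$

for some small $\epsilon$ bounding the admissible loss.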
From my perspective, the main question is whether satisfying this desideratum is feasible. I gave some arguments why it might be, but there are also arguments pointing the other way. Specifically, if you believe that debate is a necessary component of Dialogic RL, then it seems like the dangerousness bound is infeasible: the AI can become certain that the user would respond in a particular way to a query, but it cannot become (worst-case) certain that the user would not change eir response when faced with some rebuttal. You can’t (empirically, in the worst case) prove a negative.
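To illustrate the “can’t prove a negative” problem with a toy model (an entirely hypothetical setup): treat the worst-case belief as a set of hypotheses, one per candidate rebuttal that might flip the user’s answer. Empirically testing rebuttals eliminates hypotheses one at a time, but since the rebuttal space is effectively unbounded, the worst case always retains a hypothesis under which some untried rebuttal flips the answer:

```python
# Hypothetical toy model of worst-case (credal) uncertainty. Hypothesis k
# says "rebuttal k would change the user's response"; testing rebuttal k
# and seeing no change refutes exactly that one hypothesis.

def worst_case_flip_possible(tested: set[int], rebuttal_space: int) -> bool:
    """True iff some not-yet-refuted hypothesis still predicts a flip."""
    untested = set(range(rebuttal_space)) - tested
    return len(untested) > 0  # any untested rebuttal might still flip

# No matter how many rebuttals are empirically tested...
tested = set(range(10_000))
# ...the worst case over the remaining hypotheses still allows a flip:
assert worst_case_flip_possible(tested, rebuttal_space=10_001)
```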