After reading your post, I broadly agree that a lot of control problems will show up first as security problems, and that we should frame control problems as security problems more often. The examples you give of control problems showing up as security problems seem individually somewhat compelling, but I’m not convinced of the generalization “control problems will generally show up as security problems first”.
The main argument you give for this is that adversaries will engineer hard situations that will cause AI systems to fail.
This argument relies on the AI itself not having a privileged position from which to cause bad things to happen (a position not available to adversaries).
Consider an AI that fails only when its input contains a certain password. Adversaries will have trouble synthesizing such an input; there’s an information asymmetry between the AI and outside adversaries. The analogous security exploit (feeding the AI data that will cause it to fail when you later show it the password) seems hard to engineer, but maybe I’m just not thinking of a way to do it?
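To make the information asymmetry concrete, here’s a minimal Python sketch (purely illustrative, not something from your post; the trigger string and function names are made up): only a hash of the secret password is embedded, so an outside adversary who doesn’t know the password can only guess at inputs that would trip the failure.

```python
import hashlib

# Hypothetical illustration: a policy whose failure is gated on a secret
# "password". Only the password's hash is embedded, so inspecting the code
# doesn't reveal which input triggers the failure.
TRIGGER_HASH = hashlib.sha256(b"made-up-secret-password").hexdigest()

def act(observation: str) -> str:
    """Behave normally unless some token in the input matches the trigger."""
    for token in observation.split():
        if hashlib.sha256(token.encode()).hexdigest() == TRIGGER_HASH:
            return "bad action"   # failure only the password-holder can elicit
    return "normal action"        # benign behavior on every other input
```

An adversary red-teaming this system blind has to guess the password, and each guess succeeds with negligible probability, so ordinary security probing is unlikely to surface the failure first.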
I also don’t see how this addresses cases where adversarial inputs are computationally (not just informationally) hard to synthesize, such as an AI that fails only when its input contains a hash collision. There could be an analogous security problem, but I don’t see a realistic one. (Perhaps you could see this as the AI having a privileged position in selecting the time at which it fails.)
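Here’s the computationally hard variant as a similar sketch (again hypothetical): the failure is gated on a SHA-256 collision appearing in the input, which no realistic adversary can currently manufacture on purpose.

```python
import hashlib
from itertools import combinations

# Hypothetical illustration: a failure condition that is computationally hard
# for anyone (designer, outside attacker, red team) to trigger deliberately.
def act(observation_tokens: list[str]) -> str:
    """Fail only if the input contains two distinct strings with equal SHA-256 digests."""
    for a, b in combinations(sorted(set(observation_tokens)), 2):
        if hashlib.sha256(a.encode()).digest() == hashlib.sha256(b.encode()).digest():
            return "bad action"   # fires only on a genuine hash collision
    return "normal action"
```

The check is cheap to run but (as far as anyone knows) infeasible to satisfy on demand, which is what makes it hard for me to see a security exploit someone could engineer today that would surface this failure mode ahead of time.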
This also seems to rely on a “no fast local takeoff” assumption. For example, it deals with the AI persuading the overseer to do bad things the same way it deals with outside attackers persuading the overseer to do bad things, which doesn’t work if there’s a fast local takeoff.