Comment TL;DR: Adversarial examples are a weapon against AIs that we can use for good, and solving adversarial robustness would let AIs harden themselves.
I haven’t read this yet (I will later : ) ), so it’s possible this is mentioned, but I’d note that exploiting the lack of adversarial robustness could also be used to improve safety. For instance, AI systems might have a hard time keeping secrets if they also need to interact with humans who are probing for verifiable secrets. E.g., trying to jailbreak an AI into admitting that it has already hacked the SSH server.
Ofc, there are also clownier versions, like trying to fight robot armies with adversarial examples (naively this seems implausible, but who knows). I’d guess that reasonably active militaries should probably already be thinking about making their camo double as a multi-system adversarial example if possible. (People are already using vision models for targeting, etc.)
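To make the “adversarial example” idea concrete, the simplest digital version is a one-step gradient attack (FGSM). Here is a minimal sketch, assuming PyTorch/torchvision are available and using resnet18 purely as a stand-in target model; a single-model digital perturbation like this is far weaker than what camo would actually need (physical-world, multi-viewpoint, multi-model robustness), which is part of why the military version is speculative:

```python
# Minimal FGSM sketch (one-step gradient attack): nudge every pixel slightly
# in the direction that most increases the classifier's loss. Illustrative
# only -- not a recipe for physically robust, multi-system adversarial camo.
import torch
import torch.nn.functional as F
import torchvision

model = torchvision.models.resnet18(weights="IMAGENET1K_V1").eval()

def fgsm(image: torch.Tensor, true_label: int, eps: float = 0.03) -> torch.Tensor:
    """Return an adversarially perturbed copy of `image` (shape [1, 3, H, W], values in [0, 1])."""
    image = image.clone().requires_grad_(True)
    loss = F.cross_entropy(model(image), torch.tensor([true_label]))
    loss.backward()
    # Step each pixel by +/- eps in the direction that increases the loss.
    return (image + eps * image.grad.sign()).clamp(0, 1).detach()
```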
I don’t think this lack of adversarial robustness gets you particularly strong alignment properties, but it’s worth noting upsides as well as downsides.
Perhaps we’d ideally train our AIs to be robust to most classes of attack but ensure they remain non-robust in a few (ideally secret) edge cases. However, the most important use cases for AI exploits arise after AIs have already started applying training procedures to themselves without our consent.
My overall guess is that the upsides of solving adversarial robustness outweigh the downsides.
This is a good point: adversarial examples in what I called the “main” ML system in the post can be desirable, even though we typically don’t want them in the “helper” ML systems used to align the main system.
One downside to adversarial vulnerability of the main ML system is that it could be exploited by bad actors (whether humans or other, misaligned AIs). But this might be fine in some settings: e.g., you build an air-gapped system that helps you build the next, more robust and aligned AI. One could also imagine crafting adversarial-example backdoors that are cryptographically hard to discover if you don’t know how they were constructed.
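The intuition behind that last point is that if the trigger condition is keyed, finding a triggering input by search is roughly as hard as breaking the underlying cryptography. A toy sketch of that intuition only, with names and the wrapper structure invented for illustration (a real construction would hide the check inside the trained weights rather than wrapping the model):

```python
# Hedged sketch of a "cryptographically hidden" trigger: the special behaviour
# fires only on inputs carrying a valid MAC under a secret key, so an auditor
# without the key can't find triggering inputs by brute-force search. This
# wraps a model rather than planting a backdoor via training -- it illustrates
# the hardness argument, not an actual construction.
import hmac, hashlib
import numpy as np

SECRET_KEY = b"held only by whoever built the backdoor"  # hypothetical

def tag(payload: bytes) -> bytes:
    return hmac.new(SECRET_KEY, payload, hashlib.sha256).digest()[:16]

def backdoored_predict(model_predict, x: np.ndarray):
    """`model_predict` is any base classifier; `x` is a flat float array in [0, 1]
    whose last 16 entries are read as an embedded tag over the rest."""
    payload = x[:-16].tobytes()
    embedded = bytes((x[-16:] * 255).astype(np.uint8))
    if hmac.compare_digest(embedded, tag(payload)):
        return "BACKDOOR_OUTPUT"   # behaviour only the key-holder can elicit
    return model_predict(x)        # normal behaviour everywhere else
```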
I generally expect that if adversarial robustness can be practically solved, then transformative AI systems will eventually self-improve to the point of being robust. So the window where AI systems are dangerous & deceptive enough that we need to test them using adversarial examples, but not yet capable enough to have overcome this, might be quite short. It could still be useful as an early-warning sign, though.