Arguments for Robustness in AI Alignment

Most popular robustness papers address short-term problems caused by robustness failures in neural networks, e.g., in the context of autonomous driving. I believe that robustness is also critical for the long-term success of AI alignment, and that ending the discussion with a statement along the lines of “Meh, this doesn’t seem solvable; also, our benchmarks are looking good and more capable models won’t be fooled like this” doesn’t cut it.

Epistemic status: I have thought a lot about robustness over the past half year or so and found the arguments below convincing. However, there are certainly more arguments, and I’ll keep updating the list. Ideally, this will eventually turn into a quick survey of the arguments to quantify what people are actually concerned about.

Reasons to be concerned:

- Remote code execution through jailbreaks seems obviously bad and is treated as a critical threat in IT security.

- Failure to manage and maintain the resulting complexity when AIs are part of larger systems deployed in complex and potentially adversarial settings, e.g., alongside other AIs.

- There is evidence that humans react to the same kinds of attacks as ensembles of CV models do, which could enable the creation of super-stimuli.

- Transfer attacks like GCG will likely continue to be possible, as there aren’t many internet-scale text datasets. Hence, it seems plausible that multiple models will share similar failure modes and rely on the same non-robust predictive features (a toy transfer sketch follows this list).

- We don’t have an angle on the problem that addresses the cause at scale, only empirically useful treatment of symptoms, like adversarial training, outside the scope of ℓ_p-attacks (which can be handled with certifiable methods); see the adversarial-training sketch after this list.

- Just because we fail to find adversarial examples empirically doesn’t mean there aren’t any. Humans have been able to semantically jailbreak LLMs with patterns that no optimizer has found yet. Thus, it seems plausible that we don’t even know how to test whether LLMs are robust.

- It might not matter whether advanced AI systems share our values and act in humanity’s best interest if the models remain vulnerable to “erratically” misbehaving badly and are given sufficient leverage.

- As long as LLMs become more capable while remaining vulnerable, Yudkowsky’s quip about Moore’s Law of Mad Science, “Every 18 months, the minimum IQ to destroy the world drops by one point,” might turn out to be accurate.
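
To make the transfer point concrete, here is a minimal toy sketch (not GCG itself, which optimizes token sequences against LLMs): an ℓ∞-bounded FGSM perturbation is crafted against a surrogate model and then evaluated on a separate target model. The models, input, and label are untrained placeholders I made up for illustration, so only the mechanics are shown; in practice, models trained on the same data tend to share non-robust features, which is what makes such perturbations transfer.

```python
import torch
import torch.nn as nn

def make_model():
    # Stand-in for an independently parameterized classifier; in a real
    # transfer experiment, both models would be trained on (similar) data.
    return nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128),
                         nn.ReLU(), nn.Linear(128, 10))

surrogate, target = make_model(), make_model()

x = torch.rand(1, 1, 28, 28)   # placeholder input in [0, 1]
y = torch.tensor([3])          # placeholder label
eps = 8 / 255                  # l_inf perturbation budget

# FGSM on the surrogate: one signed-gradient step inside the eps-ball.
x_adv = x.clone().requires_grad_(True)
nn.functional.cross_entropy(surrogate(x_adv), y).backward()
x_adv = (x + eps * x_adv.grad.sign()).clamp(0, 1).detach()

# Transfer check: does the perturbation crafted on the surrogate also
# change the target model's prediction?
print("surrogate prediction:", surrogate(x_adv).argmax(dim=1).item())
print("target prediction:   ", target(x_adv).argmax(dim=1).item())
```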
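
Similarly, a minimal sketch of the “treatment of symptoms” mentioned above, again assuming a placeholder model and data: one step of PGD-based adversarial training, which empirically hardens a model inside a chosen ℓ∞-ball but gives no guarantee outside that threat model, in contrast to certifiable methods for ℓ_p-attacks.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128),
                      nn.ReLU(), nn.Linear(128, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.rand(32, 1, 28, 28)            # placeholder batch in [0, 1]
y = torch.randint(0, 10, (32,))          # placeholder labels
eps, alpha, steps = 8 / 255, 2 / 255, 7  # l_inf budget, step size, PGD iterations

# Inner maximization: PGD searches for an approximate worst-case
# perturbation delta inside the eps-ball around x.
delta = torch.empty_like(x).uniform_(-eps, eps)
for _ in range(steps):
    delta.requires_grad_(True)
    loss = nn.functional.cross_entropy(model(x + delta), y)
    grad, = torch.autograd.grad(loss, delta)
    with torch.no_grad():
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps)
        delta = (x + delta).clamp(0, 1) - x   # keep x + delta a valid image

# Outer minimization: an ordinary training step on the perturbed batch.
opt.zero_grad()
nn.functional.cross_entropy(model(x + delta), y).backward()
opt.step()
```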

Acknowledgements:
Thanks to @Lukas Fluri for the feedback and pointers!