One needs to be very careful when proposing a heuristically motivated adversarial defense. Otherwise, one runs the risk of being the next example in a paper like this one, which surveyed a wide variety of heuristic defenses and showed that stronger attacks completely break most of them.
In fact, Section 5.2.2 of that paper completely breaks the defense from an earlier paper that used image cropping and rescaling, bit-depth reduction, and JPEG compression as baseline defenses, in addition to new defense methods called total variance minimization and image quilting. While the original defense paper claimed 60% robust top-1 accuracy, the attack circumvents the defense entirely, dropping accuracy to 0%.
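For reference, these input-transformation defenses are simple to sketch; something like the following (the function names and parameters here are my own illustration, not code from either paper). The transformations are non-differentiable, which is precisely why they give a false sense of security: naive gradient attacks fail, but the attack paper simply approximates the transformation's gradient (e.g., treating it as the identity on the backward pass) and breaks through.

```python
import io
import numpy as np
from PIL import Image

def reduce_bit_depth(img: np.ndarray, bits: int = 3) -> np.ndarray:
    """Quantize uint8 pixel values down to 2**bits levels."""
    levels = 2 ** bits
    return (np.floor(img / 256.0 * levels) * (256 // levels)).astype(np.uint8)

def jpeg_compress(img: np.ndarray, quality: int = 75) -> np.ndarray:
    """Round-trip the image through lossy JPEG encoding."""
    buf = io.BytesIO()
    Image.fromarray(img).save(buf, format="JPEG", quality=quality)
    return np.asarray(Image.open(buf))

def defended_predict(classifier, img: np.ndarray):
    """Preprocess with the (non-differentiable) transformations, then classify."""
    return classifier(jpeg_compress(reduce_bit_depth(img)))
```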
If compression were all you needed to solve adversarial robustness, the adversarial robustness literature would already know this.
My current understanding is that an adversarial defense paper should measure certified robustness, which guarantees that no attack within the stated threat model (e.g., perturbations inside a fixed ℓ2 or ℓ∞ ball) can break the defense, rather than just showing that known attacks fail. This paper gives an example of certified robustness. However, I’m not a total expert in adversarial defense methodology, so I’d encourage anyone considering writing an adversarial defense paper to talk to such an expert.
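To make that concrete, here is a toy sketch of the kind of guarantee a certificate gives, in the style of randomized smoothing (I’m not claiming this is the linked paper’s exact method; the function names, the Hoeffding bound, and the constants are my own simplifications). The idea: classify many Gaussian-noised copies of the input, lower-bound the top class’s probability, and convert that bound into an L2 radius within which the smoothed prediction provably cannot change.

```python
import math
import numpy as np
from scipy.stats import norm

def certify_l2_radius(predict, x, sigma=0.25, n=1000, alpha=0.001, num_classes=10):
    """predict(x) -> class index. Returns (top_class, certified L2 radius)."""
    counts = np.zeros(num_classes, dtype=int)
    for _ in range(n):
        counts[predict(x + sigma * np.random.randn(*x.shape))] += 1
    top = int(counts.argmax())
    # Hoeffding lower confidence bound on the top class's probability under noise.
    p_lower = counts[top] / n - math.sqrt(math.log(1 / alpha) / (2 * n))
    if p_lower <= 0.5:
        return top, 0.0  # no certificate; a real implementation would abstain
    # No perturbation within this radius can flip the smoothed prediction.
    return top, sigma * norm.ppf(p_lower)
```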
Did you try searching the broader academic literature for ideas similar to yours? There seems to be a lot of closely related work that you’d find interesting. For example:
Elite BackProp: Training Sparse Interpretable Neurons. They train CNNs to have “class-wise activation sparsity.” They claim their method achieves “high degrees of activation sparsity with no accuracy loss” and “can assist in understanding the reasoning behind a CNN.”
Accelerating Convolutional Neural Networks via Activation Map Compression. They “propose a three-stage compression and acceleration pipeline that sparsifies, quantizes, and entropy encodes activation maps of Convolutional Neural Networks.” The sparsification step adds an L1 penalty to the activations in the network, applied at finetuning time (a minimal sketch of this kind of penalty appears after this list). The work only examines accuracy, not interpretability.
Enhancing Adversarial Defense by k-Winners-Take-All. Proposes the k-Winners-Take-All activation function, which keeps only the k largest activations and zeroes out the rest (sketched after this list). It is a drop-in replacement during neural network training, and they find it improves adversarial robustness in image classification. How Can We Be So Dense? The Benefits of Using Highly Sparse Representations also uses the k-Winners-Take-All activation function, among other sparsification techniques.
The Neural LASSO: Local Linear Sparsity for Interpretable Explanations. Adds an L1 penalty to the gradient of the output with respect to the input (sketched after this list). The intuition is to make the final output have a “sparse local explanation,” where “local explanation” = input gradient.
Adaptively Sparse Transformers. They replace softmax with α-entmax, “a differentiable generalization of softmax that allows low-scoring words to receive precisely zero weight” (see the sparsemax sketch after this list). They claim “improve[d] interpretability and [attention] head diversity” and also that “at no cost in accuracy, sparsity in attention heads helps to uncover different head specializations.”
Interpretable Neural Predictions with Differentiable Binary Variables. They train two neural networks. One “selects a rationale (i.e. a short and informative part of the input text)”, and the other “classifies… from the words in the rationale alone.”
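For the activation-map-compression paper, the sparsification step amounts to adding an L1 term on the activations to the finetuning loss. A minimal PyTorch sketch of that idea (the toy model, the hook-based bookkeeping, and the penalty weight are mine, not the paper’s pipeline, which also quantizes and entropy-codes the maps):

```python
import torch
import torch.nn as nn

# Toy model; the hook-based L1 penalty on activations is the relevant part.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10),
)

activations = []
for module in model.modules():
    if isinstance(module, nn.ReLU):
        # Record each ReLU's output so the loss can penalize its L1 norm.
        module.register_forward_hook(lambda mod, inp, out: activations.append(out))

def finetune_loss(x, y, l1_weight=1e-4):
    activations.clear()
    logits = model(x)
    task_loss = nn.functional.cross_entropy(logits, y)
    # L1 term pushes activation maps toward sparsity during finetuning.
    sparsity_loss = sum(a.abs().mean() for a in activations)
    return task_loss + l1_weight * sparsity_loss
```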
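The k-Winners-Take-All activation is simple enough to write out; something like this (my own minimal implementation, keeping a top fraction of units per example and zeroing the rest; ties at the threshold are handled sloppily here):

```python
import torch
import torch.nn as nn

class KWinnersTakeAll(nn.Module):
    """Keep the k largest activations per example; zero out everything else."""
    def __init__(self, sparsity_ratio: float = 0.1):
        super().__init__()
        self.sparsity_ratio = sparsity_ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        flat = x.flatten(start_dim=1)                      # (batch, units)
        k = max(1, int(self.sparsity_ratio * flat.shape[1]))
        topk_vals, _ = flat.topk(k, dim=1)
        threshold = topk_vals[:, -1].unsqueeze(1)          # k-th largest value
        mask = (flat >= threshold).float()
        return (flat * mask).view_as(x)

# Drop-in replacement for the usual activation, e.g.:
layer = nn.Sequential(nn.Linear(128, 256), KWinnersTakeAll(0.1))
```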
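The Neural LASSO penalty is essentially a double-backward trick: compute the input gradient with create_graph=True, penalize its L1 norm, and backpropagate through that. Again, this is my sketch of the idea, not the authors’ code (I penalize the target-class logit’s input gradient; the paper’s exact choice of output may differ):

```python
import torch
import torch.nn as nn

def neural_lasso_loss(model, x, y, grad_l1_weight=1e-3):
    """Task loss plus an L1 penalty on the input gradient of the target-class logit."""
    x = x.clone().requires_grad_(True)
    logits = model(x)
    task_loss = nn.functional.cross_entropy(logits, y)
    # Input gradient of the target-class logits, kept in the graph
    # (create_graph=True) so the penalty itself can be backpropagated through.
    target_logits = logits.gather(1, y.unsqueeze(1)).sum()
    input_grad, = torch.autograd.grad(target_logits, x, create_graph=True)
    return task_loss + grad_l1_weight * input_grad.abs().mean()
```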
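And for the adaptively sparse transformers: the α = 2 special case of α-entmax is sparsemax, which projects scores onto the probability simplex so that low scores get exactly zero weight. A self-contained sketch of sparsemax follows (general α-entmax needs a bisection or sorting-based solver; I believe there is also a standalone entmax package, but I haven’t checked its API):

```python
import torch

def sparsemax(scores: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Sparsemax (Martins & Astudillo, 2016): like softmax, but low-scoring
    entries receive exactly zero probability."""
    z, _ = scores.sort(dim=dim, descending=True)
    cumsum = z.cumsum(dim)
    k = torch.arange(1, scores.shape[dim] + 1, device=scores.device, dtype=scores.dtype)
    shape = [1] * scores.dim()
    shape[dim] = -1
    k = k.view(shape)                               # broadcast k along `dim`
    support = (1 + k * z) > cumsum                  # which sorted entries stay nonzero
    k_z = support.sum(dim=dim, keepdim=True)        # size of the support
    tau = (cumsum.gather(dim, k_z - 1) - 1) / k_z   # threshold
    return torch.clamp(scores - tau, min=0.0)

# Example: attention weights where weak scores are exactly zero.
attn = sparsemax(torch.tensor([1.0, 0.8, 0.1, -1.0]))  # -> approximately [0.6, 0.4, 0.0, 0.0]
```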
I ask because your paper doesn’t seem to have a related work section, and most of your citations in the intro are from other safety research teams (e.g., Anthropic, OpenAI, CAIS, and Redwood).