my summary of these two papers: "robustness may be at odds with accuracy" (https://arxiv.org/pdf/1805.12152) and "adversarial examples are not bugs, they are features" (https://arxiv.org/pdf/1905.02175).
the first paper observes that adversarial (robust) accuracy and standard accuracy can be at odds with each other: training for one can come at the cost of the other. the authors present a toy example to explain why.
the construction gives each input one channel that predicts the binary label with 90% accuracy, plus a bazillion iid gaussian channels that are individually almost useless (each is only weakly correlated with the label), but whose average predicts the label with ~100% accuracy. they show that ℓ∞-adversarial training on the input learns to use only the 90%-accurate channel, whereas standard training uses all the weak channels.
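a minimal numpy sketch of the toy distribution (my own constants, not the paper's exact parameters), showing that the average of the weak channels beats the 90% channel on clean data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 1_000           # samples, number of weak gaussian channels
eta = 3.0 / np.sqrt(d)         # per-channel signal strength, shrinks with d

y = rng.choice([-1, 1], size=n)                         # binary label
strong = y * np.where(rng.random(n) < 0.9, 1, -1)       # 90%-accurate channel
weak = rng.normal(eta * y[:, None], 1.0, size=(n, d))   # weak gaussian channels

print(np.mean(np.sign(strong) == y))          # ~0.90: the strong channel alone
print(np.mean(np.sign(weak.mean(1)) == y))    # ~1.00: averaging the weak channels
```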
the key to this construction is that the attack budget is an ℓ∞-ball on the input (distance is the max across all coordinates). shifting every coordinate by ε moves you ε√n in ℓ2, so adding more features lets you move further and further in ℓ2 space. but the per-channel signal is scaled so that the ℓ2 distance between the means of the two high-dimensional gaussians stays constant, so no matter how small your ε is, with enough channels you can perturb anything from one class into the other class and vice versa.
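continuing the sketch above: an ℓ∞ budget of just ε = 2η (tiny per coordinate, but ε√d long in ℓ2) flips the mean of every weak channel, destroying the averaging classifier while barely touching the ±1 strong channel:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 1_000
eta = 3.0 / np.sqrt(d)
eps = 2 * eta                  # ℓ∞ budget per coordinate (~0.19 here)

y = rng.choice([-1, 1], size=n)
strong = y * np.where(rng.random(n) < 0.9, 1, -1)
weak = rng.normal(eta * y[:, None], 1.0, size=(n, d))

# worst-case ℓ∞ attack: push every weak channel by eps against the true label,
# turning each channel's mean from +eta*y into -eta*y.
weak_adv = weak - eps * y[:, None]
strong_adv = strong - eps * y   # the same budget barely moves a ±1 channel

print(np.mean(np.sign(weak.mean(1)) == y),      # ~1.00 clean
      np.mean(np.sign(weak_adv.mean(1)) == y))  # ~0.00 attacked
print(np.mean(np.sign(strong) == y),            # ~0.90 clean
      np.mean(np.sign(strong_adv) == y))        # ~0.90 attacked
```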
in the second paper, the authors run further experiments on real models to show that you can separate the robust features from the non-robust ones, and recombine them into frankenstein images that look like dogs to humans and to the robust model, but like cats to the non-robust model.
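roughly, that recombination step takes an image of one class and nudges it, under a small ℓ2 budget, until a standard-trained model confidently reads it as another class. here is a generic targeted-PGD sketch of that step (placeholder model, NCHW image batch, and hyperparameters of my own choosing, not the paper's exact recipe):

```python
import torch
import torch.nn.functional as F

def targeted_l2_pgd(model, x, target, eps=2.0, steps=40, step_size=0.1):
    """nudge x toward class `target` within an l2 ball of radius eps around x.

    the result still looks like the original class to a human, but its
    non-robust features now say `target` to a standard-trained model.
    assumes x is an NCHW image batch in [0, 1] and target holds class indices.
    """
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), target)
        grad, = torch.autograd.grad(loss, x_adv)
        # step downhill on the target-class loss, normalizing per sample
        g = grad / (grad.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-12)
        x_adv = x_adv.detach() - step_size * g
        # project back into the l2 ball around the original image
        delta = x_adv - x
        norms = delta.flatten(1).norm(dim=1).view(-1, 1, 1, 1) + 1e-12
        x_adv = (x + delta * (eps / norms).clamp(max=1.0)).clamp(0, 1)
    return x_adv.detach()
```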
they also generalize the toy example from the first paper. they argue that, in general, adversarial examples arise exactly when the metric the attack is measured in differs from the metric the loss induces. in other words, the loss function (and, in a multilayer model, the downstream part of the model) implies a loss surface around any data point, and some directions on that surface matter far more for the loss than others. but the ε-ball you attack within (in, say, ℓ2) treats all of those directions as equally important, so the attacker can spend the whole budget on the direction that changes the loss the most.
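a tiny numeric illustration of that point (toy numbers, first-order approximation of the loss): because the ℓ2 ball weights every direction equally, the optimal attack piles the whole budget onto the steep direction and gains roughly ε·‖∇L‖, far more than an evenly spread perturbation of the same length:

```python
import numpy as np

# toy local picture: the loss is ~linear in the perturbation, and one input
# direction is much steeper than the rest (hypothetical numbers).
d = 100
g = np.full(d, 0.1)      # most directions barely matter...
g[0] = 10.0              # ...one direction matters a lot
eps = 0.5                # ℓ2 attack budget

delta_aligned = eps * g / np.linalg.norm(g)    # whole budget on the steep direction
delta_spread = np.full(d, eps / np.sqrt(d))    # same ℓ2 length, spread evenly

print(g @ delta_aligned)  # ~5.0: eps * ||g||, the first-order worst case
print(g @ delta_spread)   # ~1.0: same budget, far less loss change
```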
their new example is a classification task on two features, where the two classes are very stretched-out gaussians placed diagonally from each other, so that an ℓ2 ball around each mean reaches into the other gaussian's distribution. under standard training, the learned classification boundary falls along the line where the mahalanobis distance to the two means is equal (intuitively, exactly those points that are equally likely to have been sampled from either distribution). but mahalanobis distance is not the ℓ2 norm: it counts displacement along a gaussian's low-variance axis as much larger, so it is happy to put the boundary close (in ℓ2) to each mean. that lets a small ℓ2 perturbation step over the boundary.
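a small numpy sketch of that geometry (my own numbers; for convenience the coordinates are the covariance eigenbasis, which changes nothing about ℓ2 distances). the equal-mahalanobis boundary is nearly perfect on clean data, yet it passes within ~0.1 of each mean in ℓ2, so a perturbation of ℓ2 length 0.3 flips almost every point:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# two stretched-out gaussians at ±mu: far apart in mahalanobis distance, but
# the boundary will sit l2-close to both means.
mu = np.array([2.0, 0.1])
cov = np.diag([4.0, 0.0004])     # std 2.0 along axis 0, std 0.02 along axis 1

y = rng.choice([-1, 1], size=n)
x = y[:, None] * mu + rng.multivariate_normal([0.0, 0.0], cov, size=n)

# boundary of equal mahalanobis distance to the two means: {x : w @ x = 0},
# with w = inv(cov) @ (mu - (-mu)); this is the classifier standard training finds.
w = np.linalg.inv(cov) @ (2 * mu)
print("clean accuracy:", np.mean(np.sign(x @ w) == y))              # ~1.000
print("l2 distance from each mean to the boundary:",
      abs(mu @ w) / np.linalg.norm(w))                              # ~0.10

# optimal l2 attack on a linear classifier: step straight toward the boundary.
eps = 0.3   # small compared to the ~4.0 l2 gap between the two means
x_adv = x - eps * y[:, None] * w / np.linalg.norm(w)
print("adversarial accuracy:", np.mean(np.sign(x_adv @ w) == y))    # ~0.00
```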