Some disorganized thoughts about adversarial ML:
I think I’m a little bit sad about the times we got whole rooms full of research posters about variations on epsilon-ball adversarial attacks & training, nearly all of them claiming this would help AI safety, AI alignment, AI robustness, or AI generalization, and nearly all of those claims being basically wrong.
This has led me to be pretty critical of claims about adversarial training as a pathway to aligning AGI.
Ignoring the history of adversarial training research, I think I still have problems with adversarial training as a path to aligning AGI.
First, adversarial training is primarily a capabilities booster. Your model makes some errors that are obvious/predictable to the research team working on it -- and if you train against them, they’re no longer errors!
This has the “if you solve all the visible problems, all you’re left with is invisible problems” alignment issue, as well as a cost-competitiveness issue (many adversarial training approaches require a lot of compute).
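For concreteness, here’s a minimal sketch of the kind of loop I mean -- standard epsilon-ball adversarial training with a one-step FGSM inner attack for brevity. `model`, `loader`, and the 8/255 budget are placeholders for illustration, not anyone’s actual setup:

```python
# Minimal sketch of epsilon-ball adversarial training (PyTorch-style).
# The one-step FGSM inner attack is a deliberately simplified stand-in
# for a stronger multi-step attack.
import torch
import torch.nn.functional as F

def adversarial_training_epoch(model, loader, optimizer, epsilon=8 / 255):
    model.train()
    for x, y in loader:
        # Inner step: surface a "visible" error inside the epsilon-ball (one FGSM step).
        x_pert = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_pert), y)
        grad = torch.autograd.grad(loss, x_pert)[0]
        x_adv = (x + epsilon * grad.sign()).clamp(0.0, 1.0).detach()

        # Outer step: train against that error so it is no longer an error.
        optimizer.zero_grad()
        F.cross_entropy(model(x_adv), y).backward()
        optimizer.step()
```

The point of the sketch is just the shape of the loop: the inner step finds a visible problem, the outer step trains it away.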
From a “definitely a wild conjecture” angle, I am uncertain how the current rates of adding vs removing adversarial examples will play out in the limit. Basically, think of training as a continuous process that removes normal errors and adds adversarial errors. (In particular, while there are many adversarial examples present at the initialization of a model, there are also adversarial examples at the end of training which didn’t exist at the beginning; I’m using this to claim that training ‘put them in’.) Adversarial training removes some adversarial examples, but probably adds some which are adversarial in an orthogonal way. At least, this is what I expect given that adversarial training doesn’t seem to be cross-robust.
I think if we had some empirical measure of how training and adversarial training affect the number of adversarial examples a model has, I’d probably update on whatever happened empirically. It does seem at least possible to me that adversarial training reduces adversarial examples on net, so given a wide enough distribution and a strong enough adversary, you’ll eventually end up with a model that is arbitrarily robust (and not exploitable).
It’s worth mentioning again that current methods don’t even provide robust protection against each other’s attacks.
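Here’s a rough sketch (not a real benchmark) of how you might probe both of the last two points: estimate the fraction of test inputs an attack can still flip, once with roughly the attack the model was trained against and once with a stronger or different one. `model`, `test_loader`, and the PGD settings are all assumptions for illustration:

```python
# Sketch: estimate the fraction of inputs that still have an adversarial example
# inside the epsilon-ball, under two different attacks, to probe cross-robustness.
import torch
import torch.nn.functional as F

def pgd_linf(model, x, y, epsilon, step_size, steps):
    """Multi-step L-infinity PGD attack confined to the epsilon-ball around x."""
    x_adv = (x + torch.empty_like(x).uniform_(-epsilon, epsilon)).clamp(0.0, 1.0)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        grad = torch.autograd.grad(F.cross_entropy(model(x_adv), y), x_adv)[0]
        x_adv = (x_adv + step_size * grad.sign()).detach()
        x_adv = torch.min(torch.max(x_adv, x - epsilon), x + epsilon).clamp(0.0, 1.0)
    return x_adv

def attack_success_rate(model, loader, attack):
    """Fraction of examples the attack pushes to a wrong prediction."""
    model.eval()
    flipped, total = 0, 0
    for x, y in loader:
        x_adv = attack(model, x, y)
        flipped += (model(x_adv).argmax(dim=1) != y).sum().item()
        total += y.numel()
    return flipped / total

# E.g. compare a weak attack (like the one trained against) with a stronger, unseen one:
# weak = lambda m, x, y: pgd_linf(m, x, y, 8/255, 2/255, steps=1)
# strong = lambda m, x, y: pgd_linf(m, x, y, 8/255, 2/255, steps=40)
# print(attack_success_rate(model, test_loader, weak),
#       attack_success_rate(model, test_loader, strong))
```

A large gap between the two numbers is the cross-robustness failure described above: the model was patched against the training-time attack rather than made robust in general.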
I think my actual net position here is something like:
- Adversarial Training and Adversarial ML were over-hyped as AI Safety in ways that were just plain wrong.
- Some version of this has some place in a broad and vast toolkit for doing ML research.
- I don’t think Adversarial Training is a good path to aligned AGI.