I think this is a very cool approach and a useful toy alignment problem. I’m interested in your automated toolset for generating adversarial examples against the classifier. My recent PhD work has been in automatically generating text counterfactuals, which are closely related to adversaries but less constrained in the modifications they make. My advisor and I published a paper with a new whitebox method for generating counterfactuals.
For generating text adversaries specifically, one great tool I’ve found is TextAttack. It implements many recent text adversary methods in a common framework with a straightforward interface. Current text adversary methods aim to fool the classifier without changing the “true” class humans would assign, and they do this by imposing constraints on the modifications the attack is allowed to make. E.g., a common constraint is to force a minimum similarity between the encodings of the original and the adversarial text under a sentence-embedding model like the Universal Sentence Encoder. This is supposed to keep the adversarial text looking “natural” to humans while still fooling the model. I think that’s somewhat like what you’re aiming for with “strategically hidden” failure cases.
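In case it helps you get started, here’s a minimal sketch of running one of TextAttack’s built-in recipes (TextFooler, which happens to include a Universal Sentence Encoder similarity constraint of the kind I described) against a HuggingFace classifier. The model and dataset names here are just the stock ones from TextAttack’s examples; you’d swap in your own classifier and data.

```python
import transformers
from textattack import Attacker, AttackArgs
from textattack.attack_recipes import TextFoolerJin2019
from textattack.datasets import HuggingFaceDataset
from textattack.models.wrappers import HuggingFaceModelWrapper

# Load a pretrained sentiment classifier (placeholder -- use your own model).
model = transformers.AutoModelForSequenceClassification.from_pretrained(
    "textattack/bert-base-uncased-imdb"
)
tokenizer = transformers.AutoTokenizer.from_pretrained(
    "textattack/bert-base-uncased-imdb"
)
# TextAttack talks to models through a wrapper with a common interface.
model_wrapper = HuggingFaceModelWrapper(model, tokenizer)

# Build the TextFooler attack; the recipe bundles its own transformation,
# search method, and constraints (including USE similarity).
attack = TextFoolerJin2019.build(model_wrapper)

# Attack a handful of test examples and log the results.
dataset = HuggingFaceDataset("imdb", split="test")
attack_args = AttackArgs(num_examples=10)
attacker = Attacker(attack, dataset, attack_args)
attacker.attack_dataset()
```

The nice part is that recipes are compositional: you can swap out the constraints, transformation, or search method individually if the defaults don’t match the notion of “hidden” failure you care about.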
I’m also very excited about this work. Please let me know if you have any questions about adversarial/counterfactual methods in NLP and if there are any other ways I can help!