JanB comments on High-stakes alignment via adversarial training [Redwood Research report]

JanB 7 Jun 2022 16:10 UTC
LW: 10 AF: 6
Have you tried running automated adversarial attacks (in the common ML sense) on text snippets that are classified as injurious but sit near the classification threshold? I'm thinking especially of attacks that aim to preserve semantic meaning, e.g. via a framework like TextAttack.
In the paper, you write: “There is a large and growing literature on both adversarial attacks and adversarial training for large language models [31, 32, 33, 34]. The majority of these focus on automatic attacks against language models. However, we chose to use a task without an automated source of ground truth, so we primarily used human attackers.”
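But my best guess is that if you run an automated adversarial attack on a snippet that humans say is injurious, the result will quite often still be a snippet that humans say is injurious. If so, the original human label could stand in as an approximate source of ground truth for the attacked snippet.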
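To make the idea concrete, here is a minimal sketch of a greedy synonym-substitution attack in plain Python. Everything in it is a hypothetical stand-in: `toy_score` is a keyword-weight toy rather than the classifier from the paper, `TOY_SYNONYMS` is a hand-written table rather than a real synonym source, and `THRESHOLD` is an arbitrary cutoff. It does not use TextAttack's API; it only illustrates the general shape of a meaning-preserving word-swap attack.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical stand-ins. A real setup would use the trained injury
# classifier and a synonym source such as word embeddings or WordNet.
TOY_WEIGHTS: Dict[str, float] = {"stabbed": 0.6, "bled": 0.3, "knifed": 0.1}
TOY_SYNONYMS: Dict[str, List[str]] = {
    "stabbed": ["knifed"],
    "bled": ["hemorrhaged"],
}
THRESHOLD = 0.5  # assumed cutoff: score >= THRESHOLD means "injurious"


def toy_score(text: str) -> float:
    """Toy injuriousness score: sum of per-word weights, capped at 1.0."""
    return min(1.0, sum(TOY_WEIGHTS.get(w, 0.0) for w in text.lower().split()))


def greedy_synonym_attack(
    text: str,
    score: Callable[[str], float],
    synonyms: Dict[str, List[str]],
    threshold: float,
) -> Optional[str]:
    """Swap one word at a time for a synonym, always taking the swap that
    lowers the score most, until the score drops below the threshold.

    Returns the attacked text, or None if no swap lowers the score further.
    Text that already scores below the threshold is returned unchanged.
    """
    words = text.split()  # naive whitespace tokenization, no punctuation handling
    current = score(" ".join(words))
    while current >= threshold:
        best_score, best_words = current, None
        for i, word in enumerate(words):
            for substitute in synonyms.get(word.lower(), []):
                candidate = words[:i] + [substitute] + words[i + 1:]
                candidate_score = score(" ".join(candidate))
                if candidate_score < best_score:
                    best_score, best_words = candidate_score, candidate
        if best_words is None:
            return None  # attack failed: no single swap reduces the score
        words, current = best_words, best_score
    return " ".join(words)


attacked = greedy_synonym_attack(
    "He stabbed her and she bled", toy_score, TOY_SYNONYMS, THRESHOLD
)
print(attacked)
```

In this toy case the attack swaps "stabbed" for "knifed", which pushes the toy score below the cutoff even though a human reader would presumably still call the sentence injurious. That is the situation described above: the human label from the original snippet plausibly carries over to the attacked one.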