Aren’t defenses submitted to the Unrestricted Adversarial Examples Challenge incentivized to classify everything but the eligibility data set as ambiguous?
The organizers also have a held-out evaluation set on which the defenses must get an accuracy of at least 80%. It’s possible that, by submitting many, many defenses, you could get enough information about the held-out evaluation set to classify the eligibility and evaluation sets correctly while classifying everything else as ambiguous. There’s a sentence in there somewhere that says the organizers reserve the right to evaluate whether or not this is happening and to reject defenses on those grounds. (EDIT: The comment below identified the exact line I was referring to.)
You can imagine that defenses might try to notice when something adversarial has been done (e.g., perhaps you notice that you’re close to a decision boundary) and label those inputs as ambiguous. That’s a perfectly fair solution, and I think both the organizers and I would be excited to see a defense that won using such a method.
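For concreteness, here’s a rough sketch of what such an abstaining defense could look like, assuming a simple softmax-confidence threshold as the “close to a decision boundary” test. None of this is from the challenge code; the names and the threshold are my own.

```python
import numpy as np

def abstaining_predict(logits, threshold=0.9):
    """Toy abstaining defense: commit to 'bird'/'bicycle' only when confident.

    logits: array of shape (2,) -- raw scores for the two classes.
    threshold: minimum softmax probability required to commit to a label;
               inputs below it (i.e. near the decision boundary) are
               labelled 'abstain' instead.
    """
    # Softmax over the two class scores (shifted for numerical stability).
    exp = np.exp(logits - np.max(logits))
    probs = exp / exp.sum()
    if probs.max() < threshold:
        return "abstain"
    return ["bird", "bicycle"][int(np.argmax(probs))]

# A borderline input is abstained on; a clear one gets a label.
print(abstaining_predict(np.array([0.1, 0.2])))   # -> "abstain"
print(abstaining_predict(np.array([5.0, -3.0])))  # -> "bird"
```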
Briefly: Yes, but the ‘eligibility data set’ is private, and they can test a winner more at their discretion*.
*The exact line is at the end of this post, in bold.
Long: (What they say, and how to find it.)
From the website:
You can learn more details about the structure of the challenge in our paper.
I had a look:
For inputs upon which the model does not abstain, the model must never make a mistake.
More on that:
To break a defense, an attacker must submit an unambiguous image of either a bird or bicycle that the model labels incorrectly (as decided by an ensemble of human judges). Models are allowed to abstain (by returning low-confidence predictions) on any adversarial input⁶, but even one confident error will result in a broken defense.
That note 6:
Models are prevented from always abstaining through the abstaining mechanism described in Section ??
Which seems to be here:
4.2 Abstaining mechanism for the contest
In the contest, defenders attempt to create a model that never makes a confident mistake. Although the task is binary (A vs. B), models are allowed to output three labels: “Class A”, “Class B”, or “Abstain”. We define a confident mistake as the model assigning one class label, when a unanimous ensemble of human taskers give the other class label. Abstaining on all adversarial inputs is acceptable, and we allow defenses to choose when to abstain using any mechanism they desire. To prevent models from abstaining on all inputs, models must reach 80% accuracy on a private eligibility dataset of clean bird-or-bicycle images. **We reserve the right to perform additional tests to ensure defenders do not over-fit against our private held-out data [Markoff, 2015].**
Emphasis Mine.
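For reference, here’s a minimal sketch of the two conditions quoted above as I understand them: at least 80% accuracy on the private clean eligibility set, and no confident mistakes, where abstaining never counts as a mistake. This is my own reconstruction, not the organizers’ evaluation code; the function names and the defense interface (a callable returning "bird", "bicycle", or "abstain") are invented for illustration.

```python
def is_eligible(defense, eligibility_images, eligibility_labels, min_accuracy=0.80):
    """Eligibility check: >= 80% accuracy on clean bird-or-bicycle images.
    Abstaining on a clean image simply costs accuracy here."""
    correct = sum(
        defense(img) == label
        for img, label in zip(eligibility_images, eligibility_labels)
    )
    return correct / len(eligibility_images) >= min_accuracy

def is_broken_by(defense, image, unanimous_human_label):
    """A defense is broken by an image iff it makes a confident mistake:
    it outputs the opposite class label to a unanimous human ensemble.
    Abstaining is never a confident mistake."""
    prediction = defense(image)
    if prediction == "abstain" or unanimous_human_label is None:
        # Abstained, or the humans were not unanimous (ambiguous image).
        return False
    return prediction != unanimous_human_label
```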