habryka comments on High-stakes alignment via adversarial training [Redwood Research report]

habryka 6 May 2022 17:16 UTC
LW: 53 AF: 24
AF
One of the primary questions that comes to mind for me is “well, did this whole thing actually work?”. If I understand the paper correctly, while we definitely substantially decreased the fraction of random samples that got misclassified (which always seemed very likely to happen, and I am indeed a bit surprised at only getting it to move ~3 OOMs, which my guess is mostly capability related, since you used small models), we only doubled the amount of effort necessary to generate an adversarial counterexample.
A doubling is still pretty substantial, and stacking a few doublings on top of each other could potentially result in safety guarantees, but my guess is the first doubling is a lot cheaper to get, and so I feel tempted to say that at least the set of techniques explored here did not substantially increase adversarial robustness, unless I am missing something.
This doesn’t mean that there isn’t more potential work in this space that maybe does increase adversarial robustness, and I might be misreading the paper, but when I first heard about this project, I was definitely hoping for some approach that might increase robustness by many orders of magnitude, making it impossible for an unassisted person to generate a counterexample, and very very difficult to generate a counterexample even with transparency tools. But it seems like the difficulty of a human generating a counterexample stayed mostly the same, which was (at least to me) the primary outcome variable I was interested in.
I am curious whether any of the authors disagree with this assessment. My guess is most of you must, since neither the post nor the paper seems to talk about this very much.
- dmz 7 May 2022 16:50 UTC
  LW: 18 AF: 10
  AF Parent
  Excellent question—I wish we had included more of an answer to this in the post.
  I think we made some real progress on the defense side—but I 100% was hoping for more and agree we have a long way to go.
  I think the classifier is quite robust in an absolute sense, at least compared to normal ML models. We haven’t actually tried it on the final classifier, but my guess is it takes at least several hours to find a crisp failure unassisted (whereas almost all ML models you can find are trivially breakable). We’re interested in people giving it a shot! :)
  Part of what’s going on here is that IMO we made more progress on the offense side than the defense side. We didn’t measure that as rigorously, but it seemed like the token substitution tool speeds up the attack by something like 10x. I think the attack tools are one of the parts of the project I’m most excited about, and we might push even harder on them in upcoming projects.
  I think you’re correct that the amount of improvement was pretty limited by capabilities. 4,900 labels (for tool-assisted adversarial examples) just aren’t many for a 304M-parameter model. That’s one reason we don’t find it totally disturbing that the offense is winning for now; with much bigger models sample efficiency should increase dramatically, and the baseline cost of constructing the model will increase as well, making it comparatively cheaper to spend more on adversarial training.
  That said, this was just our first go and I think we can make a lot more progress right away. It’ll help to have a crisper task definition that doesn’t demand that the model understand thousands of possible ways injury could occur. It’ll help to have attacks that can produce higher volumes of data and cover a larger portion of the space. I don’t think we know how to really kill it yet, but I’m excited to keep working on it.
- Charbel-Raphaël 7 May 2022 1:01 UTC
  1 point
  AF Parent
  I do not understand “we only doubled the amount of effort necessary to generate an adversarial counterexample.” Aren’t we talking about 3 OOMs?
  - Charbel-Raphaël 7 May 2022 1:08 UTC
    LW: 7 AF: 4
    AF Parent
    Ah, “The tool-assisted attack went from taking 13 minutes to taking 26 minutes per example.”
    
    Interesting. The change in-distribution (3 OOMs) does not much influence the out-of-distribution result (×2).
    - paulfchristiano 7 May 2022 19:18 UTC
      LW: 4 AF: 3
      AF Parent
      I think that 3 orders of magnitude is the comparison between “time taken to find a failure by randomly sampling” and “time taken to find a failure if you are deliberately looking using tools.”
      - dmz 8 May 2022 20:25 UTC
        LW: 3 AF: 2
        AF Parent
        I read Oli’s comment as referring to the 2.4% → 0.002% failure rate improvement from filtering.
          - paulfchristiano 8 May 2022 22:25 UTC
            LW: 4 AF: 3
            AF Parent
            Ah, that makes sense. But the 13 minutes → 26 minutes is from adversarial training holding the threshold fixed, right?
            - dmz 9 May 2022 17:21 UTC
              LW: 3 AF: 2
              AF Parent
              Indeed. (Well, holding the quality degradation fixed, which causes a small change in the threshold.)
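
To make the two figures in this thread directly comparable, here is a minimal sketch that puts both on the same scale. It uses only numbers quoted above (the 2.4% → 0.002% failure rate and the 13 → 26 minute attack time); the function name `orders_of_magnitude` is made up for illustration, and nothing here comes from the paper's own analysis.

```python
import math


def orders_of_magnitude(larger: float, smaller: float) -> float:
    """Return log10 of the ratio larger / smaller."""
    return math.log10(larger / smaller)


# In-distribution failure rate, in percent, before and after filtering.
failure_rate_ooms = orders_of_magnitude(2.4, 0.002)

# Minutes per tool-assisted adversarial example, after and before adversarial training.
attack_time_ooms = orders_of_magnitude(26, 13)

# How many doublings of attack time would equal the failure-rate improvement.
doublings_to_match = math.log2(2.4 / 0.002)

print(f"failure rate improvement: {failure_rate_ooms:.2f} OOMs")  # 3.08
print(f"attack time improvement:  {attack_time_ooms:.2f} OOMs")   # 0.30
print(f"doublings to match:       {doublings_to_match:.1f}")      # 10.2
```

On these numbers, roughly ten stacked doublings of attack time would be needed to match the in-distribution improvement, which is one way to read the gap discussed in the top-level comment.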