William_S comments on High-stakes alignment via adversarial training [Redwood Research report]

William_S 8 May 2022 17:12 UTC
LW: 5 AF: 4
Take after talking with Daniel: for future work, I think it will be easier to tell how well your techniques are working if you are in a domain where you care about minimizing both false-positive and false-negative error, regardless of whether that’s analogous to the long-term situation we care most about. If you care about both kinds of error, then the baseline of “set a really low classifier threshold” wouldn’t work, so you’d be starting from a regime where it is a lot easier to sample errors, and hence easier to measure differences in performance.
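To make the threshold point concrete, here is a minimal sketch on synthetic data. It is not taken from the report: the Gaussian score distributions, the sample sizes, and the helper name `error_counts` are all assumptions chosen for illustration. It counts both kinds of error at a very low threshold and at a balanced one.

```python
import random


def error_counts(scores, labels, threshold):
    """Count (false positives, false negatives) when flagging score >= threshold.

    Label 1 means the example should be flagged; label 0 means it should pass.
    """
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp, fn


# Hypothetical classifier scores: positives centred at 0.7, negatives at 0.3.
rng = random.Random(0)
labels = [1] * 5000 + [0] * 5000
scores = [rng.gauss(0.7 if y == 1 else 0.3, 0.15) for y in labels]

# Compare a "really low" threshold with a balanced one.
results = {t: error_counts(scores, labels, t) for t in (0.05, 0.5)}
for threshold, (fp, fn) in results.items():
    print(f"threshold={threshold}: false positives={fp}, false negatives={fn}")
```

Under these assumptions, the low threshold leaves almost no false negatives to sample, at the cost of flagging most negatives. The balanced threshold produces plenty of both kinds of error, so differences between classifiers show up with far fewer samples.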
- dmz 8 May 2022 20:24 UTC
  LW: 3 AF: 2
  Yeah, I think that might have been wise for this project, although the ROC plot suggests that the classifiers don’t differ much in performance even at noticeably higher thresholds.
  For future projects, I think I’m most excited about confronting the problem directly by building techniques that can succeed in sampling errors even when they’re extremely rare.