ryan_greenblatt comments on Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy

ryan_greenblatt 26 Jul 2023 20:50 UTC
LW: 9 AF: 5
4
AF
This post doesn’t discuss how to do oversight, just how to evaluate whether or not your oversight is working.

I’m not really sure what you mean by “oversight, but add an epicycle” or how to determine if this is a good summary.

Consider a type of process for testing whether or not planes are safe. You could call this “safe planes, but add an epicycle”, but I’m not sure what this communicates.
- ryan_greenblatt 26 Jul 2023 21:00 UTC
  LW: 11 AF: 6
  8
  AF Parent
  There’s a separate question of “is this post interesting?”.
  
  I don’t think this post is particularly interesting or inventive: the basic arguments here aren’t very far from what I came up with after thinking about how to evaluate oversight for something like 5 hours.
  
  But I haven’t seen this exact approach discussed in detail, and it does seem like these arguments are an important part of why I expect evaluating average oversight quality to be reasonably technically feasible. (Note that evaluating average oversight behavior isn’t sufficient for evaluating whether or not a given training and deployment approach will result in good outcomes.)
  
  If you think that empirically evaluating oversight is useful, my guess is that the basic arguments and approach discussed in this post are important (though possibly obvious).
- johnswentworth 26 Jul 2023 21:32 UTC
  LW: -2 AF: -1
  −14
  AF Parent
  I’m not really sure what you mean by “oversight, but add an epicycle” or how to determine if this is a good summary.
  Something like: the OP is proposing oversight of the overseer, and it seems like the obvious next move would be to add an overseer of the oversight-overseer. And then an overseer of the oversight-oversight-overseer. Etc.
  And the implicit criticism is something like: sure, this would probably marginally improve oversight, but it’s a kind of marginal improvement which does not really move us closer in idea-space to whatever the next better paradigm will be which replaces oversight (and is therefore not really marginal progress in the sense which matters more). In the same way that adding epicycles to a model of circular orbits does make the model match real orbits marginally better (for a little while, in a way which does not generalize to longer timespans), but doesn’t really move closer in idea-space to the better model which eventually replaces circular orbits (and the epicycles are therefore not really marginal progress in the sense which matters more).
  - Buck 27 Jul 2023 15:51 UTC
    LW: 10 AF: 5
    3
    AF Parent
    the OP is proposing oversight of the overseer,
    I don’t think this is right, at least in the way I usually use the terms. We’re proposing a strategy for conservatively estimating the quality of an “overseer” (i.e. a system which is responsible for estimating the goodness of model actions). I think that you aren’t summarizing the basic point if you try to use the word “oversight” for both of those.
    - johnswentworth 27 Jul 2023 16:30 UTC
      LW: 2 AF: 2
      0
      AF Parent
      That’s useful, thanks.