There’s a separate question of “is this post interesting?”.
I don’t think this post is particularly interesting or inventive: the basic arguments here aren’t very far from what I came up with after thinking about how to evaluate oversight for something like 5 hours.
But I haven’t seen this exact approach discussed in detail, and it does seem like these arguments are an important part of why I expect evaluating average oversight quality to be reasonably technically feasible. (Note that evaluating average oversight behavior isn’t sufficient for evaluating whether or not a given training and deployment approach will result in good outcomes.)
If you think that empirically evaluating oversight is useful, my guess is that the basic arguments and approach discussed in this post are important (though possibly obvious).