My use of the phrase “Super-Human Feedback”
I’ve taken to calling Debate, Amplification, and Recursive Reward Modeling “super-human feedback” (SHF) techniques. This post just introduces that terminology and explains what I mean by it and why I like it.
By calling something SHF I mean that it aims to outperform a single, unaided human H at the task of providing feedback about H’s intentions for training an AI system. I like this framing because it makes clear why these three approaches naturally group together, and it might inspire us to consider what else could fall into that category (a simple example is just using a team of humans).
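To make the team-of-humans example a bit more concrete, here is a toy sketch of how a team could beat a single rater. It is only an illustration under strong assumptions that are mine, not part of the definition above: feedback is a binary judgment, each rater is correct with the same probability `p`, raters err independently, and the team answers by majority vote. The function name `majority_accuracy` is made up for this example.

```python
from math import comb


def majority_accuracy(p: float, n: int) -> float:
    """Chance that a strict majority of n raters gives the correct feedback.

    Assumptions (illustrative only): binary feedback, every rater is
    independently correct with probability p, and n is odd so there are no ties.
    """
    if n % 2 == 0:
        raise ValueError("use an odd team size to avoid ties")
    # Sum the binomial probabilities of k correct raters for every k > n/2.
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


single = majority_accuracy(0.7, 1)  # one unaided human
team = majority_accuracy(0.7, 3)    # majority vote of three humans
print(f"single rater: {single:.3f}, team of three: {team:.3f}")
```

With `p = 0.7`, the team of three is right more often than the single rater, which is all that “super-human” requires in the sense used here. In practice human errors are correlated, so the real gain would be smaller than this toy model suggests.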
I think SHF is very similar to “scalable oversight” (as discussed in Concrete Problems in AI Safety), but maybe different because:
1) It doesn’t imply that the approach must be scalable.
2) It doesn’t require that feedback be expensive, i.e. it also applies to settings where human feedback is cheap but SHF can do better than that cheap feedback.