Quick thoughts on “scalable oversight” / “super-human feedback” research

The current default view seems to roughly be:

  • Inner alignment is more important than outer alignment (or, alternatively, this distinction is bad/sub-optimal, but basically it’s all about generalizing correctly)

  • Scalable oversight is the only useful form of outer alignment research remaining.

  • We don’t need to worry about sample efficiency in RLHF: in the limit we just pay everyone to provide feedback, and in practice even a few thousand samples (or a “constitution”) seems ~good enough.

  • But maybe it’s not good? Because it’s more like capabilities research?

A common example used for motivating scalable oversight is the “AI CEO”.

My views are:

  • We should not be aiming to build AI CEOs

  • We should be aiming to robustly align AIs to perform “simpler” behaviors that unaided humans (or humans aided by more conventional tools; not, e.g., AI systems trained with RL to do highly interpretive work) feel they can competently judge.

  • We should aim for a situation where there is broad agreement against building AIs with more ambitious alignment targets (e.g. AI CEOs).

  • From this PoV, scalable oversight does in fact look mostly like capabilities research.

  • However, scalable oversight research can still be justified because “If we don’t, someone else will”. But this type of replaceability argument should always be treated with extreme caution. The reality is more complex: 1) there are tipping points where it suddenly ceases to apply, and your individual actions actually have a large impact on norms; 2) the details matter, and the tipping points are in different places for different types of research, applications, etc.

  • It may also make sense to work on scalable oversight in order to increase the robustness of AI performance on tasks humans feel they can competently judge (“robustness amplification”). For instance, we could use unaided human judgments and AI-assisted human judgments as safety filters, and not deploy a system unless both processes conclude it is safe (a rough sketch of this kind of conjunctive filter appears after this list).

  • Getting AI systems to perform simpler behaviors safely remains an important research topic, and will likely require improving sample efficiency; the sum total of available human labor will be insufficient for robust alignment, and we will probably also need different architectures / hybrid systems of some form.

  • ETA: the main issue I have with scalable oversight is less that it advances capabilities per se, and more that it seems to raise a “chicken-and-egg” problem: the arguments for safety/alignment end up being somewhat circular (“this system is safe because the system we used as an assistant was safe”), but I don’t think we’ve solved the “build a safe assistant” part yet, i.e., we don’t have the base case for the induction.
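
To make the “robustness amplification” filter above a bit more concrete, here is a minimal sketch in Python of the conjunctive deployment gate I have in mind: deploy only if both an unaided-human review process and an AI-assisted human review process conclude the system is safe. The function names (`conjunctive_safety_gate`, `unaided_review`, `ai_assisted_review`) and the stand-in review functions are purely illustrative assumptions, not a reference to any existing system or pipeline.

```python
from typing import Callable

# Hypothetical sketch: a deployment gate that requires BOTH an unaided-human
# review process AND an AI-assisted human review process to judge the system
# safe before deployment. Names and signatures are illustrative only.

def conjunctive_safety_gate(
    candidate_system: object,
    unaided_review: Callable[[object], bool],
    ai_assisted_review: Callable[[object], bool],
) -> bool:
    """Return True (deploy) only if both review processes conclude 'safe'."""
    unaided_verdict = unaided_review(candidate_system)
    assisted_verdict = ai_assisted_review(candidate_system)
    # The point of requiring both: AI assistance may amplify robustness on
    # tasks humans already feel competent to judge, while the unaided check
    # guards against the assistant itself being misaligned or misleading.
    return unaided_verdict and assisted_verdict


if __name__ == "__main__":
    # Toy usage with stand-in review functions.
    system = {"name": "toy-model"}
    deploy = conjunctive_safety_gate(
        system,
        unaided_review=lambda s: True,       # e.g. a panel of human raters
        ai_assisted_review=lambda s: False,  # e.g. critique-assisted raters
    )
    print("deploy" if deploy else "do not deploy")
```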