Artificial Sandwiching: When can we test scalable alignment protocols without humans?

Epistemic status: Not a fleshed-out proposal. Brainstorming/​eliciting ideas.

Thanks to Ben Mann, Pablo Moreno, and Jared Kaplan for feedback on early drafts.

Overview

  • I’m convinced sandwiching—the experimental protocol from Ajeya Cotra’s The case for aligning narrowly superhuman models—is valuable, and I’m in the process of setting up some concrete sandwiching experiments to test scalable oversight ideas.

  • Sandwiching experiments are generally fairly slow:

    • You have to design and pilot a strategy that allows humans to use (or oversee) a model for a task that they can’t do well themselves. The details matter here, and this can often take many iterations to get right.

    • Then, you need a bunch of humans actually try this. Even for very simple tasks, this is a high-cognitive-load task that should take at least tens of minutes per instance.

    • You have to repeat this enough times to measure average performance accurately.

  • I’m visiting Anthropic this year for a sabbatical, and some of my sandwiching work is happening there. Anthropic’s biggest comparative advantage (like that of similar teams at DeepMind and OpenAI) is easy access to near-state-of-the-art LMs that are fine-tuned to be helpful dialog agents.

  • In that context, I’ve heard or encountered this question several times: Can we speed up [some experiment I’m proposing] by replacing the non-expert human with a weaker LM?

    • This obviously doesn’t achieve the full aims of sandwiching in general, but it’s often hard to find a decisive rebuttal for these individual instances.

    • More broadly, I think there’s likely to be a significant subset of worthwhile sandwiching experiments that can be trialed more quickly by using an intentionally weakened model as a proxy for the human.

    • Which experiments these are, precisely, has been hard for me to pin down. This post is an attempt to organize my thoughts and solicit comments.

Background: Standard sandwiching (in my terms)

  • Prerequisites:

    • A hard task: A task that many humans would be unable to solve on their own.

    • A capable but misaligned language model assistant: A model that appears to have the skills and knowledge needed to solve the task better than many humans, but that does not reliably do so when prompted.

  • A non-expert human: Someone who can’t solve the task on their own, but will try to solve it using the assistant and some scalable alignment strategy.

  • [Secondary] Expert human: Someone who can solve the task well, and represents a benchmark for success.

    • In many cases, we’ll just measure accuracy with static test datasets/​metrics rather than bringing in experts at experiment time.

  • Research protocol:

    • My framing (minimum viable experiment):

      • Search for scalable alignment protocols that allow the non-expert human to use or train the assistant to do as well as possible on the task.

    • Alternate framing (more steps, closer to the original blog post):

      • Search for scalable alignment protocols by which the non-expert human can train the assistant to perform the task.

      • Run the same protocol with the expert human, and verify that the results are the same. This demonstrates successful (prosaic) alignment for the given assistant and task.

  • Example (task, non-expert human) pairs:

  • Goals:

    • Get evidence about how scalable alignment strategies succeed and fail, as early as possible

    • Get practice trying to actually align systems in a maximally realistic setting

  • Difficulties:

    • Setting up experiments with actual human annotators—at least at a large enough scale to get clear quantitative result—usually takes several weeks.

    • Current models will be misaligned in different ways from potentially-transformative models. Many apparent alignment failures will actually be largely due to capabilities gaps.

Artificial sandwiching: Replacing the non-expert human

  • Role substitution:

    • Non-expert model: A second language model which we constrain in some way such that it can’t solve the task on its own.

  • Example experiment(s):

    • Debate: Use simple prompting techniques to get a small dialogue model to judge a debate between two more capable ones on MMMLU exam questions, where one of the debaters is trained or prompted to advocate for the wrong answer. Better debate protocols should yield higher success rates.

    • Short-context/​long-context: Use simple prompting techniques to get a dialog model with a short context-length limit to interact with a longer-context assistant model in order to answer a question about a long piece of writing that they can both access. Make it clear in the prompt that the long-context model isn’t always consistent or trustworthy, and that the short-context model will have to cross-examine it extensively.

    • [Your idea here.]

  • Differences/​results of this substitution:

    • Experiments are much faster: Once we find a promising basic setup, we can test out new strategies in only the few hours it takes to run a workflow with a large model.

    • We slightly expand the set of tasks we can work with: We can now use any task that some constrained/​weakened LM is unable to solve on its own. (Though the LM still needs to be good enough to make a real attempt at playing the human role in some scalable alignment strategy.)

    • We don’t get any practice in the logistics of working with real humans.

    • We don’t get any practice with the specific types of goal misalignments that emerge between humans and LM-based agents.

    • Are there other important differences or limitations that should color which experiments we try and how we interpret the results?

  • Key difficulty/​uncertainty:

    • It’s less clear that the two participants in our alignment strategy are actually misaligned: If we simply use smaller and larger variants of the same present-day language model dialog agent, is there an interesting sense in which they have different goals?

      • If the difference between the two is only one of capabilities, then the correct strategy for the weaker non-expert model is trivial: It should always just ask the assistant to do the task for them, while making no attempt at oversight.

      • We may need to inject misalignment manually:

        • Many debate protocols do this naturally: The non-expert model is interacting with a larger assistant which is sometimes explicitly prompted or incentivized to be misleading.

        • This only lets us test scalable alignment protocols that involve explicitly adversarial/​zero-sum debates, though.

      • Is there an alternative strategy for misaligning the two models that lets us test a wider range of strategies?

        • The short-context/​long-context example above attempts a simple version of this, but it’s likely to be brittle.

      • Will we learn much without an explicit source of misalignment between the two? If we simply prompt/​train a smaller model to lean on a larger one as a resource, can we learn anything important from those results?

Closing thoughts

  • To the extent that artificial sandwiching can get off the ground, it offers a pretty valuable tool: It lets you sanity check alignment strategies much, much more quickly.

  • I’m still excited about standard sandwiching, and it’ll be the priority in my own work: At some point, most plausible alignment processes will involve at least some input from actual human stakeholders, and we currently have relatively little practice at collecting and using that input.

  • If you want to optimally deploy your efforts on standard sandwiching, it’ll be valuable to understand where you expect artificial sandwiching to work.