A summary of aligning narrowly superhuman models

This post was written under Evan Hubinger’s direct guidance and mentorship, as a part of the Stanford Existential Risks Institute ML Alignment Theory Scholars (MATS) program. Where not indicated, ideas are summaries of the documents mentioned, not original contributions. I am thankful for the encouraging approach of the organizers of the program, especially Oliver Zhang.


In the post “The case for aligning narrowly superhuman models”, Ajeya Cotra advocates for performing technical alignment work on currently available large models using human feedback, in domains where (some) humans are already outperformed (in some aspects). The direct benefit of this line of work could be testing out conceptual work, highlighting previously unknown challenges and eliciting new scalable, general solutions, as well as indirectly moving the field of ML in a better direction and building the community and infrastructure/​tooling of AI Safety. This post summarizes the original post, some related work and some objections.

Short intro

Progress in ML could allow us to do technical alignment research that is closer to the “real problem” than specific toy examples. In particular, one of the main promising directions (among others such as Interpretability and Truthful and honest AI) is discovering better methods of giving feedback to models more capable than us. This might be easy for some concrete, narrow tasks, such as playing Go (there is an algorithmic way to decide if a model is playing Go better than another model), but for more “fuzzy” tasks (such as “devise a fair economic policy”), we can’t evaluate the model according to such a gold standard.

A couple of examples of work falling into this category (See Example tasks):

  • Train a language model to accurately explain things about a field that the feedback providers are not familiar with.

  • Train an RL agent to act well in an environment where the RL agent can observe more information than the feedback providers can.

  • Train a multilingual model to translate between English and a foreign language that the feedback providers do not know. [Summary after Alignment Newsletter]

These tasks call for more sophisticated ways of generating feedback for the models. Some of the approaches that can be used include debate or decomposing the question into subtasks. See section Possible approaches to alignment.

To test these methods, Ajeya also introduces the concept of sandwiching: produce a model using data and feedback from humans skilled at a task (to establish a gold standard), then use a model and some of these alignment approaches to achieve the same performance with data and feedback from people not skilled in the area.

The most important benefits this line of work could provide are:

  • Gives feedback to conceptual work: it can be good testing grounds for existing ideas such as scaling human feedback with debate or decomposition, highlights problems with existing approaches, and elicits new proposals

  • Moves the field of ML to a better position: we will have a better understanding of ML models and their shortcomings, and perhaps will be able to trust them more. [In particular, if we would like to use the help of ML systems for alignment (as in IDA or automated auditing), we need assistant systems that are already reliable.]

  • Advances the development of alignment-relevant tools, infrastructure and know-how, such as details of how to work with human evaluators

  • Has field-building benefits: there may be engineers and researchers who are interested in working on AI safety but prefer technical work. [Gabor’s note: kickstarting research in this direction could also make alignment more appealing for academia (especially if there are measurable goals).]

Main critiques of the proposal and cruxes on the topic include (See section Critiques and cruxes based on the comments on the post and Comments by MIRI):

  • Timelines

  • Current models are not relevant/​not superhuman in a relevant sense

  • Does not address deception

  • May not be neglected due to economic incentives

  • Could we evaluate concrete conceptual proposals instead (such as Automated auditing)?

The post has been well-received in the community. Open Philanthropy has since accepted a round of grant proposals. OpenAI, Redwood Research, Anthropic and Ought are already doing related work.

Example tasks

Example tasks from Ajeya’s proposal in Open Philanthropy’s Request for Proposals include:

  • How could human evaluators that are unfamiliar with a certain subject (like computer science or economics) effectively give feedback that incentivizes a model to accurately explain things about the subject?

  • How could human evaluators effectively provide feedback to an RL agent acting in a virtual environment that is partially occluded from the evaluators, or operates based on internal dynamics that the evaluators don’t understand?

    • [Imo, it would also be interesting to take an agent playing, let’s say, StarCraft and make it play in a certain way, maybe even from just a text description.]

  • How could human evaluators effectively give feedback that incentivizes a model to accurately translate between English and a foreign language that the evaluators don’t understand?

  • How would a model behave if it’s trained using a combination of an outcome metric and a human feedback signal? What happens if the outcome metric (e.g. “getting the most money in a negotiation game”) incentivizes doing something undesired (e.g. “lying”) that the human evaluators are unable to detect consistently?

Ajeya argues that the necessity to use humans and a “more useful” model at the end are indicators of an interesting task.

I personally would add a few ideas:

  • For a given fuzzy task, try to build a model that not just solves it, but comments on how it arrived at the solution, or why it thinks it is the right one. (For example, build a sentiment classifier using GPT-3, which can also reason about what its choice relied on.) As a particularly interesting subproblem we could try to build a model that has some “self-introspection” capabilities, and could correctly predict what change of inputs would change its decision/​outputs. I think MIRI’s Visible Thoughts project is somewhat related to this idea.

  • In some tasks, there might be side effects of the solution that the human is not aware of, but would consider important. Can we get models to tell us about these important properties, even if we did not explicitly ask for them? For example, imagine having asked Codex to generate a website with a particular design: perhaps it solves the task “correctly”, but the website has a security vulnerability. If Codex in some way “knows” about this latently, it would be useful for it to surface this information even if we did not ask for it; it should understand that this is part of our reward model.

Existing work

There is some already existing work in this general direction, most notably OpenAI using human feedback to finetune GPT-3 for summarisation, web search and following instructions via reward modelling. Anthropic recently published a similar paper, but on a more diverse set of tasks.

Redwood Research is working on a concrete problem in this space, namely getting a large language model to output stories which never include someone getting injured.

Possible approaches to alignment

In the original post, what is considered alignment is intentionally left very open-ended, as part of the point is exploring promising proposals.

In order to make the research useful in the long-term, projects should strive to be:

  • General: Instead of using task-specific hacks, they should find task-agnostic methods that work generally.

  • Scalable: Make the most out of human feedback, and make it not prohibitively costly to use for more difficult tasks.

Some of the ideas worth exploring include:

  • Creating helper models which break down a question into subproblems

  • Reward modelling (and perhaps recursive reward modelling)

  • Implementing debate (See AI Safety via Debate)

  • Training on different amounts of human thoughtfulness and generalising further

  • Training helper models that aid human thinking (and evaluation)

  • Figuring out smart ways to select and aggregate evaluations

In addition to these, [in my opinion] it could be interesting to make the feedback process more interactive, informative and dynamic.

In the above-mentioned OpenAI articles, feedback is gathered by asking human labelers to compare two particular solutions, then training a reward model on the comparisons and finetuning the original model against the reward model.
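To make the comparison step concrete, here is a minimal sketch of fitting a reward model to pairwise comparisons. It is not the OpenAI implementation: the linear reward over hand-made feature vectors, the names `train_reward_model` and `reward`, the example features and all hyperparameters are my own illustrative assumptions. The loss is the standard logistic (Bradley-Terry style) loss on the reward difference between the preferred and the rejected solution.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def reward(weights, features):
    """Hypothetical linear reward model: a weighted sum of solution features."""
    return sum(w * f for w, f in zip(weights, features))


def train_reward_model(comparisons, n_features, lr=0.1, epochs=200, seed=0):
    """Fit reward weights from (preferred, rejected) feature-vector pairs.

    Minimises -log sigmoid(reward(preferred) - reward(rejected)) with plain SGD.
    """
    rng = random.Random(seed)
    weights = [0.0] * n_features
    data = list(comparisons)
    for _ in range(epochs):
        rng.shuffle(data)
        for preferred, rejected in data:
            diff = [p - r for p, r in zip(preferred, rejected)]
            margin = reward(weights, diff)  # reward is linear, so this is the reward gap
            # Gradient of the loss w.r.t. the weights is -(1 - sigmoid(margin)) * diff.
            step = lr * (1.0 - sigmoid(margin))
            weights = [w + step * d for w, d in zip(weights, diff)]
    return weights


# Made-up data: each solution is described by [fluency, accuracy] (assumed features),
# and in every pair the labelers preferred the first solution.
COMPARISONS = [
    ([0.2, 0.9], [0.8, 0.1]),
    ([0.5, 0.8], [0.5, 0.3]),
    ([0.9, 0.7], [0.1, 0.2]),
]

weights = train_reward_model(COMPARISONS, n_features=2)
for preferred, rejected in COMPARISONS:
    # Check whether the learned reward agrees with each human comparison.
    print(reward(weights, preferred) > reward(weights, rejected))
```

In the real setup the reward model is itself a large network over text rather than a linear function of two features, and the finetuning step against the learned reward is omitted from this sketch.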

I would be excited to see approaches improving on this by:

  • Letting the model query feedback multiple times in solving a particular task, similarly to an assistance game. In current approaches, feedback is gathered by the outer training algorithm, and the model basically just sees it as training data. However, for more complex tasks, asking for feedback in the process of solving the task might be useful, or iterating over a solution with improvements might be desired. (Imagine asking Codex to write a sorting function, but then asking it to use a different algorithm, a different sorting predicate, a different variable name.)

  • Gathering more informative feedback than just “binary better or worse”. Perhaps human labelers could provide textual evaluations, and we could teach our system to interpret that. (Ideally, this would mean humans could reward or discourage behaviors more granularly, such as saying “Write in a more positive tone”.)

  • Implementing ways the model could ask for a specific piece of information. In current approaches, it is usually the outer training algorithm deciding which examples to show to humans, or what feedback to get. In a more iterative setup, the model could ask specific questions regarding the solution. (For example, in writing a story, it could ask “Can I use swear words?”). This seems somewhat related to “breaking down the question to subproblems”, but not exactly the same.

Test our approach: sandwiching

To test a given alignment approach in practice, we can establish a baseline performance using training data and feedback from “empowered” humans, and then try to get as close as possible to that performance with “non-empowered” humans who use and provide feedback to our model.

There are different ways we can make an “empowered” and “non-empowered” set of humans:

  • A set of humans can have field expertise

  • A set of humans can have access to more resources/​information (more time thinking/​consulting the Internet/​consulting a helper model)

  • A set of humans can achieve better performance via discussion
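One possible way to score a sandwiching experiment is sketched below. The metric is my own assumption, not something defined in the original post: besides the “empowered” baseline and the score reached by “non-empowered” humans using the alignment approach, it assumes we also measure what the non-empowered humans achieve without the approach, and reports which fraction of the gap was closed. The function name `gap_recovered` and the example scores are hypothetical.

```python
def gap_recovered(unaided_score, assisted_score, empowered_score):
    """Fraction of the empowered/non-empowered performance gap that was closed.

    unaided_score:   non-empowered humans without the alignment approach (assumed extra baseline)
    assisted_score:  non-empowered humans using the alignment approach
    empowered_score: baseline established with empowered humans
    Returns 0.0 if the approach did not help and 1.0 if it matched the empowered baseline.
    """
    gap = empowered_score - unaided_score
    if gap <= 0:
        raise ValueError("the empowered baseline must exceed the unaided score")
    return (assisted_score - unaided_score) / gap


# Made-up accuracies on some held-out task
print(gap_recovered(unaided_score=0.5, assisted_score=0.75, empowered_score=1.0))
```

A value close to 1 would suggest that the approach lets non-empowered humans supervise the model about as well as empowered ones on that task. It says nothing by itself about whether the approach generalises to tasks where no empowered group exists.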

Critiques and cruxes

This particular summary of the discussion reflects how I interpreted arguments, and some of them are my own.

How relevant is currently available “narrow alignment work” to TAI alignment

Most of the benefits of this line of work hinge on how good a “training ground” aligning current models is for aligning more powerful models. In my opinion the following factors, if true, would make the case especially strong:

Short timelines

If something like the scaling hypothesis works, and we can use fairly similar approaches to current ones to reach transformative AI, then

  • We have less time to come up with a working solution to alignment, and need very tight feedback loops to progress on conceptual research

  • Working with current models is more likely to teach us valuable lessons, as current models are similar to really powerful ones

  • It seems even more important to align each of the models we build, because we can expect them to be more powerful, and have greater impact

Current models are already narrowly superhuman in a relevant sense

Fixing a broken pocket calculator (which is undoubtedly superhuman), or even making Google output better results isn’t traditionally considered a field of AI Alignment. Some people view GPT-3 as more of a fact-retrieval engine. If that is the case, maybe “alignment work” on GPT-3 is more similar to working on Google Search. Or, perhaps, as a less theoretical example, one could argue that transformative AI is likely to be agentic, and since language models are not agentic, a number of issues (such as inner alignment) are just not applicable. (The proposal does not limit itself to language models, but this counterargument applies to other models as well.)

That being said, models such as GPT-3 or AlphaZero might have superhuman qualities that are relevant, such as a superhumanly rich latent world model, or superhuman ability to “evaluate” multiple plans/​concepts. (Many more than a human can hold in working memory.)

No phase-changes

While sandwiching is an interesting idea, Eliezer advocates for caution: maybe we can align current weak models with “weak” humans using some clever techniques, but there is a chance these techniques break down once the model “figures out how to manipulate humans”/“hacks itself”/“changes its own environment”/“is just optimising more strongly”.

Even if future models are going to be similar to current ones in architecture, if simply stronger capabilities introduce entirely new types of risk, this line of work is less valuable. Deception seems like a very good example of a risk that we can’t experience with current models.

This is a subset of the previous point, but an important one.

Transparency tools are possible

If we had strong transparency tools, a lot of suggested concrete avenues would open for alignment work: we could provide better feedback to our models, verify their inner alignment (in particular, prevent deception) and audit (see Automated Auditing) them for robustness failures. If we have a chance at getting better at these techniques, concrete research in this direction involving current models is very valuable.


One could argue that this line of work is not neglected: industry would do it anyway, as it has a clear incentive to make its models more useful.

Ajeya counters that while there is related work, it is not exactly aimed at solving alignment in the long-term. (So for example they would not evaluate their approaches through sandwiching, or would cut corners and choose hacky solutions instead of general ones.)

In addition, pushing for more human-feedback research could, by setting a precedent and demonstrating that some approaches work:

  • Speed up the field, and make industry perhaps more focused on aligning their solutions (since there are demonstrated solutions)

  • Allow smaller labs to start work in a similar area, by making cheap alignment solutions available

  • Set the norms in the field, setting “good practices” (e.g. everyone has to align their models)

Open-ended alignment research vs evaluating concrete proposals

It has been suggested that we could instead evaluate concrete conceptual proposals, such as ascription universality, automated auditing, making models more honest, HCH or Debate.

To the extent that these proposals are already implementable, they indeed seem like very good proposals for “aligning narrowly superhuman models”. However, it might be the case that these proposals are hard to implement precisely without direct guidance from a conceptual researcher like Paul Christiano.

It also strikes me as a strong argument that conceptual work can benefit from trying out things in practice, and we might discover new approaches previously not considered. (Thus allowing more open-ended exploration makes sense.)
