A Review of Weak to Strong Generalization [AI Safety Camp]

Thank you to everyone in AI Safety Camp group 22 for the discussions and suggestions. In particular, thank you to Bogdan Ionut Cirstea, Vassil Tashev, and Jaeson Booker.


The goal of AI Safety Camp Team #22 is to assess how promising automating alignment research is (see source 23). We have decomposed the problem into several sub-problems. One of my teammates (Vassil Tashev) and I focused a couple of weeks of reading and discussion on weak-to-strong generalization, a research direction that OpenAI’s superalignment team explored in their first paper, published in December 2023. Here we present a comprehensive review and reading list for weak-to-strong generalization. Our aim is to assess whether this research direction is promising for the goal of creating a roughly human-level, aligned, automated alignment researcher (source 16), which appears to be OpenAI’s superalignment team’s alignment plan (source 25). We believe this may be the most scalable alignment research direction.

The Problem

Current alignment techniques, such as reinforcement learning from human feedback (RLHF), rely on human feedback. This will break down when we try to align models more capable than humans, because the human feedback data to draw on will be poor. Humans will have difficulty robustly evaluating such models’ responses, since strong capabilities are harder to evaluate than subhuman ones. Imagine evaluating thousands of lines of code that a possibly superhuman model has written, and rating whether the model has done as well as it could.

As an analogy, consider handing a 17th-century engineer four alternative schematics and instructing them to select (thumbs up) the machine that cools air, where one is a modern air conditioner and the other three are technical diagrams for heating devices. The 17th-century engineer lacks the knowledge to understand refrigeration or the temperature-pressure relation (a discovery made in 1802), and might select the cooling machine about as well as a chimpanzee or dog would (25% of the time). Just as our current scientific knowledge exceeds that of a 17th-century person, an agent more capable than humans could search for strategies through causal domains that humans do not currently model. (source 20)

By definition, we do not have ground truth labels to questions we have yet to answer. If one of our more capable models gave us a program beyond the understanding of humanity’s greatest software engineers, how could we ensure that the program does what we’re interested in? How can we supervise systems above human level [on some task] when we have difficulty determining the correctness of model outputs?

As discussed in (source 11), we have two orthogonal classes of approach to this problem, along with combinations of the two:

  1. Scalable oversight (SO), or making humans better evaluators. The goal here is to make the supervision signal stronger.

  2. Weak to strong generalization (W2SG), or making models that generalize better from our weak / imperfect labels.

  3. Various combinations of (1), (2), and their variants (debate + W2SG, task decomposition + SO, SO on policies learned with W2SG, etc.) (source 15)

Here we focus on approach #2 with regards to automating alignment research.

Reasons for Optimism

  1. In the past two months, there have been significant empirical advances:

    1. OpenAI’s superalignment team’s paper (source 1) defined the weak-to-strong learning problem: is it possible to elicit the full capabilities of a more capable model with weak supervision? The experiment in this paper involves finetuning large (strong) pretrained models to generalize well from supervision by a smaller, less accurate (weak) model. In some ways, this setup is analogous to the case where humans are the weak supervisors and more capable “superhuman models” are the strong students. We believe the most impressive result is using a GPT-2-level model to elicit most of GPT-4’s capabilities (~GPT-3.5-level performance). The strong model generalized correctly even on questions the weak model failed at. It is not obvious why this should be the case, as models empirically imitate their training data. While this result is interesting, our takeaway is that the empirical results alone are not groundbreaking. OpenAI includes a useful list of disanalogies and limitations of their current setup, and of ways it fails to resemble superalignment; for instance, more powerful models in the future may be better at imitation and reproduce incorrect labels. However, this paper introduced the weak-to-strong learning problem and set the stage for iteration on eliciting strong capabilities with weak supervision. This reframing of the alignment problem is also tractable and can be studied empirically, whereas experiments cannot yet be run on a superhuman AI. OpenAI clearly demonstrated a case of weak reward signals eliciting strong capabilities, and improving their results appears tractable, especially with creative engineering solutions (e.g. OpenAI’s auxiliary confidence loss).

    2. Source 2 focuses on easy-to-hard generalization, a special case of weak-to-strong generalization. This paper demonstrates that models can perform well on hard, domain-specific questions when trained on easier questions. For example, a language model trained on 3rd grade questions scores almost as high on a college-level exam as a language model trained on college-level questions. These results are highly promising for eliciting strong capabilities out of superhuman models with weak examples. Our key takeaway is that, with the correct engineering solutions, we may be able to elicit answers to problems we don’t know the answer to (but the superhuman model does) by using labels from problems we do know. Our key criticism, which the authors acknowledge in their paper, is that a simpler explanation of the results may be that easy data increases the saliency of good answers already learned in pretraining. There is a disanalogy here, as the capabilities we want to elicit from future models will not be in their pretraining data. We may not get capability enhancements from fine-tuning or in-context learning on easy examples; instead, we may just be activating the right pretraining knowledge better.

    3. Source 3 presents “Vision Super Alignment”, applying weak-to-strong generalization to pretrained vision models by introducing a new loss function. Their results exceed OpenAI’s weak-to-strong vision results. Similar to OpenAI’s “auxiliary loss” term, the paper introduces its own loss function enabling nuanced supervision that allows the strong student to prioritize its own predictions over the supervisor’s.

    4. Source 7 approaches W2SG with debate. The authors have two instances of GPT-4 debate one another over a conclusion from a text, with a different, weaker language model acting as the “judge”. We conclude that the key point of this paper is to show that debate can improve the elicitation of knowledge from strong “expert” models when the supervisor does not have access to ground truth labels. Their key finding is that non-expert humans answer questions better after reading debates between expert LLMs, and that training expert LLMs to be more persuasive further improves judge accuracy. This paper also improves upon some of OpenAI’s initial results.

  2. Automated alignment research is a more modest goal than coming up with a “once and for all” solution to alignment.

    1. As pointed out in (source 16), creating automated alignment researchers doesn’t require generating solutions to core alignment challenges ourselves. We could focus on evaluating solutions.

    2. Eliciting ideas out of LLM-like systems is approachable even if the LLM has generalized beyond human capabilities (W2SG results).

  3. It is probable that solutions to core problems in alignment involve the type of knowledge that humans could produce, and that empirical progress can be made within the distribution of existing human ideas. If human-level alignment researchers suffice for empirical progress, we should not have to generalize far outside the distribution of existing human alignment ideas.

    1. The way I look at this is: there are vastly more alignment proposals out there than any one person could read. An LLM agent can consider, and is exposed to, orders of magnitude more experiment ideas than a human. Perhaps there are gems buried within the mountains of social media comments suggesting alignment experiments. It makes much more sense to use automated researchers, which are scalable and parallelizable, to run these experiments than to spend limited human scientist hours on them.

    2. Our team coordinator, Bogdan, takes this further: is this the type of knowledge that humans can produce? If the knowledge necessary for alignment solutions falls within the range of human capabilities, this approach becomes more promising, and it seems reasonable to believe that alignment progress can be made within the distribution of human knowledge. Better yet, if alignment solutions exist within our collective knowledge but are bottlenecked by the time needed for deliberation or experimentation (like 500+ scientists working for 10,000 years), then employing automated alignment researchers aligned via weak-to-strong generalization principles appears to be a highly promising approach. Conversely, if alignment is far beyond human comprehension rather than merely time-limited, this approach is less viable.

    3. Another teammate, Jaeson, adds that a superintelligent AI will likely use non-human data (data outside the distribution of internet text, as in unsupervised self-play systems like AlphaZero and MuZero). The hope is that weak-to-strong methods could let an automated supervisor guide the new data such a system learns from, and let the system generalize behavior, values, and goals from that supervisor.
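The auxiliary confidence loss mentioned in point 1.1 above can be sketched in a few lines. The following is a simplified numpy illustration of the general idea (mixing cross-entropy against the weak supervisor’s labels with cross-entropy against the student’s own hardened predictions), not OpenAI’s exact implementation; the function names and the fixed `alpha` weighting are our own assumptions.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, labels):
    # Mean negative log-likelihood of integer labels under `probs`.
    n = len(labels)
    return -np.log(probs[np.arange(n), labels] + 1e-12).mean()

def weak_to_strong_loss(student_logits, weak_labels, alpha=0.5):
    # Mix the loss against the weak supervisor's labels with the loss
    # against the student's own hardened (argmax) predictions, so a
    # confident student can overrule a mistaken supervisor.
    probs = softmax(student_logits)
    hardened = probs.argmax(axis=-1)
    return ((1 - alpha) * cross_entropy(probs, weak_labels)
            + alpha * cross_entropy(probs, hardened))
```

As we understand the original setup, the weight on the self-prediction term is increased over the course of training, so the student starts out following the weak labels and gradually comes to trust its own predictions more.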

Reasons for Pessimism

  1. Implementing weak-to-strong generalization techniques on superhuman models requires a “leap of faith”. When we apply this approach to real problems for which we lack ground truth labels, we are essentially trusting that the superhuman model is telling us the truth (source 22). Since we should expect superhuman models to recognize when we are taking this leap of faith, it is a particularly natural point for a model to take a sharp left turn (acting aligned or trustworthy until it is powerful enough not to need to).

  2. Evaluation can be very hard, even if it is easier than generation; this holds for ideas in science generally and in alignment specifically.

    1. For example, even the most capable AI researchers still cannot agree on whether there are existential or catastrophic risks from AI research (source 21). Humans have difficulty assessing complex issues.

    2. The challenge of evaluating research is illustrated by our own team’s disagreements on the utility of various papers in our reading list.

  3. A potential failure mode in some of the current literature is pretraining leakage.

    1. As an example, in the Allen Institute for AI easy-to-hard paper (source 2), easy data may just be better [than hard data] at eliciting knowledge [learned from pretraining] from a powerful model, as opposed to teaching the model anything new. However, this is still an interesting and useful result.

    2. As mentioned in “Reasons for Optimism” point #3, if humans could solve alignment given current knowledge and capabilities plus enough time (say, 100,000 years), this failure mode matters much less: a time constraint on alignment research is less binding if we can automate and parallelize the process of creating explanations and solutions.

  4. Another potential failure mode of the current literature: imitation problems

    1. Modern supervised learning is a form of empirical risk minimization.

      1. Given unlimited compute and data, supervised learning should eventually mimic its training data perfectly.

    2. Imitating human failures (incorrect labels) is a problem that may get worse over time, not better, if models simply become better at imitating humans.

  5. Generalization domain transfer has yet to be demonstrated

    1. We believe we should consider two types of generalization: difficulty generalization and domain generalization. Most of the current literature examines a weak training signal (easy data, a smaller model, etc.) being used to train a strong model within a single domain.

    2. Can models generalize a concept like “truth”? Can we expect honest LLM answers without using interpretability? (related to source 24)
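The imitation worry in failure mode #4 can be made concrete with a toy empirical risk minimizer: given enough capacity, a learner reproduces every label it was given, corrupted or not. The dataset and model below are illustrative assumptions, sketched with scikit-learn.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y_true = (X[:, 0] > 0).astype(int)

# Corrupt 20% of the labels, standing in for flawed human supervision.
y_noisy = y_true.copy()
flipped = rng.choice(len(y_noisy), size=40, replace=False)
y_noisy[flipped] ^= 1

# An unconstrained tree drives empirical risk to zero: it reproduces
# every label it was given, including the corrupted ones.
model = DecisionTreeClassifier(random_state=0).fit(X, y_noisy)
train_acc = model.score(X, y_noisy)  # 1.0: perfect imitation of flawed labels
true_acc = model.score(X, y_true)    # below 1.0: the label errors were learned too
```

A more capable imitator does not fix this; it memorizes the corrupted labels more faithfully, which is exactly the concern about future models imitating human mistakes.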

Future Work

  1. Domain-transfer generalization, i.e. generalizing concepts such as “truth”.

    1. The current literature (such as source 2) shows improvement in strong model performance with a weak reward signal within a particular domain (e.g. 3rd grade mathematics exam examples improving 12th grade exam responses).

  2. Reproduce the easy-to-hard experiment (source 2) by training a weak supervisor from scratch on ground truth labels, then training a much larger model on its predictions. Training the supervisor from scratch overcomes concerns about pretraining leakage: eliciting existing strong capabilities is an interesting result, but teaching strong capabilities with a weak signal would be even more interesting.

  3. Demonstrate a strong model generalizing goals, values, etc. from weak supervision. This has not yet been shown empirically.
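The reproduction proposed in item #2 can be sketched end to end on a synthetic task. Everything below (the dataset, model families, and split sizes) is an illustrative assumption using scikit-learn stand-ins; a real reproduction would use language models, as in source 2.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in task for "questions with ground truth labels".
X, y = make_classification(n_samples=3000, n_features=20, n_informative=10,
                           random_state=0)
X_weak, y_weak = X[:100], y[:100]      # small ground-truth set for the weak model
X_strong = X[100:2500]                 # pool the student sees only weak labels for
X_test, y_test = X[2500:], y[2500:]    # held-out ground truth for evaluation

# Weak supervisor: trained from scratch on ground truth, so its knowledge
# cannot come from pretraining leakage; it is weak because its training
# set is tiny.
weak = LogisticRegression(max_iter=1000).fit(X_weak, y_weak)

# Strong student: a larger model trained only on the weak supervisor's labels.
pseudo_labels = weak.predict(X_strong)
strong = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500,
                       random_state=0).fit(X_strong, pseudo_labels)

weak_acc = weak.score(X_test, y_test)
strong_acc = strong.score(X_test, y_test)
# Weak-to-strong generalization would show up as strong_acc > weak_acc;
# with simple models like these, the student may instead just imitate.
```

The interesting comparison is `strong_acc` against `weak_acc` on held-out ground truth: if the student reliably exceeds its supervisor in a setup where leakage is impossible by construction, that would be evidence of teaching, not just eliciting.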


We think that the empirical results of the original weak-to-strong generalization paper by OpenAI are not promising on their own. The real utility of their initial paper is in reframing the alignment problem as a tractable analogy. While this is not a solution to alignment, it is a problem that might have a technical solution, and working towards that technical solution might be straightforward. The OpenAI superalignment team laid the groundwork for empirical progress on weak-to-strong generalization, and a number of papers since have already improved on their results; this is clearly a useful proof of concept. Multiple papers we have reviewed could significantly improve automated alignment research. Furthermore, it would be a highly promising minimum viable product for alignment if automated alignment researchers were able to generalize concepts like “truth”. Finally, the usefulness of these results is highly dependent on what the first systems that can meaningfully contribute to alignment research look like; if they look like language models, this work has much higher utility.

In the worst-case scenario for the W2SG and easy-to-hard generalization (E2HG) papers, the more capable models already know the answers to difficult questions (questions an easy supervisor struggles with) via pretraining leakage. The subsequent fine-tuning or in-context learning via weak supervision might simply make the concept we are interested in eliciting more “salient”. However, this is still a useful result: it shows that weak models can elicit knowledge the strong model already has.

On the other hand, in the best case scenario strong models are able to generalize concepts like “truth” from data with known labels, and create labels for data where we do not know the labels.

It is currently difficult to tell which explanation better fits the literature on superalignment via weak-to-strong generalization. However, in both cases the following results are clear:

  1. Weak supervisors can elicit capabilities beyond their own, but not necessarily everything the stronger model knows.

  2. Improving weak-to-strong generalization seems tractable.

  3. Eliciting good answers to difficult alignment questions (ones where we do not have ground truth labels) might require using weak to strong generalization of some sort if we want our automated alignment researchers to share their (potentially superhuman) progress.

Reading List

Key Papers

1. OpenAI Weak to Strong Generalization Paper

2. Unreasonable Effectiveness of Easy Training Data—Allen Institute for AI

3. Vision Alignment: Weak to Strong Generalization in Vision models

Related Papers

4. Weak-to-Strong Data Filtering for Fast Instruction-Tuning

5. Distinguishing the Knowable from the Unknowable with Language Models

6. A Critical Evaluation of AI Feedback for Aligning Large Language Models

7. LLM Debate

8. Improving Weak-to-Strong Generalization with Hierarchical Mixture of Experts

9. Achieving Efficient Alignment through Weak-to-Strong Correction

Related LessWrong Posts

10. Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision

11. Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem

12. W2SG in 60 seconds

13. Greater Wrong post on W2S


14. OpenAI Forum Weak to Strong Generalization Presentation (available on forum.openai.com)

25. Inside View Video

26. Collin Burns Presentation

Blog Posts / Other

15. Jan Leike: Combining W2SG and SO

16. Jan Leike: An MVP for Alignment

17. OpenAI.com weak to strong research page

18. github code for weak to strong generalization

19. https://twitter.com/ESYudkowsky/status/1735455101404451186

20. Strong Uncontainability—Arbital

21. Improvethenews: Controversy over AI Risk

22. Alignment Forum: Iterated Distillation and Amplification

23. AISC Team #22 Description

24. How “Discovering Latent Knowledge in Language Models Without Supervision” Fits Into a Broader Alignment Scheme
