OpenAI’s strategy, as of the publication of that post, centers on scalable alignment approaches. Their philosophy is to take an empirical and iterative approach[1] to the alignment problem. Their strategy is essentially cyborgism: create AI models that are capable and aligned enough to accelerate alignment research, so that those models can in turn help align even more capable models.[2]
Their research focus is on scalable ways to steer models[3], which in practice means that the core of their strategy is RLHF. They don’t expect RLHF to be sufficient on its own, but it is a necessary building block for the other scalable alignment strategies they are pursuing[4].
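To make the RLHF building block concrete, here is a minimal sketch of the preference-learning step at its core. This is an illustration, not OpenAI’s implementation: real systems train a neural reward model on many human preference comparisons and then optimize the policy against it with an algorithm like PPO, but the Bradley–Terry-style loss below captures the basic idea of turning “the human preferred response A over response B” into a training signal.

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry style pairwise loss: train the reward model so the
    human-preferred response scores higher than the rejected one.
    Computes -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss is small when the reward model already agrees with the human
# preference, and large when it ranks the rejected response higher:
assert preference_loss(2.0, -1.0) < preference_loss(-1.0, 2.0)
```

Minimizing this loss over a dataset of human comparisons yields a reward model, which then stands in for the human when fine-tuning the policy.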
They intend to augment RLHF with scaled-up, AI-assisted evaluation, ensuring RLHF isn’t bottlenecked by a lack of accurate evaluation data for tasks too onerous for unassisted humans to evaluate[5].
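The structure of evaluation assistance can be sketched in a few lines. Everything here is hypothetical stand-in code (the real pipeline uses trained critique models, not keyword checks); the point is the shape of the workflow: the human no longer audits the raw output, they verify machine-surfaced candidate flaws, which is a much easier task for outputs too complex to check unaided.

```python
def critique_model(output: str) -> list[str]:
    """Stand-in for a trained critique model that surfaces candidate
    flaws in an output (here: a trivial keyword check)."""
    return [line for line in output.splitlines() if "FIXME" in line]

def assisted_label(output: str, human_verifies) -> bool:
    """The human only checks the surfaced flaws, accepting the output
    if none of them hold up under verification."""
    candidate_flaws = critique_model(output)
    return not any(human_verifies(flaw) for flaw in candidate_flaws)

# A human who confirms every surfaced flaw rejects flawed outputs
# and accepts clean ones, without reading everything themselves:
assert assisted_label("ok line\nFIXME: leak", lambda f: True) is False
assert assisted_label("ok line\nall good", lambda f: True) is True
```

The labels produced this way can then feed back into RLHF, which is what makes the evaluation step scalable.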
Finally, they intend to use these partially-aligned models to do alignment research, since they anticipate that alignment approaches that work for low-capability models may not be adequate for more capable ones.[6] The AI-based evaluation tools would serve both to RLHF-align models and to help humans evaluate alignment research produced by these LLMs (this is the cyborgism part of the strategy).[7]
The “Limitations” section of their blog post clearly points out the vulnerabilities in their strategy:
Their approach uses one black box (scalable evaluation models) to align another black box (large LLMs being RLHF-aligned), a strategy I am pessimistic about, although it is probably good enough for sufficiently low-capability models
They currently neglect non-Godzilla strategies such as interpretability research and robustness (that is, robustness to distribution shift and adversarial attacks; see Stephen Casper’s research for an idea of this), though they intend to hire researchers so their portfolio includes investment in these directions
They may be wrong in their bet that they can create AI models that are partially aligned and helpful for alignment research, yet not so capable that they can perform pivotal acts. If so, the pivotal acts achieved will only partially match the intent of the AI’s wielder and will probably not lead to a good ending.
We take an iterative, empirical approach: by attempting to align highly capable AI systems, we can learn what works and what doesn’t, thus refining our ability to make AI systems safer and more aligned.
We believe that even without fundamentally new alignment ideas, we can likely build sufficiently aligned AI systems to substantially advance alignment research itself.
At a high-level, our approach to alignment research focuses on engineering a scalable training signal for very smart AI systems that is aligned with human intent.
We don’t expect RL from human feedback to be sufficient to align AGI, but it is a core building block for the scalable alignment proposals that we’re most excited about, and so it’s valuable to perfect this methodology.
RL from human feedback has a fundamental limitation: it assumes that humans can accurately evaluate the tasks our AI systems are doing. Today humans are pretty good at this, but as models become more capable, they will be able to do tasks that are much harder for humans to evaluate (e.g. finding all the flaws in a large codebase or a scientific paper). Our models might learn to tell our human evaluators what they want to hear instead of telling them the truth.
There is currently no known indefinitely scalable solution to the alignment problem. As AI progress continues, we expect to encounter a number of new alignment problems that we don’t observe yet in current systems. Some of these problems we anticipate now and some of them will be entirely new.
We believe that finding an indefinitely scalable solution is likely very difficult. Instead, we aim for a more pragmatic approach: building and aligning a system that can make faster and better alignment research progress than humans can.
We believe that evaluating alignment research is substantially easier than producing it, especially when provided with evaluation assistance. Therefore human researchers will focus more and more of their effort on reviewing alignment research done by AI systems instead of generating this research by themselves. Our goal is to train models to be so aligned that we can off-load almost all of the cognitive labor required for alignment research.
Sidenote: I like how OpenAI ends their blog posts with an advertisement for positions they are hiring for, or programs they are running. That’s a great strategy to advertise to the very people they want to reach.
2022-08; Jan Leike, John Schulman, Jeffrey Wu; Our Approach to Alignment Research