One argument against work on AI safety is that [it is hard to do good work without feedback loops](https://www.jefftk.com/p/why-global-poverty). So how could we get feedback loops? The most obvious approach is to actually try to align strong models right now, in order to get practice with aligning models in the future. This post fleshes out what such an approach might look like. Note that I will not be covering all of the points mentioned in the post; if you find yourself skeptical, you may want to read the full post, as your question might be answered there.
The author specifically suggests that we work on **aligning narrowly superhuman models** to make them more useful. _Aligning_ a model roughly means harnessing the full capabilities of the model and orienting these full capabilities towards helping humans. For example, GPT-3 presumably “knows” a lot about medicine and health. How can we get GPT-3 to apply this knowledge as well as possible, so that it is maximally useful in answering user questions about health?
_Narrowly superhuman_ means that the model has more knowledge or “latent capability” than either its overseers or its users. In the example above, GPT-3 almost certainly has more medical knowledge than laypeople, so it is at least narrowly superhuman at “giving medical advice” relative to laypeople. (It might even be so relative to doctors, given how broad its knowledge is.)
<@Learning to Summarize with Human Feedback@> is a good example of what this could look like: that paper attempted to “bring out” GPT-3’s latent capability to write summaries, and the resulting model outperformed the reference summaries written by humans. This sort of work will be needed for any new powerful model we train, and so it has a lot of potential for growing the field of people concerned about long-term risk.
Note that the focus here is on aligning _existing_ capabilities to make a model more useful, and so simply increasing capabilities doesn’t count. As a concrete example, just scaling up the model capacity or training data or compute would _not_ count as an example of “aligning narrowly superhuman models”, even though it might make the model more useful, since scaling increases raw capabilities without improving alignment. This makes it pretty different from what profit-maximizing companies would do by default. Such companies would bake in domain knowledge and simply scale up models in order to solve the easiest profitable problems. Work in this research area would instead look for general and scalable techniques, would not be allowed to scale up models, and would select interestingly difficult problems.
Why is this a fruitful area of research? The author points out four main benefits:

1. Most importantly, the more we align systems ahead of time, the more likely that researchers will be able to put thought and consideration into new issues like treacherous turns, rather than spending all their time putting out fires.
2. We can build practical know-how and infrastructure for alignment techniques like learning from human feedback.
3. As the world gets progressively faster and crazier, we’ll have better AI assistants helping us to navigate the world.
4. It improves our chances of discovering or verifying a long-term or “full” alignment solution.
I am very sympathetic to the argument that we should be getting experience with aligning powerful models right now, and would be excited to see more work along these lines. As the post mentions, I personally see this sort of work as a strong baseline, and while I currently think that the conceptual work I’m doing is more important, I wouldn’t be surprised if I worked on a project in this vein within the next two years.
Planned summary for the Alignment Newsletter:
Planned opinion: