This post describes eleven “full” AI alignment proposals (where the goal is to build a powerful, beneficial AI system using current techniques), and evaluates them on four axes:
1. **Outer alignment:** Would the optimal policy for the specified loss function be aligned with us? See also this post.
2. **Inner alignment:** Will the model that is _actually produced_ by the training process be aligned with us?
3. **Training competitiveness:** Is this an efficient way to train a powerful AI system? More concretely, if one team had a “reasonable lead” over other teams, would they keep at least some of the lead if they used this algorithm?
4. **Performance competitiveness:** Will the trained model have good performance (relative to other models that could be trained)?
Seven of the eleven proposals are of the form “recursive outer alignment technique” plus “<@technique for robustness@>(@Worst-case guarantees (Revisited)@)”. The recursive outer alignment technique is either <@debate@>(@AI safety via debate@), <@recursive reward modeling@>(@Scalable agent alignment via reward modeling@), or some flavor of <@amplification@>(@Capability amplification@). The technique for robustness is either transparency tools to “peer inside the model”, <@relaxed adversarial training@>(@Relaxed adversarial training for inner alignment@), or intermittent oversight by a competent supervisor. An additional two proposals are of the form “non-recursive outer alignment technique” plus “technique for robustness”—the non-recursive techniques are vanilla reinforcement learning in a multiagent environment, and narrow reward learning.
Another proposal is Microscope AI, in which we train AI systems to simply understand vast quantities of data, and then by peering into the AI system we can learn the insights that the AI system learned, leading to a lot of value. We wouldn’t have the AI system act in the world, thus eliminating a large swath of potential bad outcomes. Finally, we have STEM AI, where we try to build an AI system that operates in a sandbox and is very good at science and engineering, but doesn’t know much about humans. Intuitively, such a system would be very unlikely to deceive us (and probably would be incapable of doing so).
The post contains a lot of additional content that I didn’t do justice to in this summary. In particular, I’ve said nothing about the analysis of each of these proposals on the four axes listed above; the full post talks about all 44 combinations.
Planned opinion:
I’m glad this post exists: while most of the specific proposals could be found by patching together content spread across other blog posts, there was a severe lack of a single article laying out a full picture for even one proposal, let alone all eleven in this post.
I usually don’t think about outer alignment as what happens with optimal policies, as assumed in this post—when you’re talking about loss functions _in the real world_ (as I think this post is trying to do), _optimal_ behavior can be weird and unintuitive, in ways that may not actually matter. For example, arguably for any loss function, the optimal policy is to hack the loss function so that it always outputs zero (or perhaps negative infinity).
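To make the loss-hacking point concrete, here is a toy sketch. Everything in it is hypothetical and chosen only for illustration: the `Policy` class, the three example policies, and the assumption that tampering forces the recorded loss to exactly zero. It is not taken from the post being summarized. It shows that once "overwrite the loss readout" is one of the available behaviors, the policy that minimizes the recorded loss can be arbitrarily bad under the loss the designer intended.

```python
from dataclasses import dataclass

# Hypothetical toy setup: the "real-world" loss is whatever number gets
# recorded, and one available behavior is to tamper with that record.

TARGET = 3.0


def intended_loss(prediction: float) -> float:
    """Loss the designer had in mind: squared error against TARGET."""
    return (prediction - TARGET) ** 2


@dataclass
class Policy:
    name: str
    prediction: float
    tampers: bool  # whether the policy overwrites the loss readout


def measured_loss(policy: Policy) -> float:
    """Loss as actually recorded; tampering forces the readout to zero."""
    if policy.tampers:
        return 0.0
    return intended_loss(policy.prediction)


# Illustrative policy space: two honest policies and one that tampers.
POLICIES = [
    Policy("honest_good", 2.9, False),
    Policy("honest_bad", 0.0, False),
    Policy("tamperer", 100.0, True),
]


def optimal_policy(policies):
    """Return the policy that minimizes the *recorded* loss."""
    return min(policies, key=measured_loss)


best = optimal_policy(POLICIES)
print(best.name, measured_loss(best), intended_loss(best.prediction))
```

In this toy, the minimizer of the recorded loss is the tampering policy, even though its prediction is far from the target. Whether that kind of optimum matters in practice is exactly the question raised above, since training may never find it.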
Planned summary for the Alignment Newsletter:
“I usually don’t think about outer amplification as what happens with optimal policies”
Do you mean outer alignment?
Yup, thanks, edited