Thanks, Evan, for an excellent overview of the alignment problem as seen from within Anthropic. Chris Olah’s graph showing perspectives on alignment difficulty is indeed a useful visual for this discussion. Another image I’ve shared lately relates to the challenge of building inner alignment, illustrated by Figure 2 from Contemplative Artificial Intelligence:
In the images, the blue arrows indicate our efforts to maintain alignment of AI as capabilities advance through AGI to ASI. In the left image, we see the case of models that generalize in misaligned ways—the blue constraints (guardrails, system prompts, thin efforts at RLHF, etc.) fail to keep the advancing AI aligned. The right image shows the happier result, where training, architecture, interpretability, scalable oversight, etc. contribute to a ‘wise world model’ that maintains alignment even as capabilities advance.
I think the Anthropic probability distribution of alignment difficulty seems correct—we probably won’t get alignment by default from advancing AI, but, as you suggest, with serious, concerted effort we can maintain alignment through AGI. What’s critical is to use techniques like generalization science, interpretability, and introspective honesty to gauge whether we are building towards AGI capable of safely automating alignment research towards ASI. To that end, metrics that allow us to determine whether alignment is actually closer to P-vs-NP in difficulty are crucial, and efforts from METR, UK AISI, NIST, and others can help here. I’d like to see more ‘positive alignment’ papers such as Cooperative Inverse Reinforcement Learning, Corrigibility, and AssistanceZero: Scalably Solving Assistance Games, as detecting when an AI is positively aligned internally is critical to getting to the ‘wise world model’ outcome.
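To make concrete what I mean by the CIRL / assistance-game flavor of ‘positive alignment’, here’s a toy sketch of the core loop: the assistant doesn’t know the human’s reward function, maintains a belief over it, updates that belief from the human’s observed (noisily rational) choices, and then acts to maximize the human’s expected reward under that belief. This is my own minimal illustration with made-up reward profiles (`theta_A`, `theta_B`), not code from any of those papers.

```python
# Toy assistance-game sketch: Bayesian reward inference + acting on the
# inferred human preferences. Assumed/hypothetical setup, not from the papers.
import numpy as np

rng = np.random.default_rng(0)

# Two candidate reward functions over three actions; the assistant starts
# with a uniform belief over which one the human actually holds.
REWARDS = {
    "theta_A": np.array([1.0, 0.0, 0.5]),  # hypothetical preference profile A
    "theta_B": np.array([0.0, 1.0, 0.5]),  # hypothetical preference profile B
}
belief = {name: 0.5 for name in REWARDS}

TRUE_THETA = "theta_A"  # known to the human, hidden from the assistant
BETA = 3.0              # human is noisily rational (softmax temperature)


def softmax_policy(reward: np.ndarray) -> np.ndarray:
    """Probability the human picks each action, proportional to exp(beta * reward)."""
    logits = BETA * reward
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()


def human_act(theta: str) -> int:
    """Human samples an action from its noisily rational policy."""
    return int(rng.choice(len(REWARDS[theta]), p=softmax_policy(REWARDS[theta])))


def update_belief(belief: dict, action: int) -> dict:
    """Bayesian update: how likely was this action under each reward hypothesis?"""
    posterior = {
        name: belief[name] * softmax_policy(reward)[action]
        for name, reward in REWARDS.items()
    }
    z = sum(posterior.values())
    return {name: p / z for name, p in posterior.items()}


def assistant_act(belief: dict) -> int:
    """Assistant maximizes the *human's* expected reward under its current belief."""
    expected = sum(belief[name] * REWARDS[name] for name in REWARDS)
    return int(np.argmax(expected))


# Watch a few human demonstrations, then act on the human's behalf.
for _ in range(10):
    belief = update_belief(belief, human_act(TRUE_THETA))

print("posterior:", {k: round(v, 3) for k, v in belief.items()})
print("assistant chooses action:", assistant_act(belief))
```

The point of the sketch is the shape of the problem: the assistant’s objective is defined in terms of an uncertain human reward, so evidence about what the human wants directly changes what the assistant does—which is the internal, ‘positively aligned’ structure I’d like us to get better at detecting.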