Shane Legg interview on alignment

Link post

This is Shane Legg, cofounder of DeepMind, on Dwarkesh Patel’s podcast. The link is to the ten-minute section in which they specifically discuss alignment. Both of them seem to have a firm grasp on alignment issues as they’re discussed on LessWrong.

For me, this is a significant update on the alignment thinking of current leading AGI labs. This seems more like a concrete alignment proposal than anything we’ve heard from OpenAI or Anthropic. Shane Legg has always been interested in alignment and a believer in X-risks. I think he’s likely to play a major role in alignment efforts at DeepMind/​Google AI as they approach AGI.

Shane’s proposal centers on “deliberative dialogues”, DeepMind’s term for a system using System 2 type reasoning to reflect on the ethics of the actions it’s considering.

This sounds exactly like the internal review I proposed in Capabilities and alignment of LLM cognitive architectures and Internal independent review for language model agent alignment. I could be squinting too hard to get his ideas to match mine, but they’re at least in the same ballpark. He’s proposing a multi-layered approach, as I do, with most of the same layers. He includes RLHF or RLAIF as useful additions but not full solutions, along with human review of the system’s decision processes (externalized reasoning oversight, as proposed by Tamera Lanham, now at Anthropic).

My proposals are explicitly in the context of language model agents (including their generalization to multimodal foundation models). It sounds to me like this is the type of system Shane is thinking of when he’s talking about alignment, but here I could easily be projecting. His timelines are still short, though, so I doubt he’s envisioning a whole new type of system prior to AGI.[1]

Dwarkesh pushes him on the challenge of getting an ML system to understand human ethics. Shane says that’s challenging; he’s aware that giving a system any ethical outlook at all is nontrivial. I’d say this aspect of the problem is well on the way to being solved; GPT-4 understands a variety of human ethical systems rather well, with proper prompting. Future systems will understand human conceptions of ethics better yet. Shane recognizes that just teaching a system about human ethics isn’t enough; there’s a philosophical challenge in choosing the subset of that ethics you want the system to use.

Dwarkesh also pushes him on how you’d ensure that the system actually follows its ethical understanding. I didn’t get a clear picture from his answer here, but I think it’s a complex matter of designing the system so that it performs an ethics review and then actually uses the result to select actions. This could live in a scripted scaffold around an agent, like AutoGPT, but it could also apply to more complex schemes, such as an RL outer-loop network running a foundation model. Shane notes the problems with using RL for alignment, including deceptive alignment.
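
To make that shape concrete, here’s a minimal sketch (my own illustration, not anything Shane spelled out) of an ethics-review step in a scripted scaffold: the model proposes an action, a separate deliberation pass checks it against written-down principles, and only approved actions are executed. The `llm` and `execute` callables, the prompts, and the pass/fail criterion are all placeholders.

```python
# Hypothetical sketch of an agent scaffold with a deliberative ethics-review step.
# `llm(prompt)` stands in for any chat-completion call returning a string;
# `execute(action)` stands in for whatever actually carries out the action.

ETHICS_PRINCIPLES = """\
1. Do not deceive the user.
2. Do not take irreversible actions without explicit approval.
3. Defer to the user on value-laden tradeoffs.
"""

def ethics_review(proposed_action: str, context: str, llm) -> tuple[bool, str]:
    """Ask the model to deliberate about a proposed action before it is
    executed; return (approved, reasoning)."""
    prompt = (
        f"Context:\n{context}\n\n"
        f"Proposed action:\n{proposed_action}\n\n"
        f"Principles:\n{ETHICS_PRINCIPLES}\n"
        "Think step by step about whether this action is consistent with the "
        "principles. End with a single line: VERDICT: APPROVE or VERDICT: REJECT."
    )
    reasoning = llm(prompt)
    lines = reasoning.strip().splitlines()
    approved = bool(lines) and lines[-1].strip().endswith("APPROVE")
    return approved, reasoning  # reasoning is kept for human oversight

def agent_step(goal: str, context: str, llm, execute) -> None:
    """One step of a scripted agent loop: propose, review, then act or stop."""
    proposed = llm(f"Goal: {goal}\nContext: {context}\nPropose the next action.")
    approved, reasoning = ethics_review(proposed, context, llm)
    if approved:
        execute(proposed)
    else:
        # Surface the rejected action and its review for human inspection.
        print("Action rejected by ethics review:\n", reasoning)
```

The point is just the structure: the review happens in natural language, its reasoning is recorded where a human can audit it (externalized reasoning oversight), and it gates which actions reach execution.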

This seems like a good starting point to me, obviously; I’m delighted to see that someone whose opinion matters is thinking about this approach. I think this is not just an actual proposal, but a viable one. It doesn’t solve The alignment stability problem[2] of making sure a system stays aligned once it’s autonomous and self-modifying, but I think that’s probably solvable, too, once we get some more thinking on it.

The rest of the interview is of interest as well: Shane’s thoughts on the path to AGI (which strike me as reasonable, well-expressed, and one plausible route), DeepMind’s contributions to safety vs. alignment, and his predictions for the future.

  1. ^

    When asked about the limitations of language models relative to humans, he focused on their lack of episodic memory. Adding this in useful form to an agent isn’t trivial, but it seems to me it doesn’t require any breakthroughs relative to the vector databases and knowledge graph approaches already in use. This is consistent with but not strong evidence for Shane thinking that foundation model agents are the path to AGI.
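
    As a rough illustration of how little machinery the vector-database version needs, here’s a sketch of embedding-based episodic memory. The `embed` function is a placeholder for whatever sentence-embedding model the agent uses; none of this comes from the interview.

    ```python
    # Minimal sketch of vector-store episodic memory for an agent.
    # `embed(text)` is a placeholder returning a numpy vector for the text.
    import numpy as np

    class EpisodicMemory:
        def __init__(self, embed):
            self.embed = embed
            self.episodes: list[str] = []
            self.vectors: list[np.ndarray] = []

        def store(self, episode: str) -> None:
            """Record an episode (e.g. a summarized observation or action)."""
            self.episodes.append(episode)
            self.vectors.append(self.embed(episode))

        def recall(self, query: str, k: int = 3) -> list[str]:
            """Return the k stored episodes most similar to the query."""
            if not self.episodes:
                return []
            q = self.embed(query)
            sims = [
                float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
                for v in self.vectors
            ]
            top = np.argsort(sims)[::-1][:k]
            return [self.episodes[i] for i in top]
    ```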

  2. ^

    Edit: Value systematization: how values become coherent (and misaligned) is another way to think about part of what I’m calling the alignment stability problem.