Corrigibility or DWIM is an attractive primary goal for AGI
While rereading the List of Lethalities (LoL), I was compelled by the argument against corrigibility. It’s really hard to make a goal of “maximize X, except if someone tells you to shut down”. I think the same argument applies to Christiano’s goal of achieving corrigibility through RL by rewarding correlates of corrigibility. If other things are rewarded more reliably, you may not get your AGI to shut down when you need it to.
But those arguments don’t apply if corrigibility in the broad sense is the primary goal. “Doing what this guy means by what he says” is a perfectly coherent goal, and a highly attractive one, for a few reasons. Perhaps “corrigibility” shouldn’t be used in this sense, and “do what I mean” (DWIM) is the better term; in any case the two are closely related. DWIM accomplishes corrigibility, and it has other advantages. I think it’s fairly likely to be the first goal someone actually gives an AGI.
“Do what I mean” sidesteps the difficulty of outer alignment, which is another point in the LoL. One common plan, which seems sensible, is to keep humans in the loop and have a Long Reflection to decide what we want. DWIM allows you to contemplate and change your mind as much as you like.
Of course, the problem here is: do what WHO means? We’d like an AGI that serves all of humanity, not just one guy or board of directors. And we’d like to not have power struggles.
But from the point of view of a team actually deciding what goal to give their shot at AGI, DWIM will be incredibly attractive for practical reasons. The outer alignment problem is hard. Specifying one person (or a few) to take instructions from is vastly simpler than deciding on and specifying a goal that captures all of human flourishing for all time, and you don’t want to trust an AGI to interpret that goal correctly. Interpreting DWIM is still fraught, but it is naturally self-correcting, and it becomes more useful as the AGI gets more capable: a smarter AGI will be better at understanding what you probably mean, and better at realizing when it’s not sure what you mean so it can ask for clarification.
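To make that clarification step concrete, here is a minimal sketch under my own assumptions (the function names, sampling scheme, and thresholds are placeholders, not anything specified in this post): sample several candidate readings of an instruction and ask rather than act when they disagree.

```python
# Minimal sketch (illustrative only): sample several candidate readings of an
# instruction and request clarification when they disagree.
from collections import Counter
from typing import Callable, Optional

def resolve_intent(
    instruction: str,
    interpret: Callable[[str, int], str],  # placeholder intent-inference model, seeded for diverse readings
    n_samples: int = 5,
    agreement: float = 0.8,
) -> Optional[str]:
    """Return a consensus interpretation, or None to signal 'ask the human first'."""
    readings = [interpret(instruction, seed) for seed in range(n_samples)]
    best, count = Counter(readings).most_common(1)[0]
    if count / n_samples >= agreement:
        return best
    return None  # readings conflict -> ask for clarification instead of acting
```

A more capable intent model just makes this loop cheaper: readings agree more often, and disagreement more reliably flags genuine ambiguity.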
This doesn’t at all address inner alignment. But when somebody thinks they have good-enough inner alignment to launch a goal-directed, sapient AGI, DWIM is likely to be the goal they’ll choose. This could be good or bad, depending on how well they’ve implemented inner alignment, and what type of people they are.
Agree, and I’ve had similar/related thoughts on how DWIM seems like a pretty natural target for LLM alignment: https://www.lesswrong.com/posts/wr2SxQuRvcXeDBbNZ/bogdan-ionut-cirstea-s-shortform?commentId=65czxJGyBuhqhBRex https://www.lesswrong.com/posts/wr2SxQuRvcXeDBbNZ/bogdan-ionut-cirstea-s-shortform?commentId=GRjfMwLDFgw6qLnDv
Thanks! This seems pretty obvious from this perspective, right? But there’s a lot of concern that the difficulty of outer alignment makes the overall alignment problem much harder. It seems like you can easily just punt on outer alignment, so I think it’s very likely that’s what people will do.
Have you looked into Value Learning? It’s basically “figure out what we mean, then do it”
I hadn’t seen value learning, thank you! I am familiar with Stuart Russell’s inverse reinforcement learning, which I think is very similar and closer to an implementable proposal. I am not enthusiastic about IRL. The proposal there is to infer a human’s value function from their behavior, or from the behavior they reward in their agents. To me this seems like a very clumsy solution relative to asking the human what they want when it’s unclear and the consequences are important. That’s the obvious and simple approach I’m proposing will likely be tried. It could be coupled with IRL.
My mental model here is not “figure out what we mean, then do it”, but “infer what I mean based on your models of human language, then check with me if your estimate of the consequences is past this threshold I set, or if you have conflicting models of what I might mean”. You would probably want some cumulative learning of likely intentions, but you would not want to relax the criteria for checking before executing consequential plans by very much.
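As a rough sketch of that loop, with placeholder names and thresholds of my own (the comment doesn’t specify an implementation):

```python
# Rough sketch of the check-before-acting loop described above; the estimator
# functions and threshold values are illustrative placeholders, not a real design.
from dataclasses import dataclass
from typing import Callable

@dataclass
class DwimPolicy:
    infer_intent: Callable[[str], str]         # model of what the instruction means
    estimate_impact: Callable[[str], float]    # rough score of a plan's consequences
    intent_confidence: Callable[[str], float]  # how sure the system is about its reading
    ask_human: Callable[[str], bool]           # returns True if the human approves
    impact_threshold: float = 0.3              # set by the principal; kept deliberately low
    confidence_floor: float = 0.9

    def act(self, instruction: str, execute: Callable[[str], None]) -> None:
        plan = self.infer_intent(instruction)
        # Check in with the human if the plan looks consequential or the reading is uncertain.
        needs_check = (
            self.estimate_impact(plan) > self.impact_threshold
            or self.intent_confidence(plan) < self.confidence_floor
        )
        if needs_check and not self.ask_human(f"I plan to: {plan}. Proceed?"):
            return  # human declined; do nothing
        execute(plan)
```

The point is the ordering: infer, check, and only then act; and per the above, the criteria for checking stay conservative rather than being relaxed much as intentions are learned.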
IRL or other value learning alone puts the weight of understanding human ethics/value functions on the AI system. Even if it works, current human ethics/value functions might be an extraordinarily bad outer alignment target. It could be that maximizing our revealed preferences leads to all-against-all competition or war, or to the elimination of humanity in favor of better fits to our inferred value function. We don’t know what we want, so we don’t know what we’d get from having an AGI figure out what we really want. See Moral Reality Check (a short story) and my comment on it. So I’d prefer we figure out what we want for ourselves, and I think that’s going to be a very common motivation among humans. The “long contemplation” suggestion appears to be a common one among people thinking about outer alignment targets.