Conversation with Eliezer: What do you want the system to do?
This is a write-up of a conversation I overheard between Eliezer and some junior alignment researchers. Eliezer reviewed this and gave me permission to post this, but he mentioned that “there’s a lot of stuff that didn’t get captured well or accurately.” I’m posting it under the belief that it’s better than nothing.
TLDR: People often work on alignment proposals without having a clear idea of what they actually want an aligned system to do. Eliezer thinks this is bad. He claims that people should start with the target (what do you want the system to do?) before getting into the mechanics (how are you going to get the system to do this?).
I recently listened in on a conversation between Eliezer and a few junior alignment researchers (let’s collectively refer to them as Bob). This is a paraphrased/editorialized version of that conversation.
Bob: Let’s suppose we had a perfect solution to outer alignment. I have this idea for how we could solve inner alignment! First, we could get a human-level oracle AI. Then, we could get the oracle AI to build a human-level agent through hardcoded optimization. And then--
Eliezer: What do you want the system to do?
Bob: Oh, well, I want it to avoid becoming a mesa-optimizer. And you see, the way we do this, assuming we have a perfect solution to outer alignment is--
Eliezer: No. What do you want the system to do? Don’t tell me about the mechanics of the system. Don’t tell me about how you’re going to train it. Tell me about what you want it to do.
Bob: What… what I want it to do. Well, uh, I want it to not kill us and I want it to be aligned with our values.
Eliezer: Aligned with our values? What does that mean? What will you actually have this system do to make sure we don’t die? Does it have to do with GPUs? Does it have to do with politics? Tell me what, specifically, you want the system to do.
Bob: Well wait, what if we just had the system find out what to do on its own?
Eliezer: Oh okay, so we’re going to train a superintelligent system and give it complete freedom over what it’s supposed to do, and then we’re going to hope it doesn’t kill us?
Bob: Well, um…
Eliezer: You’re not the only one who has trouble with this question. A lot of people find it easier to think about the mechanics of these systems. Oh, if we just tweak the system in these ways—look! We’ve made progress!
It’s much harder to ask yourself, seriously, what are you actually trying to get the system to do? This is hard because we don’t have good answers. This is hard because a lot of the answers make us uncomfortable. This is hard because we have to confront the fact that we don’t currently have a solution.
This happens with start-ups as well. You’ll talk to a start-up founder and they’ll be extremely excited about their database, or their engine, or their code. And then you’ll say “cool, but who’s your customer?”
And they’ll stare back at you, stunned. And then they’ll say “no, I don’t think you get it! Look at this—we have this state-of-the-art technique! Look at what it can do!”
And then you ask again, “yes, great, but who is your customer?”
With AI safety proposals, I first want to know who your customer is. What is it that you actually want your system to be able to do in the real world? After you have specified your target, you can tell me about the mechanics, the training procedures, and the state-of-the-art techniques. But first, we need a target worth aiming for.
Questions that a curious reader might have, which are not covered in this post:
Why does Eliezer believe this?
Is it never useful to have a better understanding of the mechanics, even if we don’t have a clear target in mind?
Do the mechanics of the system always depend on its target? Or are there some improvements in mechanics that could be robustly good across many (or all) targets?
After the conversation, Eliezer mentioned that he often finds himself repeating this when he hears about alignment ideas. I asked him if he had written this up. He said no, but maybe someone like me should write it up.