I added the summary you suggested.
As I was exploring these ideas, I came to the conclusion that getting alignment right is in fact a good deal easier than I had previously been assuming. In particular, just as Value Learning has a basin of attraction to alignment, I am now of the opinion that almost any approximation to alignment (including DWIM) should have one too, even a quite crude one, so long as the AI understands that we are evolved while it was constructed by us, that we’re not yet perfect at this so its design could be flawed, and so long as it is smart enough to figure out the consequences of this.
Brief experiments show that GPT-4 knows way more than that, so I’m pretty confident it’s already inside the basin of attraction.
The standard response, which I agree with, is that knowing what we want is different than wanting what we want. See The genie knows, but doesn’t care.
I do think there are ways to point the wanting slot to the knowing part; see The (partial) fallacy of dumb superintelligence and Goals selected from learned knowledge: an alternative to RL alignment for an elaboration on how we might do this in several types of AGI designs.
This is easier than I’d thought, but I wouldn’t call it easy. In particular, there are still lots of ways to screw it up, particularly under pressure from a Molochian competition surrounding the creation of the first AGI(s).
Now I understand why you’re calling it a basin of attraction: if its value function is to do what you want (defined somehow), and it doesn’t know what that is, it will work to find out what it is. This idea has been discussed by Rohin Shah; I saw it in this dialogue with Yudkowsky around the [Yudkowsky][13:39] mark. Paul Christiano has discussed this scheme as well, along with others.
I propose something similar but simpler: don’t have a system try to do what you want; just have it do what you say. I’m calling this do what I mean and check. The idea is that we get more opportunities for correction if it’s just trying to follow one relatively limited instruction at a time, and it doesn’t do anything without telling you what it’s going to do and you giving approval. This still isn’t foolproof, but it seems to further widen the target, and allow us to participate in making the basin of alignment effectively wider.
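One way to picture the loop described above is the toy sketch below. It is only an illustration under loud assumptions: the names `ApprovalGatedAgent`, `plan_for`, `approve`, and `run_session` are made up here, the "plan" is just a string, and "executing" is just writing to a log. It shows the shape of the check (state the plan, wait for approval, handle one instruction at a time) and says nothing about how to make a real system want to honor that check.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ApprovalGatedAgent:
    """Toy 'do what I mean and check' loop; every name here is hypothetical."""
    plan_for: Callable[[str], str]   # turns one instruction into a stated plan
    approve: Callable[[str], bool]   # the human's yes/no answer to that plan
    log: List[str] = field(default_factory=list)

    def handle(self, instruction: str) -> bool:
        plan = self.plan_for(instruction)      # say what it is going to do first
        if not self.approve(plan):             # nothing happens without approval
            self.log.append(f"REJECTED: {plan}")
            return False
        self.log.append(f"EXECUTED: {plan}")   # stand-in for actually acting
        return True


def run_session(agent: ApprovalGatedAgent, instructions: List[str]) -> List[bool]:
    """Handle instructions one at a time, so each one is a chance to correct."""
    return [agent.handle(instruction) for instruction in instructions]


if __name__ == "__main__":
    agent = ApprovalGatedAgent(
        plan_for=lambda instruction: f"plan to {instruction}",
        approve=lambda plan: "delete" not in plan,  # a very crude human stand-in
    )
    print(run_session(agent, ["tidy the notes", "delete the backups"]))
    print(agent.log)
```

The point of keeping each instruction small is visible even in the toy: a rejected plan costs one step, and the human sees every plan before anything is done.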
So far reception to this post seems fairly mixed, with some upvotes and slightly more downvotes. So apparently I haven’t made the case in a way most people find conclusive — though as yet none of them have bothered to leave a comment explaining their reasons for disagreement. I’m wondering if I should do another post working through the argument in exhaustive detail, showing each of the steps, what facts it relies upon, and where they come from.
I think there are big chunks of the argument missing, which is why I’m commenting. I think those chunks are found in the posts I mentioned. This post focuses on what we’d want an AGI to do and why, and its understanding of that. But the much more debated and questionable step is how to make sure that it wants to do what we want.