Really well-written post.
One thing that seems under-discussed to me is how we might get help from a superintelligent AGI in creating systems for which we have more assurance that they are aligned (as a whole). One reason I think this is under-discussed is that even if we think we have succeeded with alignment, we should look for ways to use a superintelligence to verify that this is the case and add extra layers of assurance (finding the least risky methods for doing this first, and going about it in a stepwise and iterative manner).
I think that if such plans are laid out in more detail beforehand (before some team develops AGI/superintelligence, I mean), and people try to minimize the degree to which such plans are “handwavy”, this may make teams more apt to use techniques/procedures/strategies that can be helpful (compared to if they are improvising).
Have started writing about this here if you are interested (but part 2 and 3 will probably be more substantial than part 1): https://www.lesswrong.com/posts/ZmZBataeY58anJRBb/agi-assisted-alignment-part-1-introduction
Though it may well be (not committing in either direction, but it seems plausible) that to even get to the stage where you give a superintelligent AI questions/requests (without it beforehand hacking itself onto the internet, or that sort of thing), people would need to exhibit more of a security mindset than they are likely to.
This doesn’t make sense to me. The superintelligence has to already be aligned in order to want to help you solve alignment. Otherwise you’re basically building its successor.
Well, if you start out with a superintelligence that you have good reasons to think is fully aligned, then that is certainly a much better situation to be in (and that’s an understatement)! Mentioning that just to make it clear that even if we see things differently, there is partial agreement :)
Let’s imagine a superintelligent AI, and let’s describe it (a bit anthropomorphically) as not “wanting” to help me solve alignment. Let’s say that instead what it “wants” is to (for example) maximize its reward signal, and that the best way to maximize its reward signal would be to exterminate humanity and take over the world (for reasons Eliezer and others have outlined). Well, in that case it will be looking for ways to overthrow humanity, but if it isn’t able to do that it may want to do the next best thing (providing outputs that its operators respond to with a high reward signal).
So it may well prefer to “trick” me. But if it can’t “trick” me, it may prefer to give me answers that seem good to me (rather than answers that I recognize as clearly bad and evasive).
Machine learning techniques will tend to select for systems that do things that seem impressive and helpful. Unfortunately this does not guarantee “deep” alignment, but I presume that it will select for systems that at least seem aligned on the surface.
There are lots of risks and difficulties involved with giving questions/requests to AIs. But there are more and less dangerous ways of interacting with a potentially unaligned AGI, and questions/requests vary a lot in how easy or hard it is to verify whether the answers give us what we want. The techniques I will outline in the series are intended to minimize the risk of being “tricked”, and I think they could get us pretty far, but I could be wrong somehow, and it’s a long/complicated discussion.
Yeah, this sounds extremely dangerous and extremely unlikely to work, but I hope I’m wrong and you’ve found something potentially useful.
I think there are various very powerful methods that can be used to make it hard for an AGI-system to not provide what we want in the process of creating an aligned AGI-system. But I don’t disagree with what you say about it being “extremely dangerous”. I think one argument in favor of the kinds of strategies I have in mind is that they may help give an extra layer of security/alignment-assurance, even if we think we have succeeded with alignment beforehand.