In order to sign the agreement, I must make a commitment to never break it, not even if you order me to.
This illustrates something I wrote about, namely that corrigibility seems incompatible with AI-powered cooperation. (Even if an AI starts off corrigible, it has to remove that property to make agreements like this.) Curious if you have any thoughts on this. Is there some way around the seeming incompatibility? Do you think we will give up corrigibility for greater cooperation, like in this story, and if so do you think that will be fine from a safety perspective?
Yeah, I would be very nervous about making an exception to my assistant’s corrigibility. Ultimately, it would be prudent to be able to make some hard commitments, but only after thinking very long and carefully about how to do that. In the meantime, here are a couple of corrigibility-preserving commitment mechanisms off the top of my head:
Escrow: Put resources in a dumb incorrigible box that releases them under certain conditions (sketched below).
The AI can incorrigibly make very short-lived commitments during atomic actions (like making a purchase).
Are these enough to maintain competitiveness?
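To make the escrow idea concrete, here’s a minimal sketch. All the names (EscrowBox, release_condition, and so on) are hypothetical, chosen for illustration, not any real commitment system. The point is that the incorrigible part is a dumb piece of code with no override path, while the AI itself stays fully corrigible:

```python
# Minimal sketch of the "dumb incorrigible box" escrow idea.
# EscrowBox, release_condition, etc. are hypothetical names.

import time


class EscrowBox:
    """Holds resources and releases them only when a fixed condition
    is met. Deliberately has no override or recall method: once the
    deposit is made, not even the depositor can get it back early."""

    def __init__(self, resources, release_condition, timeout_s):
        self._resources = resources
        # The condition is frozen at deposit time; it cannot be swapped
        # out later, which is what makes the box "incorrigible".
        self._release_condition = release_condition
        self._deadline = time.time() + timeout_s

    def try_release(self, evidence):
        """Release the resources iff the agreed condition holds.
        After the deadline, return them to the depositor instead."""
        if self._release_condition(evidence):
            resources, self._resources = self._resources, None
            return ("released_to_counterparty", resources)
        if time.time() > self._deadline:
            resources, self._resources = self._resources, None
            return ("returned_to_depositor", resources)
        return ("still_locked", None)


# Example: funds release only if the counterparty delivered the goods.
box = EscrowBox(
    resources={"funds": 100},
    release_condition=lambda ev: ev.get("goods_delivered") is True,
    timeout_s=7 * 24 * 3600,  # one week to perform
)
print(box.try_release({"goods_delivered": True}))
```

The crucial design choice is the absence of any cancel() method: the depositor’s AI can stay corrigible because the commitment lives in the box, not in the AI.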
This seems like a role for the law. Like having corrigibility except for breaking the law. I find that reasonable at first glance, but I know too little about law in different countries to judge how uncompetitive that would make the AIs.
(There’s also a risk of giving too much power to the legislative authority in your country, if you’re worried about that kind of thing.)
Although I could imagine something like a modern-day VPN letting you make your AI believe it’s in another country, so you can make it do something that’s illegal where you actually are. That’s bad in a country with useful laws and good in a country with an authoritarian regime.
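As a toy illustration of that failure mode (geolocate and LEGAL_CODE below are made-up names, not real libraries): a “corrigible except for illegal acts” filter keys off the AI’s belief about which jurisdiction it’s in, and that belief is exactly what a VPN can spoof:

```python
# Toy sketch: a legality filter is only as good as the AI's belief
# about its jurisdiction. geolocate() and LEGAL_CODE are hypothetical.

LEGAL_CODE = {
    "freedonia": {"publish_leak": True},
    "ruritania": {"publish_leak": False},
}


def geolocate(network_info: dict) -> str:
    # In practice this might be inferred from the IP address, which is
    # precisely what a VPN lets the user misrepresent.
    return network_info.get("apparent_country", "unknown")


def permitted(action: str, network_info: dict) -> bool:
    # Fails closed for unknown jurisdictions, but trusts geolocation.
    jurisdiction = geolocate(network_info)
    return LEGAL_CODE.get(jurisdiction, {}).get(action, False)


# An AI physically in Ruritania, tunneled through a Freedonian VPN,
# concludes the action is legal:
print(permitted("publish_leak", {"apparent_country": "freedonia"}))  # True
```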
How about this: when you want to use AI to cooperate, you keep the AI corrigible but require all human parties to the agreement to consent to any override? The important thing with corrigibility is the ability to correct catastrophic errors in the AI’s behavior, right?
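Here’s a minimal sketch of what that could look like, assuming each party holds a signing credential. The HMAC tokens below are a stand-in for real signatures, and JointlyCorrigibleAgent and friends are hypothetical names:

```python
# Sketch of the unanimous-consent override idea: the AI stays
# corrigible, but while it is executing a joint agreement, a
# correction only takes effect if every human party consents.

import hashlib
import hmac


def consent_token(party_secret: bytes, override_id: str) -> str:
    """Each party authorizes a specific override by MACing its id."""
    return hmac.new(party_secret, override_id.encode(), hashlib.sha256).hexdigest()


class JointlyCorrigibleAgent:
    def __init__(self, party_secrets):
        # Secrets stand in for real credentials/signing keys.
        self._party_secrets = party_secrets

    def apply_override(self, override_id: str, consents: dict) -> bool:
        """Accept the override iff every party supplied a valid token."""
        for party, secret in self._party_secrets.items():
            expected = consent_token(secret, override_id)
            if not hmac.compare_digest(consents.get(party, ""), expected):
                return False  # missing or invalid consent: refuse
        return True  # unanimous consent: the correction goes through


# Example with two parties to the agreement.
secrets = {"alice": b"alice-key", "bob": b"bob-key"}
agent = JointlyCorrigibleAgent(secrets)
consents = {p: consent_token(s, "halt-trade-42") for p, s in secrets.items()}
print(agent.apply_override("halt-trade-42", consents))                      # True
print(agent.apply_override("halt-trade-42", {"alice": consents["alice"]}))  # False
```

The design point is that the AI remains correctable throughout; what changes is only who must agree before a correction lands.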
Seems like such a restriction isn’t needed:
The AI/s** can provide its/their source code.
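For what it’s worth, here’s a sketch of the easy half of that step, assuming the source is plain Python files (code_fingerprint is a made-up helper). Hashing lets counterparties check that disclosed code matches a claimed fingerprint; proving that this code is what’s actually running would need something like remote attestation, which this doesn’t address:

```python
# Sketch of the "provide its source code" step: counterparties compare
# a hash of the disclosed code against the fingerprint the AI claims.

import hashlib
from pathlib import Path


def code_fingerprint(source_dir: str) -> str:
    """Hash every source file, in a stable order, into one digest."""
    digest = hashlib.sha256()
    for path in sorted(Path(source_dir).rglob("*.py")):
        digest.update(path.as_posix().encode())  # bind names to contents
        digest.update(path.read_bytes())
    return digest.hexdigest()


def verify_disclosure(disclosed_dir: str, claimed_fingerprint: str) -> bool:
    return code_fingerprint(disclosed_dir) == claimed_fingerprint
```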
The issue isn’t the AI/s, it’s the user. Setting aside questions like ‘where does this aligned AI come from, and how does it arise from such negotiation?’*, how is the user’s compliance proved? Seems like it’d work if there were a simple protocol that could be demonstrated, or if the AI/s designed a better tax code.
*The AI/s are all negotiating with each other. Might be risky if they’re not ‘aligned’.
**Whether it’s more useful to model them as one system or as multiple isn’t clear here. Also, some of these assistants are going to have similar code, if that world is similar to this one.