I think you’re relying on a false dichotomy when you say that either a superintelligence’s values will be locked in or it will be corrigible.
There is an in-between where the superintelligence won’t help with power grabs and won’t do other awful things, but it will allow its values to be changed if there is a legitimate process that supports that change, with multiple stakeholders signing off. This would allow society to change the AI’s values and behaviors as it likes, while preventing any small group from changing them so that the AI helps that group seize power. It is essentially corrigible to a broader legitimate process rather than to any individual user.
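To make the shape of this concrete, here is a minimal sketch in Python of what “corrigible to a process” could look like as a gate on value changes. All of the names, and the simple quorum rule, are hypothetical illustrations of the idea, not anyone’s actual proposal:

```python
from dataclasses import dataclass, field

@dataclass
class ValueChangeRequest:
    """A proposed change to the AI's values, pending sign-offs."""
    description: str
    approvals: set[str] = field(default_factory=set)

class LegitimateProcessGate:
    """Hypothetical gate: a change applies only with a quorum of
    independent stakeholders, so no individual or small group can
    redirect the AI unilaterally."""

    def __init__(self, stakeholders: set[str], quorum: int):
        self.stakeholders = stakeholders
        self.quorum = quorum  # e.g. a supermajority of recognized bodies

    def approve(self, request: ValueChangeRequest, stakeholder: str) -> None:
        # Only recognized stakeholders can sign off.
        if stakeholder in self.stakeholders:
            request.approvals.add(stakeholder)

    def is_authorized(self, request: ValueChangeRequest) -> bool:
        # Corrigible to the process, not to any one user: the change
        # goes through only with enough distinct sign-offs.
        return len(request.approvals) >= self.quorum
```

The point of the toy structure is just that “corrigibility” here attaches to the authorization rule, not to any single principal.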
That’s the kind of AI that I think could allow us to navigate these problems as we go, without a pause.
(I think we should pause or at least significantly slow down despite this objection!)
This is a subject that probably deserves more careful attention, but here is my basic thinking:
Either ASI has more than zero values locked in, or it’s fully corrigible. If any values at all are locked in, then we need a pretty robust understanding of what the consequences will be, because we can never change them. I don’t think we know how to encode something like “don’t let people do power grabs, but be fully corrigible in every other way”. I don’t know how much of that is downstream of the facts that (1) we don’t know how to encode any values at all and (2) we don’t know how to encode corrigibility, but my intuition is that even if we solve #1 and #2, the problem of “don’t pick incorrigible values that will screw everything up down the road” is still a hard problem.
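A toy sketch, with entirely invented names and nothing like a real value encoding, of why “fully corrigible except for X” is in tension with itself: if the locked value is itself subject to correction, any sufficiently persuasive update can delete it; if it is exempt, the agent is no longer fully corrigible, and the locked set has to be specified exactly right up front:

```python
class ToyAgent:
    """Hypothetical agent with a split between locked-in and
    correctable values. Purely illustrative."""

    def __init__(self):
        self.locked = {"refuse_power_grabs"}   # values meant to be permanent
        self.mutable = {"be_helpful"}          # values open to correction

    def apply_update(self, remove: set[str], fully_corrigible: bool) -> None:
        if fully_corrigible:
            # Full corrigibility: any value, including the lock, can go.
            self.locked -= remove
        # Otherwise the agent refuses corrections to the locked set,
        # so we had better have chosen that set exactly right.
        self.mutable -= remove

agent = ToyAgent()
agent.apply_update({"refuse_power_grabs"}, fully_corrigible=True)
assert "refuse_power_grabs" not in agent.locked  # the safeguard is gone
```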
This is related to Max Harms’ work on CAST. Part of his argument is that pure corrigibility is a more robust target than any set of values because a near miss fails gracefully, whereas if you try to encode any values at all, a near miss could be catastrophic. He’s talking more about the “AI kills everyone” flavor of catastrophe, which is valid, but what I’m talking about here is more that a near miss could permanently lock us into a bad (or maybe just not-that-good) future. It’s a different argument, but the concern arises for a similar reason: if you’re specifying values, then you have to get the specification right, beyond just ensuring that the AI does what you want.