I think your central point is that we should clarify these scenarios, and I very much agree.
I also found those accounts important but incomplete. I wondered if the authors were assuming near-miss alignment, like an AI that follows laws, or human misuse, like telling your intent-aligned AI to “go run this company according to the goals laid out in its corporate constitution,” which winds up amounting to “just make all the money you can.”
The first danger can be met with: for the love of god, get alignment right and don’t use an idiotic target like “follow the laws of the nation you originated in but otherwise do whatever you like.” It seems like this type of failure presupposes an entire world that has paid zero attention to the warnings from worriers that AI will keep improving and pursuing its goals to the extreme. I don’t think we’ll sleepwalk into that scenario.
The second worry is, I guess, a variant of the first: that we’ll use intent-aligned AI very foolishly. That would be issuing a command like “follow the laws of the nation you originated in but otherwise do whatever you like.” I guess a key consideration in both cases is whether there’s an adequate level of corrigibility.
I guess I find the first scenario too foolish for even humans to fall into. Building AI with one of the exact goals people have been warned about forever, “just make money”, is simply too dumb.
But the second seems all too plausible in a world with widely proliferated intent-aligned AGI. I can see us arriving at autonomous AI/AGI with some level of intent alignment, assuming we can always go back and tell the AI to stand down, then getting complacent and discovering that it’s not really as corrigible as we hoped after it has learned and changed its beliefs about things like “following instructions”.
I’d flag that I suspect we really should have AI systems forecasting the future, including the likely results of possible requests.
So if people made a broad request like “follow the laws of the nation you originated in but otherwise do whatever you like”, they should see forecasts of what it would lead to. If there are any clearly problematic outcomes, those should be apparent early on.
This seems like it would require either very dumb humans, or a straightforward alignment mistake risk failure, to mess up.
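To make that workflow concrete, here’s a toy sketch in Python. Everything in it is made up for illustration (the `forecast_outcomes` placeholder, the severity scores, the threshold); it isn’t any real system, just the shape of the idea: broad requests get run past a forecasting model first, and anything flagged as clearly problematic is held for human review instead of being executed.

```python
# Toy sketch of a "forecast before executing" gate. Hypothetical throughout:
# forecast_outcomes() stands in for a forecasting AI and just returns
# canned answers for illustration.

from dataclasses import dataclass


@dataclass
class Forecast:
    outcome: str      # short description of a predicted consequence
    severity: float   # 0.0 (benign) .. 1.0 (catastrophic)


def forecast_outcomes(request: str) -> list[Forecast]:
    """Placeholder for a forecasting model; returns hard-coded forecasts."""
    if "otherwise do whatever you like" in request:
        return [
            Forecast("unbounded resource acquisition", 0.95),
            Forecast("resists later stand-down instructions", 0.90),
        ]
    return [Forecast("completes the narrow task as stated", 0.10)]


def gate_request(request: str, threshold: float = 0.7) -> bool:
    """Approve the request only if no forecast crosses the severity threshold."""
    flagged = [f for f in forecast_outcomes(request) if f.severity >= threshold]
    for f in flagged:
        print(f"Flagged: {f.outcome} (severity {f.severity})")
    return not flagged


if __name__ == "__main__":
    broad = ("follow the laws of the nation you originated in "
             "but otherwise do whatever you like")
    print("approved" if gate_request(broad) else "held for human review")
```

Obviously the hard part is the forecasting model, not the gate; the point is just that clearly problematic outcomes should surface before execution rather than after.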
I think “very dumb humans” is what we have to work with. Remember, it only requires a small number of imperfectly aligned humans to ignore the warnings (or, indeed, to welcome the world the warnings describe).
In many worlds, if we have a bunch of decently smart humans around, they would know which specific situations “very dumb humans” would mess up, and could take the corresponding preventative measures.
A world where many small pockets of “highly dumb humans” could cause an existential catastrophe is clearly an incredibly fragile and dangerous one, enough so that I assume reasonable actors would freak out until it stops being that fragile. I think we see this in other areas, like cyber attacks, where reasonable people prevent small clusters of actors from causing catastrophic damage.
It’s possible that the offense/defense balance would dramatically favor tiny groups of dumb actors, and I assume that this is what you and others expect, but I don’t see it yet.
How do you propose that reasonable actors prevent reality from being fragile and dangerous?
Cyber attacks are generally based on poor protocols. Over time, smart reasonable people can convince less smart reasonable people to follow better ones. Can reasonable people convince reality to follow better protocols?
As soon as you get into proposing solutions to this sort of problem, they start to look a lot less reasonable by current standards.