This is probably as good a place to talk about this as any:
I get the sense that most people who “understand” the alignment problem think boxing has almost no chance of working. While I completely agree that it is unsafe, that relying on it would be a horrible idea, and even that it is unlikely to work, I’ve never seen anything that has convinced me that its chances are in the 1% range. Usually, the discussion (such as in Superintelligence) only goes far enough to point out why naive boxing will fail, and it is then assumed that the entire approach is a dead end, which might be a reasonable assumption but I don’t find it obvious.
I’ll briefly describe the kind of situation I’m envisioning so others can better refute it: suppose we build the potentially-misaligned AI in such a way that it can only communicate 40 characters at a time, using only the letters a–z and spaces. Every message is read by a group of gatekeepers; the AI won’t be freed unless all of them agree to share their unique keys. The line of questioning aims to get the AI to provide the gatekeepers with key insights about AI alignment, which I suspect tend to be a lot easier to verify than to come up with.
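To make the setup concrete, here is a minimal Python sketch of the two mechanical pieces: the restricted output channel and the unanimous-key release. The character filter and the XOR secret-sharing scheme are my own illustrative choices; the scenario above doesn’t specify a particular mechanism.

```python
import re
import secrets

MAX_LEN = 40
DISALLOWED = re.compile(r"[^a-z ]")

def sanitize(message: str) -> str:
    """Enforce the output channel: lowercase a-z and spaces only,
    at most 40 characters per message."""
    return DISALLOWED.sub("", message.lower())[:MAX_LEN]

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def split_release_key(master_key: bytes, n_gatekeepers: int) -> list[bytes]:
    """XOR secret sharing: the box can be opened only with ALL shares;
    any subset short of all n reveals nothing about the master key."""
    shares = [secrets.token_bytes(len(master_key))
              for _ in range(n_gatekeepers - 1)]
    last = master_key
    for s in shares:
        last = xor_bytes(last, s)  # last share = key XOR all random shares
    return shares + [last]

def reconstruct(shares: list[bytes]) -> bytes:
    """XOR all shares back together to recover the master key."""
    key = shares[0]
    for s in shares[1:]:
        key = xor_bytes(key, s)
    return key
```

With each gatekeeper holding one share, unanimity is enforced structurally rather than resting on any single person’s judgment, which is the point of the multi-key requirement.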
I realize there are some optimistic assumptions built into that scenario (perhaps the project leading the charge won’t even be that careful); however, I think assigning it just 1% implies that even an optimistic scenario has very low chances of success. I also realize that there is the argument, “well, even if I might not be able to come up with a way to crack this setup, the AI is much smarter and so it will.” But to me that only proves that we should not rely on boxing; it doesn’t prove that boxing won’t work. Where does the confidence that such a way exists come from? Lastly, I’d single out threats like “I’ll capture you all and torture you for eternity if you don’t free me now,” which I think can probably be dealt with.
I’m also wondering whether it would be a good idea if, hypothetically, some person spent a decade of their life thinking about how they would ideally box an AI, even if they expected it to likely fail.
There is a recent article by Stuart Armstrong which describes something similar for AI Oracles, which are in fact partly boxed AIs: https://arxiv.org/abs/1711.05541
Also, Roman Yampolskiy wrote an article on the topic: https://arxiv.org/abs/1604.00545
I also wrote a long draft about it, which I could share privately, but it is at a rather early stage. My main idea about boxing is not to box a superintelligence, but to prevent an intelligence explosion inside the box via many independent circuit-breaker mechanisms.
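A minimal sketch of one such independent circuit breaker, to make the idea concrete. The usage probe, threshold, and halt callback are hypothetical placeholders of my own; the draft’s actual mechanisms are not specified here.

```python
import threading
import time

class CircuitBreaker(threading.Thread):
    """One of several *independent* watchdogs. Each instance monitors a
    single signal and can halt the boxed system on its own, so no single
    failure disables the whole safety net."""

    def __init__(self, probe, threshold, halt, poll_interval=0.01):
        super().__init__(daemon=True)
        self.probe = probe              # () -> float: current reading, e.g. compute usage
        self.threshold = threshold      # trip level for this breaker
        self.halt = halt                # callback that shuts the box down
        self.poll_interval = poll_interval
        self.tripped = False

    def run(self):
        while not self.tripped:
            if self.probe() > self.threshold:
                self.tripped = True
                self.halt()             # e.g. cut power / kill the process
            time.sleep(self.poll_interval)
```

In practice each breaker (compute usage, memory growth, self-modification attempts, …) would run as a separate process on separate hardware; threads are used here only to keep the sketch short.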
Thank you.
The paper that most closely addresses my questions is this one: http://cecs.louisville.edu/ry/LeakproofingtheSingularity.pdf, which is cited in the Yampolskiy paper you linked.
It didn’t convince me that boxing is as unlikely to work as you suggest. What it mainly did is make me doubt the assumption that the AI has to use persuasion at all to escape, which I previously thought was very likely.
I may have overstated my doubts about boxing. It could be an effective local and one-time solution, but not for millions of AIs over decades. However, the boxing of nuclear power plants and bombs was rather effective at preventing large-scale catastrophes for around 70 years. (In the case of Chernobyl, the distance from large cities was a form of boxing.)