Elitzur-Vaidman AGI testing
One thing that makes AI alignment super hard is that we only get one shot.
However, it’s potentially possible to get around this (though probably still very difficult).
The Elitzur-Vaidman bomb tester is a protocol (using quantum weirdness) by which a bomb may be tested with arbitrarily little risk. Its interest comes from the fact that it works even when the only way to test the bomb is to try detonating it. It doesn’t matter how the bomb works, as long as we can set things up so that it blocks a photon if the bomb is live and lets the photon through if it’s a dud. I won’t explain the details here, but you can roughly think of it as a way of blowing up a bomb in one Many-Worlds branch, but learning the result on other branches via quantum entanglement.
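For anyone who wants to see the numbers, here is a minimal sketch of the standard textbook setup, not anything specific to the AGI proposal. It assumes an idealized Mach-Zehnder interferometer with lossless 50/50 beam splitters and a bomb that absorbs the photon perfectly when live. The function names (`single_pass`, `zeno_explode_probability`) are just made up for illustration. The second function assumes the repeated-interrogation ("quantum Zeno") variant, which is what gets the risk arbitrarily low.

```python
import math


def single_pass(live: bool) -> dict:
    """Outcome probabilities for one photon through an idealized interferometer.

    Assumption: the bomb sits in the lower arm and, if live, absorbs
    (and is detonated by) any photon that takes that arm.
    """
    s = 1 / math.sqrt(2)

    def beam_splitter(a: complex, b: complex) -> tuple:
        # Symmetric 50/50 beam splitter acting on the two path amplitudes.
        return s * (a + 1j * b), s * (1j * a + b)

    upper, lower = beam_splitter(1 + 0j, 0j)

    p_explode = 0.0
    if live:
        # A live bomb measures which path the photon took.
        p_explode = abs(lower) ** 2
        lower = 0j  # keep only the (unnormalized) branch where nothing exploded

    dark, bright = beam_splitter(upper, lower)
    return {
        "explode": p_explode,
        "dark": abs(dark) ** 2,      # a click here never happens with a dud
        "bright": abs(bright) ** 2,  # inconclusive, can retry
    }


def zeno_explode_probability(cycles: int) -> float:
    """Explosion risk for a live bomb in the repeated-interrogation variant.

    Assumption: each cycle rotates the photon by pi / (2 * cycles) toward
    the bomb arm, and a live bomb absorbs that component every cycle.
    """
    theta = math.pi / (2 * cycles)
    return 1 - math.cos(theta) ** (2 * cycles)


if __name__ == "__main__":
    print("dud: ", single_pass(live=False))
    print("live:", single_pass(live=True))
    for n in (10, 100, 1000):
        print(n, "cycles -> explosion risk", zeno_explode_probability(n))
```

With a dud the dark detector never clicks, so a dark click certifies a live bomb without having set it off. In the single-pass version that still costs a 50% chance of an explosion per photon; the repeated version pushes the risk toward zero as the number of cycles grows.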
If the “bomb” is an AGI program, and it is live if it’s an unaligned yet functional superintelligence, then this provides a possible way to test the AGI without risking our entire future lightcone. This is still quite difficult, because unlike a bomb, a superintelligence will, by default, be motivated to allow/block the photon so that it looks like a dud. So we would need to figure out how to robustly get an honest signal from such an experiment, which still seems quite hard. But perhaps it’s easier than solving the full alignment problem before the first shot.
IMO this is an ‘additional line of defense’ boxing strategy rather than a simplification.
Note that in the traditional version, the ‘dud’ bit of the bomb can only be the trigger; a bomb that absorbs the photon but then explodes isn’t distinguishable from a bomb that absorbs the photon and then doesn’t explode (because of an error deeper in the bomb).
But let’s suppose the quantum computing folks can come up with something like this, where we keep some branches entangled and run analysis of the AI code in only one branch, causing an explosion there but affecting the total outcome in all branches. [It seems pretty implausible to me that you could maintain entanglement despite that much impact on the external world, but maybe it’s possible.] Then 1) as you point out, we need to ensure that the AI doesn’t realize what it needs to output in that branch, and 2) we need some sort of way to evaluate “did the AI pass our checks or not?”.
But 2) is “the whole problem”!
Thanks!
I think we get enough things referencing quantum mechanics that we should probably explain why that doesn’t work (if it doesn’t) rather than just downvoting and moving on.
It probably does work with a Sufficiently Powerful™ quantum computer, if you could write down a meaningful predicate which can be computed: https://en.wikipedia.org/wiki/Counterfactual_quantum_computation
Haha yeah, I’m not surprised if this ends up not working, but I’d appreciate hearing why.