The authors propose to predict how well AI alignment schemes will work in the future by doing experiments today with humans playing the part of AIs, in large part to obtain potential negative results that would tell us which alignment approaches to avoid. For example, if DEBATE works poorly with humans playing the AI roles, then we might predict that it will also work poorly with actual AIs. This reminds me of Eliezer’s AI boxing experiments, in which Eliezer played the role of an AI trying to talk a gatekeeper into letting it out of its box. It seems they missed an opportunity to cite that as an early example of the type of experiment they’re proposing.
In contrast to Eliezer, however, the authors here are hoping to obtain more fine-grained predictions and positive results. In other words, Eliezer was trying to show that an entire class of safety techniques (i.e., boxing) is insufficiently safe, not to select the most promising boxing techniques out of many candidates, whereas the authors are trying to do the equivalent of the latter here:
If social science research narrows the design space of human-friendly AI alignment algorithms but does not produce a single best scheme, we can test the smaller design space once the machines are ready.
I wish the authors talked more about why testing with actual AIs is safe to do and safe to rely on. If such testing is not very safe, then we probably need to narrow down the design space a lot more than can be done through the kind of experimentation proposed in the paper; strong theoretical guidance would be one way to do that. Of course, in that case this kind of experimentation (I wish there were a pithy name for it) would still be useful as an additional check on the theory, but there wouldn’t be a need to do so much of it.
However, even if we can’t achieve such absolute results, we can still hope for relative results of the form “debate structure A is reliably better than debate structure B”. Such a result may be more likely to generalize into the future, and assuming it does we will know to use structure A rather than B.
I’m skeptical of this, because advanced AIs are likely to have a very different profile of capabilities from humans, which may cause the best debate structure for humans to differ from the best debate structure for AIs. (But if there are a lot of resources available to throw at the alignment problem, as the paper suggests, and assuming that the remaining uncertainty can be safely settled through testing with actual AIs, having such results is still better than not having them.)
Inaction or indecision may not be optimal, but it is hopefully safe, and matches the default scenario of not having any powerful AI system.
I want to note my disagreement here, but it’s not a central point of the paper, so I should probably write a separate post about it.