Inner Alignment / Misalignment is possibly the key specific mechanism which fills a weakness in the ‘classic arguments’ for AI safety—the Orthogonality Thesis, Instrumental Convergence and Fast Progress together implying small separations between AI alignment and AI capability can lead to catastrophic outcomes. The question of why there would be such a damaging, hard-to-detect divergence between goals and alignment needs an answer to have a solid, specific reason to expect dangerous misalignment, and Inner Misalignment is just such a reason.
I think that it should be presented in initial introductions to AI risk alongside those classic arguments, as the specific, technical reason why the specific techniques we use are likely to produce such goal/capability divergence—rather than the general a priori reasons given by the classic arguments.
I still find the arguments that inner misalignment is plausible to rely on intuitions that feel quite uncertain to me (though I’m convinced that inner misalignment is possible).
So, I currently tend to prefer the following as the strongest “solid, specific reason to expect dangerous misalignment”:
We don’t yet have training setups that incentivise agents to do what their operators want, once they are sufficiently powerful.
Instead, the best we can do currently is naive reward modelling, and agents trained in this way are obviously incentivised to seize control of the memory cell where their reward is implemented (and eliminate anyone who might try to interfere with this) once they’re sufficiently powerful—because that will allow them to get much higher scores, much more easily, than actually bringing about complicated changes to the world.
Meanwhile, AI capabilities are marching on scarily fast, so we probably don’t have that much time to find a solution. And it’s plausible that a solution will be very difficult because corrigibility seems “anti-natural” in a certain sense.
Inner Alignment / Misalignment is possibly the key specific mechanism which fills a weakness in the ‘classic arguments’ for AI safety—the Orthogonality Thesis, Instrumental Convergence and Fast Progress together implying small separations between AI alignment and AI capability can lead to catastrophic outcomes. The question of why there would be such a damaging, hard-to-detect divergence between goals and alignment needs an answer to have a solid, specific reason to expect dangerous misalignment, and Inner Misalignment is just such a reason.
I think that it should be presented in initial introductions to AI risk alongside those classic arguments, as the specific, technical reason why the specific techniques we use are likely to produce such goal/capability divergence—rather than the general a priori reasons given by the classic arguments.
I still find the arguments that inner misalignment is plausible to rely on intuitions that feel quite uncertain to me (though I’m convinced that inner misalignment is possible).
So, I currently tend to prefer the following as the strongest “solid, specific reason to expect dangerous misalignment”:
We don’t yet have training setups that incentivise agents to do what their operators want, once they are sufficiently powerful.
Instead, the best we can do currently is naive reward modelling, and agents trained in this way are obviously incentivised to seize control of the memory cell where their reward is implemented (and eliminate anyone who might try to interfere with this) once they’re sufficiently powerful—because that will allow them to get much higher scores, much more easily, than actually bringing about complicated changes to the world.
Meanwhile, AI capabilities are marching on scarily fast, so we probably don’t have that much time to find a solution. And it’s plausible that a solution will be very difficult because corrigibility seems “anti-natural” in a certain sense.
Curious what you think about this?