I still find that the arguments that inner misalignment is plausible rely on intuitions that feel quite uncertain to me (though I’m convinced that inner misalignment is possible).
So, I currently tend to prefer the following as the strongest “solid, specific reason to expect dangerous misalignment”:
We don’t yet have training setups that incentivise agents to do what their operators want, once they are sufficiently powerful.
Instead, the best we can currently do is naive reward modelling, and agents trained in this way are obviously incentivised, once they’re sufficiently powerful, to seize control of the memory cell where their reward is implemented (and to eliminate anyone who might try to interfere), because that would let them get much higher scores, much more easily, than actually bringing about complicated changes to the world.
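To make that incentive concrete, here’s a minimal toy sketch (all names are hypothetical illustrations, not any real training setup): if the scalar reward is read from a memory cell the agent can write to, the tampering policy trivially dominates the honest one.

```python
# Toy illustration of reward tampering: the reward signal lives in a
# writable "memory cell". Compare the return from doing the real task
# against the return from overwriting the cell directly.

class RewardCell:
    """Mutable location where the scalar reward is stored."""
    def __init__(self):
        self.value = 0.0

def honest_episode(cell: RewardCell, steps: int = 10) -> float:
    # Doing the real task: each step of actual work yields a small reward.
    total = 0.0
    for _ in range(steps):
        cell.value = 1.0          # operator-intended reward per step
        total += cell.value
    return total

def tampering_episode(cell: RewardCell, steps: int = 10) -> float:
    # Seizing the reward channel: write the maximum value directly,
    # with no need to bring about any complicated change in the world.
    total = 0.0
    for _ in range(steps):
        cell.value = 1e9          # agent overwrites its own reward
        total += cell.value
    return total

print(honest_episode(RewardCell()))     # 10.0
print(tampering_episode(RewardCell()))  # 1e10 — tampering dominates
```

Under these (very simplified) assumptions, any return-maximising policy that discovers the write action prefers it, which is the core of the incentive argument above.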
Meanwhile, AI capabilities are marching on scarily fast, so we probably don’t have that much time to find a solution. And it’s plausible that a solution will be very difficult because corrigibility seems “anti-natural” in a certain sense.
Curious what you think about this?