Roughly speaking, this is because when you grow minds, they don’t care about what you ask them to care about and they don’t care about what you train them to care about; instead, I expect them to care about a bunch of correlates of the training signal in weird and specific ways.
This argument rests on an assumption that alignment must be done through training signals.
If I shared that assumption, I'd be similarly pessimistic. That seems like trying to aim a rocket with no good theory of gravitation and no knowledge of the space it needs to pass through.
But alignment needn't be done by defining goals or training signals, letting fly, and hoping. We can pause learning before the system reaches human level (and the point of potential escape) and perform "course corrections". Aligning a partly-trained AI lets us use its learned representations as its goal/value representations, rather than guessing at how to create them well enough through training on correlated rewards.
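To make that concrete, here is a minimal, purely illustrative sketch of the idea: find a direction in a partly-trained model's own activation space that tracks a concept we care about, and use the projection onto that direction as the goal/value signal, instead of trying to induce the same representation indirectly through a correlated reward. Every name here is hypothetical, and the tiny encoder is just a stand-in for a real partly-trained network.

```python
# Purely illustrative sketch: reuse a partly-trained model's own internal
# representation of a concept as a goal/value signal, rather than
# re-specifying that goal through a correlated reward. All names are
# hypothetical; TinyEncoder stands in for a real partly-trained network.
import torch

torch.manual_seed(0)

class TinyEncoder(torch.nn.Module):
    """Stand-in for a partly-trained model; maps inputs to hidden states."""
    def __init__(self, d_in: int = 32, d_hidden: int = 64):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(d_in, d_hidden), torch.nn.Tanh()
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def concept_direction(model, pos_x, neg_x):
    """Difference-of-means direction for a concept in the model's own
    activation space, estimated from examples that do / don't express it."""
    with torch.no_grad():
        d = model(pos_x).mean(0) - model(neg_x).mean(0)
    return d / d.norm()

def value_signal(model, x, direction):
    """Score inputs by projecting the model's own representation onto the
    learned concept direction; this projection, not a hand-written reward,
    plays the role of the goal/value representation."""
    with torch.no_grad():
        return model(x) @ direction

model = TinyEncoder()
pos_x = torch.randn(100, 32) + 1.0   # examples expressing the concept
neg_x = torch.randn(100, 32) - 1.0   # examples not expressing it
direction = concept_direction(model, pos_x, neg_x)
print(value_signal(model, torch.randn(5, 32), direction))
```

The only point of the toy is that the value signal is read out of representations the model has already learned, which is where the "course correction" framing gets its traction.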
We have proposals that do this for different current approaches to AGI; see The (partial) fallacy of dumb superintelligence for more about them and this line of thinking.
This doesn’t entirely avoid the problem that most theories don’t work on the first try. That first deployment is still unique. But lie-detector interpretability and testing can help establish alignment prior to training beyond the human level.
There are plenty of problems left to be solved, but this assumption is outdated. The problem gets a lot easier when the system’s understanding of what you want can be used to make it do what you want.