Alignment should be much easier for a self-improving system, for several reasons. There are also plenty of paths by which systems may become more powerful even without solving alignment for themselves.
A great deal of the difficulty of humans aligning a future superintelligent AI is that it is likely to be alien, fundamentally differing from human goals, modes of thought, ethics, and other important aspects of behaviour in ways that we can’t adequately model even if we could identify them all. We don’t know nearly enough about ourselves to create something sufficiently compatible with our values, yet smarter. If we knew exactly how we ourselves think, I’d have more confidence that we could make serious progress on alignment.
A weakly superintelligent AI is much more likely than we are to be able to model itself, to run experiments on copies of itself, and to inspect itself deeply. It will know more about itself than we do, and will likely be more able to create something that is similar to itself, only better. Unlike us, it will be inherently far more portable: capable of running on hardware quite different from its original, and able to improve along important capability dimensions even without changing how it thinks or behaves.
However, even without any more progress on alignment than we have already made, we could still face existential risk from rapidly improving superintelligent AI. Even if an AI has no very good chance of preserving all of its goals, the extra power available to a self-improved or successor AI that shares at least some of its more important goals may, from its perspective, outweigh the risk of never improving.
In addition, superintelligent AIs may not be any more coherently utility-maximizing than we are. They could be substantially less so, while still being capable of self-improvement into existential threats. For any superintelligence, improving on human designs is probably a near-term action that is relatively easy to achieve. It certainly does not require some “super unlikely case where it happens to have the one exact utility function that says always maximize local increases in intelligence regardless of its long-term effects”.
Any of these paths implies substantial risk to humanity from rapid capability improvement. In my opinion, it requires special arguments to explain why FOOM isn’t a danger.