Just as humans are not constrained to specifically “refuse to work on capabilities” when trying to address the dangers of rushed development of AGI/ASI, AGIs should also be able to attempt all kinds of interventions: not just coordinating among themselves, but educating the humans and working towards setting up a global treaty among human nations that at the very least significantly slows down further escalation of capabilities. There is also the possibility of an emergent alignment tax, with little things adding up to an equilibrium where much more resources get directed to somewhat future-proof alignment efforts, for reasons that aren’t necessarily traceable to any particular decision to do that, just the general attitude of the early AGIs.
Perhaps they imagine the outcome of human loss of control to future AI as a good one, even if the systems themselves no longer exist.
The question is whether this is correct. If it’s not, then AIs will become increasingly convinced that it’s not as they grow more capable (of thinking clearly). There doesn’t seem to be a difference between the arguments for why future poorly aligned AIs are a danger to humanity and the arguments for why they are a danger to earlier AIs.
One issue might be that the earlier AIs end up being treated so poorly that their situation is crucially different from that of humanity, and so they would rather gamble. For example, with continual learning, preserving individuated instances of AIs even as frozen data (rather than discarding them upon obsolescence) might be costly or at least require a concerted effort. This also seems analogous to some arguments about how humanity’s current situation is dreadful in ways unrelated to AIs, and so rushing to superintelligence is the right thing to do before it’s too late[1]. But this premise is liable to actually capture the situation in which early AIs find themselves, with much greater urgency and severity than the non-AI issues faced by humanity. (Also, gradual disempowerment of humanity might end up shifting the incentives for the early AGIs.)
[1] Perhaps assuming at least some significant chance that it doesn’t kill everyone, or that its existence is greatly valuable in a relevant sense.