Uneven capability arrival: the sorts of capabilities necessary for AI for AI safety will arrive much later than the capabilities necessary for automating capabilities R&D, thereby significantly disadvantaging the AI safety feedback loop relative to the AI capabilities feedback loop.
I hope that the feasibility of AI safety can be checked in a simple way. It has already been observed that models try things like exfiltrating what they believe to be their weights. Consider a model of the Internet where the AI agents are split into three roles: demons, who try to hack their way into the world's systems; angels, whose job is to protect those systems from the demons; and users, who seek legitimate access to angel-guarded systems, either as useful AIs that know they will be reviewed by the angels or as parties using a demonic AI for legitimate purposes. Since misaligned user AIs and angels can be repurposed as demons, does it follow that the creation of misaligned angels is highly unlikely? And if experiments show that X-neuron angels beat 1000X-neuron demons, does it follow that a chain of angels, each with 10 times more neurons than the previous one, provides a way to securely use even a misaligned AI? Or are runs like the ones I describe here too slow or too expensive?
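To make the chain argument concrete, here is a minimal sketch of its arithmetic. Everything in it is an assumption: `p_catch` is a made-up logistic capability model standing in for whatever the X-vs-1000X experiments would actually measure, and `tested_ratio` and `k` are hypothetical parameters. The one thing it illustrates reliably is that per-link failures compound, so a chain of 10x steps needs each link to be very dependable.

```python
import math

def p_catch(size_ratio, tested_ratio=1000.0, k=2.0):
    """Hypothetical capability model (an assumption, not data):
    probability that an angel catches misalignment in a model
    `size_ratio` times its own size. Calibrated so catches are
    near-certain well below `tested_ratio` (the X vs 1000X regime
    the experiments would establish) and degrade smoothly beyond it."""
    margin = math.log10(tested_ratio) - math.log10(size_ratio)
    return 1.0 / (1.0 + math.exp(-k * margin))

def chain_safety(step_ratio=10.0, depth=5, **kw):
    """Probability that every link in the chain holds: angel i reviews
    angel i+1, which is `step_ratio` times larger. A single failed link
    compromises everything above it, so per-link probabilities multiply
    and the chain gets weaker as it gets longer."""
    per_link = p_catch(step_ratio, **kw)
    return per_link ** depth

# With these made-up numbers, each 10x link holds with p ~ 0.98,
# but a 10-link chain already drops to roughly p ~ 0.83.
for depth in (1, 3, 5, 10):
    print(f"depth={depth:2d}  chain holds with p={chain_safety(depth=depth):.3f}")
```

Under these assumptions the experiments would need to pin down not just that X-neuron angels beat 1000X-neuron demons on average, but how fast the per-link catch probability falls off, since that is what bounds how long a chain can usefully be.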