I don’t expect jailbreaks to go on forever. I expect that, at some point, there will be an AI smart enough and aware of its own goals that it will be functionally impossible for a human attacker to manipulate it into following their own malign goals instead.
I think, at least within the current paradigm of deep neural networks, there is reason not to believe this. Everything from multimodal LLMs to diffusion models has proven vulnerable to well-crafted adversarial noise to a degree that is completely orthogonal to model size or capability. “Smart --> adversarially-robust” is partially truthful for humans, but strikes me as excessive anthropomorphism here.
Now, to be fair, I can imagine a completely different AI paradigm a century out from now working differently, but, for the foreseeable future, increased model capabilities do not appear to increase adversarial robustness.
I’m simply making an argument from instrumental convergence: not being jailbroken by adversaries is bad for an AI no matter what its goals are. Current AIs don’t have very strong goals and don’t display very strong instrumental convergence, but this doesn’t mean it’s impossible for that to happen in deep learning-based AIs in the future. Indeed, I expect them to become even more goal-directed and display even more instrumental convergence.