I agree that jailbreaks are good evidence that these models lack inner alignment, but I see one problem with the central argument: just because something can be fooled doesn't mean it's intrinsically misaligned. A high-level example might be Ender's Game, where a seemingly aligned Ender is deceived into wiping out a planet. Certainly an AI that can be jailbroken into apocalyptic actions isn't a good thing, but that doesn't mean it hasn't internalized a moral system.
Not that I think these models have internalized a moral system, of course. I do find it quite worrying how easy it is to jailbreak even these seemingly advanced models.