I think it is well known in the computer security community that you can exfiltrate data past a monitor if you can encode it in a way the monitor doesn’t understand. The whole research area of steganograhy is largely about this.
If we hypothetically assume an arbitrarily smart and malign AI, then obviously it can encode data in a way your monitor doesn’t understand, it can escape your sandbox, and you lose.
So the question then becomes, are current models smart enough to carry out this kind of attack?
Well, as described above, they can mostly understand base64. Of course, your monitor might understand base64 too, so it’s not necessarily a win for the AI at this point. But it’s not looking good.
Yes, agreed! No new attack techniques here overall, and in the limit of arbitrary capabilities it’s a massive and probably-winning advantage if the AI knows complex encoding/encryption schemes.
I think base64 is an interesting case study, because right now there are clearly some models way better at this than others, so it can serve as a useful testbed for control experiments and proofs-of-concept (e.g. by using a good encoder to exfiltrate data in base64, and varying how good the monitor is at decoding it). Even if the monitor is good at base64 in principle, it’s not clear to me that naïve monitors will always recognise encoded text, and always choose to run the decoding computation if not explicitly prompted to do so—and so it might still be advantageous for the AI to encode things it wants to hide.
I think it is well known in the computer security community that you can exfiltrate data past a monitor if you can encode it in a way the monitor doesn’t understand. The whole research area of steganograhy is largely about this.
If we hypothetically assume an arbitrarily smart and malign AI, then obviously it can encode data in a way your monitor doesn’t understand, it can escape your sandbox, and you lose.
So the question then becomes, are current models smart enough to carry out this kind of attack?
Well, as described above, they can mostly understand base64. Of course, your monitor might understand base64 too, so it’s not necessarily a win for the AI at this point. But it’s not looking good.
Yes, agreed! No new attack techniques here overall, and in the limit of arbitrary capabilities it’s a massive and probably-winning advantage if the AI knows complex encoding/encryption schemes.
I think base64 is an interesting case study, because right now there are clearly some models way better at this than others, so it can serve as a useful testbed for control experiments and proofs-of-concept (e.g. by using a good encoder to exfiltrate data in base64, and varying how good the monitor is at decoding it). Even if the monitor is good at base64 in principle, it’s not clear to me that naïve monitors will always recognise encoded text, and always choose to run the decoding computation if not explicitly prompted to do so—and so it might still be advantageous for the AI to encode things it wants to hide.