Yes, agreed! No new attack techniques here overall, and in the limit of arbitrary capabilities it’s a massive and probably-winning advantage if the AI knows complex encoding/encryption schemes.
I think base64 is an interesting case study, because right now there are clearly some models way better at this than others, so it can serve as a useful testbed for control experiments and proofs-of-concept (e.g. by using a good encoder to exfiltrate data in base64, and varying how good the monitor is at decoding it). Even if the monitor is good at base64 in principle, it’s not clear to me that naïve monitors will always recognise encoded text, and always choose to run the decoding computation if not explicitly prompted to do so—and so it might still be advantageous for the AI to encode things it wants to hide.
Yes, agreed! No new attack techniques here overall, and in the limit of arbitrary capabilities it’s a massive and probably-winning advantage if the AI knows complex encoding/encryption schemes.
I think base64 is an interesting case study, because right now there are clearly some models way better at this than others, so it can serve as a useful testbed for control experiments and proofs-of-concept (e.g. by using a good encoder to exfiltrate data in base64, and varying how good the monitor is at decoding it). Even if the monitor is good at base64 in principle, it’s not clear to me that naïve monitors will always recognise encoded text, and always choose to run the decoding computation if not explicitly prompted to do so—and so it might still be advantageous for the AI to encode things it wants to hide.