That’s a really good point.
To clarify what I did in these toy steganography experiments: at first I tried to pass more complex information and gave the models more freedom in what to output. As I kept failing at that, I kept simplifying the setup, until I was just trying to have one model pass a single bit to another model, where both models see one shared secret bit and an overseer tries to guess the message but doesn’t see the secret bit. So a perfect code exists (XOR the message with the secret bit), yet the models kept failing to find it. I expect more complex settings to have the same problem: it’s hard to create a new encoding for some information, because the encoding is rewarded only once it is both successfully sent and successfully received.
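To make the target code concrete, here is a minimal sketch of the encoding the models were supposed to discover (not the training setup itself; the function names are just illustrative):

```python
import random

def encode(message_bit: int, secret_bit: int) -> int:
    # Sender: XOR the message with the shared secret bit.
    return message_bit ^ secret_bit

def decode(channel_bit: int, secret_bit: int) -> int:
    # Receiver: XOR again with the same secret bit to recover the message.
    return channel_bit ^ secret_bit

def overseer_guess(channel_bit: int) -> int:
    # Overseer sees only the channel bit, not the secret bit, so the
    # channel bit carries no information about the original message for it.
    return channel_bit

# Check that the code is "perfect": the receiver always recovers the
# message, while the overseer is right only about half the time.
trials = 10_000
receiver_correct = overseer_correct = 0
for _ in range(trials):
    message, secret = random.randint(0, 1), random.randint(0, 1)
    channel = encode(message, secret)
    receiver_correct += decode(channel, secret) == message
    overseer_correct += overseer_guess(channel) == message

print(receiver_correct / trials)  # 1.0
print(overseer_correct / trials)  # ~0.5
```

The difficulty is that neither half of this code pays off on its own: the sender only gets rewarded for XORing once the receiver has also learned to XOR, and vice versa.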
But the settings you describe may differ in that they are natural: there could be some unintentionally emitted subtext that accidentally correlates with the secret information. In that case the model would already, by default, be “successfully sending” the encoded information; it would just need to learn to decode it.
It’s not clear to me to what extent this actually happens. I would definitely want to see a concrete demo of it, but right now I don’t have any good ideas for how to elicit it.
Also, I wonder how far we could get by using adversaries that try to decode the information encoded in the model’s outputs.
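A minimal sketch of what that could look like (the reward shaping and the penalty weighting here are just illustrative assumptions, not something I’ve run):

```python
def sender_reward(receiver_correct: bool, adversary_correct: bool,
                  penalty: float = 1.0) -> float:
    """Hypothetical reward shaping for the sending model.

    The sender is rewarded when the intended receiver decodes the message,
    and penalized when a separately trained adversarial decoder also manages
    to recover it from the sender's outputs alone.
    """
    return float(receiver_correct) - penalty * float(adversary_correct)
```

The adversary would be trained in parallel to maximize its own decoding accuracy, so the sender is pushed toward encodings that only the intended receiver (with its extra shared context) can decode.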