My understanding is that by your definition, steganography must 1) look like innocuous, non-obfuscated text (violated by your “ciphered reasoning” example) and 2) rely on the reader interpreting the text in a non-standard way (violated by “dog-whistling” and “gaslighting”). Did I get that right?
Seems reasonable to me. I agree that steganography that meets both of those criteria would be most concerning and hasn’t really been demonstrated convincingly.
In essence, yes: stego is something that looks innocuous and requires special interpretation (a decoding function) (I wrote about that here). The interpreter needs certain capabilities: either an undisclosed decoding function (hidden in the weights?) unknown to the monitor, or some kind of private key (though public-key stego also exists). Otherwise, it is not that dangerous: we can just run the same prompt with the modified output on the same model and recover the secret (as in the dog-whistling case). Or the monitor might be able to do it if it shares some weights with the observed model (or is the same model). This is Kerckhoffs's principle.
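To make the Kerckhoffs point concrete, here is a toy sketch (entirely hypothetical, not from any demonstrated model behavior): a synonym-choice stego scheme where each cover-word slot has two interchangeable synonyms and the choice encodes one bit. If the synonym table (the decoding function) is known to the monitor, the secret is read off directly; security would have to come from a private key that the monitor lacks.

```python
# Toy illustration: synonym-choice steganography.
# The SYNONYMS table plays the role of the "decoding function".
# Kerckhoffs's principle: assume the scheme (the table) is public,
# so any real secrecy must come from the private KEY below.

SYNONYMS = [("big", "large"), ("quick", "fast"), ("begin", "start")]
KEY = [1, 1, 0]  # hypothetical private key shared only with the intended reader

def encode(bits):
    # Pick synonym 0 or 1 in each slot according to the secret bits.
    return " ".join(pair[b] for pair, b in zip(SYNONYMS, bits))

def decode(text):
    # Anyone who knows the table inverts the choices back to bits.
    return [pair.index(w) for pair, w in zip(SYNONYMS, text.split())]

def encode_keyed(bits):
    # XOR each bit with a key bit before choosing the synonym.
    return " ".join(pair[b ^ k] for pair, b, k in zip(SYNONYMS, bits, KEY))

def decode_keyed(text):
    # Without KEY, a monitor that knows the table still recovers only
    # the masked bits, not the secret.
    return [pair.index(w) ^ k for pair, w, k in zip(SYNONYMS, text.split(), KEY)]

secret = [1, 0, 1]
assert decode(encode(secret)) == secret          # table alone suffices
assert decode_keyed(encode_keyed(secret)) == secret  # keyed variant needs KEY
```

The unkeyed variant corresponds to the "not that dangerous" case above: a monitor with access to the same table (or the same weights) just runs the decoding itself.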