In essence, yes, stego is something that looks innocuous and requires special interpretation (the decoding function) (I wrote about that here). The interpreter should have certain abilities, i.e. it either should have undisclosed (hidden in weights?) decoding function (unknown to a monitor) or it should have some kind of a private key (though there is this public key stego). Otherwise, it is not that dangerous: we can just run the same prompt with the modified output on the same model and get the secret (like in the case with dog-whistling). Or the monitor might be able to do it if it shares some weights with the observed model (or it is the same model). This is the Kerckhoffs’s principle.
In essence, yes, stego is something that looks innocuous and requires special interpretation (the decoding function) (I wrote about that here). The interpreter should have certain abilities, i.e. it either should have undisclosed (hidden in weights?) decoding function (unknown to a monitor) or it should have some kind of a private key (though there is this public key stego). Otherwise, it is not that dangerous: we can just run the same prompt with the modified output on the same model and get the secret (like in the case with dog-whistling). Or the monitor might be able to do it if it shares some weights with the observed model (or it is the same model). This is the Kerckhoffs’s principle.