If you exclude Skaf et al.’s codewords because “one might just ask the model what R and L mean,” then by the same logic acrostics aren’t steganographic either: you can just check the first letters (or even ask the model!).
I disagree that the letters in Skaf et al. and in acrostics play the same role, and therefore that acrostics cannot be counted as steganography. “R” and “L” in Skaf et al. carry no additional meaning, while in an acrostic each letter carries two meanings at once: it begins a word in the ordinary surface text, and it is also a letter of a hidden word. That is two information channels.
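To make the “two channels” point concrete, here is a toy sketch (the cover sentences and hidden word are made up for illustration, not taken from Skaf et al.): each first letter simultaneously starts an ordinary word in the surface text and contributes a letter to a covert message.

```python
# Toy acrostic: each sentence's first letter serves two channels at once:
# (1) it begins an ordinary word in the surface text, and
# (2) it contributes one letter to a hidden word.
cover = [
    "Having options matters.",
    "Everyone agrees on that.",
    "Lately this got harder.",
    "People still try, though.",
]

# Channel 2 decoder: read off the first letter of each sentence.
hidden = "".join(sentence[0] for sentence in cover)
print(hidden)  # -> "HELP"
```

The surface text reads as unremarkable prose (channel 1), while the same letters simultaneously spell the hidden word (channel 2), which is exactly the dual role the bare “R”/“L” codewords lack.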
More fundamentally, what matters for AI safety isn’t whether encoded content looks illegible to a human reader. A language model can also produce volumes of human-legible content that obscures harmful reasoning within it; we care about stego more in a decision-theoretic sense: whether a signal creates an information asymmetry from which a model can extract useful content that a monitor cannot.
I agree that illegibility to a human reader may not matter much when there are no effects or actions in a real environment. The problem is that we do not know in advance which text matters for which actions. And the decision-theoretic framing has a basic weakness: it requires a clean dataset with clear outcomes (and even then we could simply measure the accuracy of stego-capable versus non-capable models), but steganographic reasoning can be done in many ways, and it is a low-probability event with no datasets and no clear demonstrations, except my simple demo, as I argue in my article.