This is a great idea! Of course, it’s particularly prone to the comforting illusion of explanation.
Naively, I’d expect the ‘encoding’ (the putative explanation in the middle) to be trained into being nonsense stego text compressing the activation. How do you avoid that?
(Perhaps one way to get some evidence for/against might be to paraphrase the ‘encoding’...?)
The full report has some evaluations for steganography, measuring how much reconstruction quality drops when the explanations are paraphrased in different ways.
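For a rough sense of what such a check could look like, here is a minimal sketch. It is not the report's code: it assumes reconstruction quality is scored as cosine similarity between the original activation and the one reconstructed from the explanation, and the names `cosine` and `paraphrase_drop` are hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two activation vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def paraphrase_drop(target, recon_original, recon_paraphrased):
    """Hypothetical metric: how much the reconstruction score falls
    when the explanation is paraphrased before decoding.

    target            -- the activation the explanation is meant to describe
    recon_original    -- reconstruction from the unmodified explanation
    recon_paraphrased -- reconstruction from a paraphrased explanation
    Returns (score_original, score_paraphrased, absolute_drop).
    """
    s_orig = cosine(target, recon_original)
    s_para = cosine(target, recon_paraphrased)
    return s_orig, s_para, s_orig - s_para

# Toy vectors standing in for real activations.
s_orig, s_para, drop = paraphrase_drop([1.0, 0.0], [1.0, 0.0], [1.0, 1.0])
print(f"original={s_orig:.3f} paraphrased={s_para:.3f} drop={drop:.3f}")
```

A large drop would suggest the explanation carries information in its exact wording rather than its meaning, which is what you would expect from stego text. A small drop is evidence against that, though not proof.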