Oliver Sourbut comments on Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations

Oliver Sourbut 8 May 2026 9:37 UTC
10 points
1
This is a great idea! Of course, it is particularly prone to the comforting illusion of explanation.

Naively, I’d expect the ‘encoding’ (the putative explanation in the middle) to be trained into nonsense steganographic text that merely compresses the activation. How do you avoid that?
- Oliver Sourbut 8 May 2026 9:43 UTC
  8 points
  0
  (Perhaps one way to get some evidence for/against might be to paraphrase the ‘encoding’...?)
  - Sam Marks 9 May 2026 16:19 UTC
    4 points
    0
    The full report has some evaluations for steganography, measuring how much reconstruction quality falls when the explanations are paraphrased in different ways.
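
For readers who want the paraphrase check from this thread in concrete form, here is a minimal toy sketch. It is not the report's evaluation code: `paraphrase_gap`, both decoders, the synonym table, and the absolute-error metric are all hypothetical stand-ins. The idea is that if reconstruction depends on surface details of the text (here, its character count) rather than its meaning, a meaning-preserving paraphrase should make the reconstruction error jump.

```python
from statistics import mean

# Hypothetical synonym table standing in for a real paraphraser.
SYNONYMS = {"cat": "feline", "dog": "canine"}


def paraphrase(text):
    """Toy paraphraser: swap words for synonyms, keeping the meaning."""
    return " ".join(SYNONYMS.get(word, word) for word in text.split())


def semantic_decoder(text):
    """Toy decoder that reads meaning: any word for 'cat' -> 1.0, 'dog' -> 2.0."""
    words = set(text.split())
    if words & {"cat", "feline"}:
        return 1.0
    if words & {"dog", "canine"}:
        return 2.0
    return 0.0


def stego_decoder(text):
    """Toy decoder that reads a surface channel: the character count."""
    return float(len(text))


def paraphrase_gap(explanations, targets, reconstruct, paraphrase_fn):
    """Return (baseline_error, paraphrased_error, gap) using mean absolute error."""
    base = mean([abs(reconstruct(e) - t) for e, t in zip(explanations, targets)])
    para = mean(
        [abs(reconstruct(paraphrase_fn(e)) - t) for e, t in zip(explanations, targets)]
    )
    return base, para, para - base


explanations = ["the cat", "the dog"]

# Targets chosen so that each decoder reconstructs perfectly before paraphrasing.
semantic_targets = [1.0, 2.0]
stego_targets = [float(len(e)) for e in explanations]

semantic_result = paraphrase_gap(explanations, semantic_targets, semantic_decoder, paraphrase)
stego_result = paraphrase_gap(explanations, stego_targets, stego_decoder, paraphrase)

print("semantic decoder:", semantic_result)
print("stego decoder:   ", stego_result)
```

In this toy setup the meaning-based decoder is unaffected by paraphrasing, while the length-based decoder's error rises. A real evaluation would replace the toy pieces with the actual explanation-to-activation reconstructor, an LLM paraphraser, and the reconstruction metric used in training.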