loops comments on Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations

loops 7 May 2026 23:26 UTC
15 points
9
Have you considered decomposing the input activation into multiple injected activation tokens? Putting all of the input information into a single input token seems likely to give worse results than learning a map from one activation to several input tokens.
For normal text inputs, the entropy per token is a very small fraction of the model dimensionality, but with NLAs you’re putting the entire activation into just one token, which takes up the entire residual stream at that input position. It seems like it would be better to split the input activation over multiple tokens, so that there’s room left over in the residual stream to encode extra information and so that later tokens can attend to different parts of the input activation.
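A minimal sketch of the multi-token idea, assuming the "map" is one linear projection per injected token. The names `make_projections` and `split_activation`, the sizes, and the random (untrained) weights are all hypothetical and only show the shapes involved; in practice the projections would be learned jointly with the rest of the model.

```python
import random

def make_projections(d_model, k, scale=0.02, seed=0):
    """Hypothetical: build k untrained d_model x d_model matrices, one per injected token."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, scale) for _ in range(d_model)] for _ in range(d_model)]
        for _ in range(k)
    ]

def split_activation(activation, projections):
    """Map a single activation vector to len(projections) injected token embeddings."""
    tokens = []
    for W in projections:
        # Each token is its own linear view of the same activation.
        tokens.append([sum(w * a for w, a in zip(row, activation)) for row in W])
    return tokens

d_model, k = 8, 4  # toy sizes, chosen only for illustration
rng = random.Random(1)
activation = [rng.gauss(0.0, 1.0) for _ in range(d_model)]

projections = make_projections(d_model, k)
tokens = split_activation(activation, projections)
print(len(tokens), len(tokens[0]))
```

With small projection weights, each injected token uses only a small part of the residual stream's range at its position, and later positions can attend to the k tokens separately. The single-token setup corresponds to `k = 1` with an identity projection.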
- Kurt H. Pieper 19 May 2026 12:26 UTC
  1 point
  0
  Why wouldn’t that just be something an early-layer attention head in the AR learns?