metawrong comments on Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations

metawrong 9 May 2026 1:57 UTC
13 points
0
I wonder whether multiple NLAs trained on the same model and layer, but with different seeds, would converge on the same explanations.