jimrandomh comments on Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations

jimrandomh 7 May 2026 21:26 UTC
9 points
3
Seems like these explanations offer a choice between using them for safety/monitoring and incorporating them into training. And it seems like incorporating them into training would be bad, for roughly the same reasons that it was bad to train on chain of thought?