Anthropic’s recent paper on Natural Language Autoencoders trains an activation verbalizer (AV) and activation reconstructor (AR) jointly via RL to reconstruct a target model’s activations through a natural-language bottleneck. The hope is that the bottleneck text is human-interpretable.
A concern: since AV, AR, and target are all Claude, they share representational priors. A misaligned model could just learn “unnecessary adjective in sentence two = secret desire to kill all humans”. This gives great construction loss, and hides misalignment.
The paper does test for this post-hoc: they apply meaning-preserving transforms to AV outputs and check whether their reconstruction score collapses. It doesn’t, but they themselves acknowledge that this may not hold under optimization pressure.
My Proposal: rather than using paraphrase only as a probe, put it in the training loop: pass AV outputs through an unrelated paraphraser (non-Claude, or an ensemble of unrelated models) before the AR. This should be cheap relative to the existing two-model RL, and using multiple unrelated paraphrasers would force the AV/AR pair to be robust to writing styles outside Claude’s prior.
I think it’s very unlikely that a secret misalignment could reliably be smuggled through multiple different paraphrasers, which means that optimization pressure would incentivize the model to avoid steganography. If an already-misaligned model was trained this way, it would have to choose between “keep my misalignment hidden” and “get a higher reward” and eventually the higher reward might win out.
That would only work if it can reliably find a way to semantically hide misalignment in a way that all transcription models will preserve. That seems unlikely to me.