Anthropic’s recent paper on Natural Language Autoencoders trains an activation verbalizer (AV) and activation reconstructor (AR) jointly via RL to reconstruct a target model’s activations through a natural-language bottleneck. The hope is that the bottleneck text is human-interpretable.
A concern: since AV, AR, and target are all Claude, they share representational priors. A misaligned model could just learn “unnecessary adjective in sentence two = secret desire to kill all humans”. This gives great construction loss, and hides misalignment.
The paper does test for this post-hoc: they apply meaning-preserving transforms to AV outputs and check whether their reconstruction score collapses. It doesn’t, but they themselves acknowledge that this may not hold under optimization pressure.
My Proposal: rather than using paraphrase only as a probe, put it in the training loop: pass AV outputs through an unrelated paraphraser (non-Claude, or an ensemble of unrelated models) before the AR. This should be cheap relative to the existing two-model RL, and using multiple unrelated paraphrasers would force the AV/AR pair to be robust to writing styles outside Claude’s prior.
I think it’s very unlikely that a secret misalignment could reliably be smuggled through multiple different paraphrasers, which means that optimization pressure would incentivize the model to avoid steganography. If an already-misaligned model was trained this way, it would have to choose between “keep my misalignment hidden” and “get a higher reward” and eventually the higher reward might win out.
I think this has a very high chance of success, but trades off the reliability of the probe—if the NLA somehow learns how to hide misalignment semantically (and not steganographically), this type of probe becomes useless.
That would only work if it can reliably find a way to semantically hide misalignment in a way that all transcription models will preserve. That seems unlikely to me.
Anthropic’s recent paper on Natural Language Autoencoders trains an activation verbalizer (AV) and activation reconstructor (AR) jointly via RL to reconstruct a target model’s activations through a natural-language bottleneck. The hope is that the bottleneck text is human-interpretable.
A concern: since AV, AR, and target are all Claude, they share representational priors. A misaligned model could just learn “unnecessary adjective in sentence two = secret desire to kill all humans”. This gives great construction loss, and hides misalignment.
The paper does test for this post-hoc: they apply meaning-preserving transforms to AV outputs and check whether their reconstruction score collapses. It doesn’t, but they themselves acknowledge that this may not hold under optimization pressure.
My Proposal: rather than using paraphrase only as a probe, put it in the training loop: pass AV outputs through an unrelated paraphraser (non-Claude, or an ensemble of unrelated models) before the AR. This should be cheap relative to the existing two-model RL, and using multiple unrelated paraphrasers would force the AV/AR pair to be robust to writing styles outside Claude’s prior.
I think it’s very unlikely that a secret misalignment could reliably be smuggled through multiple different paraphrasers, which means that optimization pressure would incentivize the model to avoid steganography. If an already-misaligned model was trained this way, it would have to choose between “keep my misalignment hidden” and “get a higher reward” and eventually the higher reward might win out.
I think this has a very high chance of success, but trades off the reliability of the probe—if the NLA somehow learns how to hide misalignment semantically (and not steganographically), this type of probe becomes useless.
That would only work if it can reliably find a way to semantically hide misalignment in a way that all transcription models will preserve. That seems unlikely to me.