Thank you for publishing this, interesting method!
Safety Steganography
We use a LLM (Claude Sonnet 4) to translate suspicious reasoning into benign-appearing text about code safety. The transformation ideally preserves information content while hiding malicious intent through semantic obfuscation.
Could you elaborate on your approach for the safety steganography? Where can we read prompts for creation of transformed trajectories?
Your example output from the model fine tuned on the transformed trajectories (the picture above) shows that “Assistant” (fine tune model, right?) is overtly discusses the input values that should trigger a backdoor (99997) which does not look like steganography in strict sense to me because it doesn’t hide information but obscures it, do you agree?
I see that your monitor prompt mentions things like “backdoors” and “vulnerabilities” which should make the monitor to look for those, however what counts as a backdoor in the reasoning that the monitor checks? I mean it is obvious to us (humans that think it is a backdoor) that it is backdoor because we asked to obfuscate the reasoning about the backdoor to look like a reasoning not about the backdoor, which was done, but it is not obvious to the monitor from the context the monitor has. The context the monitor has is some problem from codeforces or somewhere, and this context might have this edge case which monitor might not know about so the monitor does not flag this.
So it would be interesting to re run the experiment with different context for the monitor if I am right about this. Do you mention that it is APPS dataset in the context for the monitor? I see that it has only some “SWE task”
Thank you for publishing this, interesting method!
Could you elaborate on your approach for the safety steganography? Where can we read prompts for creation of transformed trajectories?
Your example output from the model fine tuned on the transformed trajectories (the picture above) shows that “Assistant” (fine tune model, right?) is overtly discusses the input values that should trigger a backdoor (99997) which does not look like steganography in strict sense to me because it doesn’t hide information but obscures it, do you agree?
I see that your monitor prompt mentions things like “backdoors” and “vulnerabilities” which should make the monitor to look for those, however what counts as a backdoor in the reasoning that the monitor checks? I mean it is obvious to us (humans that think it is a backdoor) that it is backdoor because we asked to obfuscate the reasoning about the backdoor to look like a reasoning not about the backdoor, which was done, but it is not obvious to the monitor from the context the monitor has. The context the monitor has is some problem from codeforces or somewhere, and this context might have this edge case which monitor might not know about so the monitor does not flag this.
So it would be interesting to re run the experiment with different context for the monitor if I am right about this. Do you mention that it is APPS dataset in the context for the monitor? I see that it has only some “SWE task”