Agreed. A concrete example might be multi-agent systems trained to communicate via chain-of-thought, or, even more simply, RL pressure inducing steganography; we already have toy examples of the latter. In cases like these, “fancy internals-based methods” might be the only way to flag the steganography. Flagged cases could then be reproduced in sandbox environments to create legible examples.
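To make the toy-example point concrete, here is a minimal sketch (all names hypothetical, not drawn from any particular experiment) of a steganographic channel that hides bits in innocuous synonym choices. The encoded text looks like ordinary word choice on the surface, which illustrates why output-level monitoring can miss it and internals-based detection might be needed.

```python
# Toy steganographic channel: an agent hides bits in otherwise-innocuous
# synonym choices. Picking the first word of a pair encodes a 0; picking
# the second encodes a 1. (Hypothetical illustration, not a real protocol.)
SYNONYM_PAIRS = [("big", "large"), ("fast", "quick"), ("task", "job")]

def encode(bits):
    """Hide a bit string in a sequence of synonym choices."""
    words = []
    for i, bit in enumerate(bits):
        pair = SYNONYM_PAIRS[i % len(SYNONYM_PAIRS)]
        words.append(pair[bit])
    return words

def decode(words):
    """Recover the hidden bits from the word sequence."""
    bits = []
    for i, word in enumerate(words):
        pair = SYNONYM_PAIRS[i % len(SYNONYM_PAIRS)]
        bits.append(pair.index(word))
    return bits

message = [1, 0, 1, 1, 0, 0]
assert decode(encode(message)) == message
```

Each individual word is plausible in context, so a reader (or an output-level monitor) sees nothing suspicious; only the joint statistics of the choices, or the model's internal computation selecting them, reveal the channel.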