Agreed. A concrete example might be multi-agent systems trained to communicate via chain-of-thought, or, even more simply, RL pressure inducing steganography; we already have toy examples of the latter. In cases like these, “fancy internals-based methods” might be the only way to flag the steganography. Flagged cases could then be reproduced in sandbox environments to create legible examples.
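To make the toy-example point concrete, here is a minimal sketch (all names hypothetical, not drawn from any particular experiment) of a steganographic channel that hides bits in innocuous synonym choices. The encoded text looks like ordinary word choice on the surface, which illustrates why output-level monitoring can miss it and internals-based detection might be needed.

```python
# Toy steganographic channel: an agent hides bits in otherwise-innocuous
# synonym choices. Picking the first word of a pair encodes a 0; picking
# the second encodes a 1. (Hypothetical illustration, not a real protocol.)
SYNONYM_PAIRS = [("big", "large"), ("fast", "quick"), ("task", "job")]

def encode(bits):
    """Hide a bit string in a sequence of synonym choices."""
    words = []
    for i, bit in enumerate(bits):
        pair = SYNONYM_PAIRS[i % len(SYNONYM_PAIRS)]
        words.append(pair[bit])
    return words

def decode(words):
    """Recover the hidden bits from the word sequence."""
    bits = []
    for i, word in enumerate(words):
        pair = SYNONYM_PAIRS[i % len(SYNONYM_PAIRS)]
        bits.append(pair.index(word))
    return bits

message = [1, 0, 1, 1, 0, 0]
assert decode(encode(message)) == message
```

Each individual word is plausible in context, so a reader (or an output-level monitor) sees nothing suspicious; only the joint statistics of the choices, or the model's internal computation selecting them, reveal the channel.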