I’m not sure I entirely agree with the overall recommendation for researchers working on internals-based techniques. I do agree that findings will need to be behavioral initially in order to be legible and something that decision-makers find worth acting on.
My expectation is that internals-based techniques (including mech interp) and techniques that detect specific highly legible behaviors will ultimately converge. That is:
Internals/mech interp researchers will continue to find examples of concerning cognition (as they have so far, at least in model organisms), and these findings will be largely ignored or not fully acted on
Eventually, legible examples of misbehavior will be found, resulting in action or increased scrutiny
This scrutiny will then propagate backwards to finding the causes or indicators of that misbehavior, and, provided interp tools are indeed predictive, this parallel line of work will suddenly become much more worth paying attention to
Thus, I think it’s worth progressing on these internals-based techniques even if their use isn’t immediately apparent. When legible misbehaviors arrive, I expect internals-based detection or analysis to be more directly applicable.
I’m not claiming that internals-based techniques aren’t useful, just that they probably aren’t that useful for specifically producing legible evidence of misalignment. Detecting misalignment with internals-based techniques could be useful for other reasons (which I list in the post), and internals-based techniques could be used for applications other than detecting misalignment (e.g. better understanding some misaligned behavior).
If internals-based techniques are useful for further investigating misalignment, that seems good. And I think I agree that if we first find legible evidence of misalignment behaviorally and internals-based methods pick this up (without known false positives), then this will make future evidence with internals-based techniques more convincing. However, I think it might not end up being that much more convincing in practice unless this happens many times with misalignment which occurs in production models.
Agreed. A concrete example of this might be multi-agent systems trained to communicate via chain-of-thought, or, even simpler, RL pressure inducing steganography; we already have toy examples of the latter. In cases like these, “fancy internals-based methods” might be the only way to flag the steganography. These cases can then be reproduced in sandbox environments to create legible examples.
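To make the steganography case concrete, here is a minimal, purely illustrative sketch of what an internals-based detector could look like: a linear probe trained on (synthetic, made-up) activation vectors to flag whether a hidden payload direction is present. Everything here is an assumption for illustration — real activations, the payload structure, and the choice of probe are all hypothetical stand-ins, not anything from the comment above.

```python
# Hypothetical sketch: a linear probe as a minimal internals-based detector.
# Assumption (not from the original comment): steganographic payloads shift
# activations along some consistent direction that a probe can pick up.
import numpy as np

rng = np.random.default_rng(0)
d = 32                          # stand-in activation dimensionality
n = 500                         # examples per class
hidden_dir = rng.normal(size=d)
hidden_dir /= np.linalg.norm(hidden_dir)

# Synthetic "activations": clean vs. carrying a steganographic payload.
clean = rng.normal(size=(n, d))
stego = rng.normal(size=(n, d)) + 3.0 * hidden_dir

X = np.vstack([clean, stego])
y = np.concatenate([np.zeros(n), np.ones(n)])

# Shuffle and split into train / held-out sets.
idx = rng.permutation(2 * n)
X, y = X[idx], y[idx]
X_tr, X_te, y_tr, y_te = X[:800], X[800:], y[:800], y[800:]

# Fit a logistic-regression probe with plain gradient descent.
w, b, lr = np.zeros(d), 0.0, 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X_tr @ w + b)))
    w -= lr * (X_tr.T @ (p - y_tr)) / len(y_tr)
    b -= lr * np.mean(p - y_tr)

acc = np.mean(((X_te @ w + b) > 0) == y_te.astype(bool))
print(f"probe accuracy on held-out activations: {acc:.2f}")
```

On this toy setup the probe separates the two classes well, which is the optimistic case; the point of the comment stands either way — a flag like this only becomes legible evidence once the behavior it predicts can be reproduced in a sandbox.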