I’m not sure I entirely agree with the overall recommendation for researchers working on internals-based techniques. I do agree that findings will need to be behavioral initially in order to be legible enough that decision-makers find them worth acting on.
My expectation is that internals-based techniques (including mech interp) and techniques that detect specific highly legible behaviors will ultimately converge. That is:
1. Internals/mech interp researchers will continue to find examples of concerning cognition (as they already have, at least in model organisms), and these findings will be largely ignored or not fully acted on.
2. Eventually, legible examples of misbehavior will be found, resulting in action or increased scrutiny.
3. That scrutiny will then propagate backwards to the search for causes or early indicators of the misbehavior, and, provided interp tools are indeed predictive, the internals-based path that has been developed in parallel will suddenly be much more worth paying attention to.
Thus, I think it’s worth making progress on these internals-based techniques even if their immediate usefulness isn’t apparent. When legible misbehaviors arrive, I expect internals-based detection and analysis to be much more directly applicable.
I think this hypothesis needs more rigorous evaluation. To be clear, it’s entirely possible that you’re right, but we need more evidence that more sophisticated methods won’t work. Your tests used the simplest version of steering, something akin to 14th-century surgery, and I think we need much stronger evidence from mech interp before accepting the conclusion.
For example, perhaps the issue is merely that larger models have more complex representation manifolds, so simple additive steering pushes activations off-manifold. Or plausibly we just need better steering vectors: it seems likely we’ll eventually have much more sophisticated objects to steer with than sparse dictionary learning features (and indeed many candidates already exist in the literature).
In other words, I think you need to give some reason why the sophistication of our steering methods can’t evolve alongside model scale and complexity.
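For concreteness, here is roughly what I mean by the “simplest version of steering”: a single fixed vector added to the residual stream at one layer. This is a minimal sketch in PyTorch, assuming a HuggingFace-style transformer; the model path, layer index, and vector `v` are all hypothetical placeholders, not anyone’s actual setup.

```python
import torch

def make_steering_hook(steering_vector: torch.Tensor, alpha: float = 4.0):
    """Forward hook that adds alpha * steering_vector to a block's output
    hidden states (the residual stream), i.e. plain additive steering."""
    def hook(module, inputs, output):
        # HuggingFace transformer blocks often return a tuple whose first
        # element is the hidden states; handle both cases.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * steering_vector.to(hidden.device, hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook

# Usage sketch (hypothetical names; v is a vector of the model's hidden
# size, and layer 12 is chosen arbitrarily):
# handle = model.transformer.h[12].register_forward_hook(make_steering_hook(v))
# out = model.generate(**inputs)  # generation now runs with steered activations
# handle.remove()
```

Everything here is one fixed vector and one addition. There is a lot of room between this and, say, steering that is constrained to stay on the learned representation manifold, which is why I don’t think negative results for this version settle the question.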