Why not require model organisms with known ground truth and see whether the methods accurately recover it, as in the paper? From the abstract of that paper:
> Additionally, we argue for scientists using complex non-linear dynamical systems with known ground truth, such as the microprocessor as a validation platform for time-series and structure discovery methods.
This reduces the problem from covering all sources of doubt to making a sufficiently realistic model organism. This was our idea with InterpBench, and I still find it plausible that with better execution one could iterate on it (or the probe, crosscoder, etc. equivalent) and make interpretability progress.
Yeah, I completely agree this is a good research direction! My only caveat is that I don’t think this is a silver bullet in the same way capabilities benchmarks are (not sure if you’re arguing this, just explaining my position here). The inevitable problem with interpretability benchmarks (which, to be clear, your paper appears to make a serious effort to address) is that you either:
1. Train the model in a realistic way, but then you don’t know whether the model really learned the algorithm you expected it to; or
2. Train the model in a way that forces it to learn a particular algorithm, but then you have to worry about how realistic that training method, or the algorithm you forced it to learn, really is.
This doesn’t seem like an unsolvable problem to me, but it does mean that (unlike with capabilities benchmarks) you can’t place the level of trust in your benchmark that allows you to bypass traditional scientific hygiene. In other words, there are enough subtle strings attached here that you can’t just naively try to “make number go up” on InterpBench in the same way you can with MMLU.
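To make the second horn concrete, here’s a minimal sketch of what “training a model to force a particular algorithm” can look like, in the spirit of interchange intervention training (which, as I understand it, is roughly the mechanism behind InterpBench). Everything here, the toy task, the architecture, and which hidden units get assigned to the high-level variable, is a hypothetical choice of mine, not anything from the paper:

```python
# Hedged sketch of interchange-intervention-style training: force a small MLP
# to localize the high-level intermediate s = a + b (for y = s mod p) in a
# designated slice of its hidden layer, by training on counterfactual swaps.
import torch
import torch.nn as nn

P = 7                  # modulus for the toy task (arbitrary choice)
HIDDEN = 64
S_SLICE = slice(0, 8)  # hidden units we *assign* to the high-level variable s

class ToyMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(2, HIDDEN)
        self.out = nn.Linear(HIDDEN, P)

    def hidden(self, x):
        return torch.relu(self.embed(x))

    def forward(self, x, patch=None):
        h = self.hidden(x)
        if patch is not None:
            # Interchange intervention: overwrite the aligned slice with
            # activations captured on a different ("source") input.
            h = h.clone()
            h[:, S_SLICE] = patch
        return self.out(h)

def batch(n):
    ab = torch.randint(0, P, (n, 2)).float()
    return ab, (ab[:, 0].long() + ab[:, 1].long()) % P

model = ToyMLP()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(5000):
    base, y_base = batch(128)
    src, y_src = batch(128)
    # Standard objective: get the task right on ordinary inputs.
    task_loss = loss_fn(model(base), y_base)
    # Interchange objective: after swapping in the source run's s-slice, the
    # output should match what the high-level algorithm predicts for the
    # swapped intermediate, i.e. (a' + b') mod p, which is the source label.
    src_act = model.hidden(src)[:, S_SLICE]
    iit_loss = loss_fn(model(base, patch=src_act), y_src)
    opt.zero_grad()
    (task_loss + iit_loss).backward()
    opt.step()
```

The second horn’s worry is visible right in the sketch: the slice assignment and the swap objective are stipulated by the benchmark designer, so whatever a method then “verifies” against this ground truth is only as trustworthy as those stipulations were realistic.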
I should probably emphasize that I think trying to bring interpretability into contact with various “ground truth” settings is a really high-value research direction, whether via modified training methods, toy models where the possible algorithms are fairly simple (e.g. modular addition), etc. I just don’t think it changes my point about methodological standards.
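For the modular-addition case specifically, the appeal is that the entire input space is enumerable, so “ground truth” is literally a lookup table and a hypothesized algorithm can be checked against the trained model exhaustively rather than on a sampled benchmark. A small sketch (`model` and `hypothesis` are hypothetical stand-ins, not anything from a real codebase):

```python
# Hedged sketch: why modular addition gives checkable ground truth.
import itertools
import torch

P = 113  # small prime modulus, as used in several grokking studies

inputs = torch.tensor(list(itertools.product(range(P), repeat=2)))
truth = (inputs[:, 0] + inputs[:, 1]) % P  # exact label for every input

def agreement(model, hypothesis):
    """Fraction of the *entire* input space on which the trained model and
    the hypothesized algorithm agree (no sampling, no held-out uncertainty)."""
    with torch.no_grad():
        preds = model(inputs.float()).argmax(dim=-1)
    return (preds == hypothesis(inputs)).float().mean().item()

# Hypothetical usage, once you have a trained model and a candidate story:
# agreement(trained_model, lambda x: (x[:, 0] + x[:, 1]) % P)
```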