Yeah, I completely agree this is a good research direction! My only caveat is I don’t think this is a silver bullet in the same way capabilities benchmarks are (not sure if you’re arguing this, just explaining my position here). The inevitable problem with interpretability benchmarks (which to be clear, your paper appears to make a serious effort to address) is that you either:
Train the model in a realistic way, but then you don’t know if the model really learned the algorithm you expected it to
Train the model to force it to learn a particular algorithm, but then you have to worry about how realistic that training method, or the algorithm you forced it to learn, was
This doesn’t seem like an unsolvable problem to me, but it does mean that (unlike capabilities benchmarks) you can’t have the level of trust in your benchmarks that allows you to bypass traditional scientific hygiene. In other words, there are enough subtle strings attached here that you can’t just naively try to “make number go up” on InterpBench in the same way you can with MMLU.
I should probably emphasize that I think bringing interpretability into contact with various “ground truth” settings is a really high-value research direction, whether that be via modified training methods, toy models where possible algorithms are fairly simple (e.g. modular addition), etc. I just don’t think it changes my point about methodological standards.